[ https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857015#comment-15857015 ]
Herman van Hovell commented on SPARK-19503:
-------------------------------------------
We could prune sort and distribute operators at some point. They are currently
left intact because in some cases an advanced user wants to force a certain
physical layout. The downside is that when someone does something 'dumb', the
result will be a very bad query plan.
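
To make the trade-off concrete, here is a minimal sketch of what such a pruning
rule might look like on a toy logical plan. This is not Catalyst code: the
Scan, Sort and Count classes, the prune_sorts_under_count function and the
"forced" flag are all hypothetical names invented for illustration, and Spark
has no such flag as far as this thread states. The sketch only assumes that a
count does not depend on row order.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    """Leaf node: read a table (toy stand-in, not a Spark class)."""
    table: str


@dataclass(frozen=True)
class Sort:
    """Global sort of the child on the given columns."""
    child: object
    keys: Tuple[str, ...]
    # Hypothetical flag: the user insists on this physical layout.
    forced: bool = False


@dataclass(frozen=True)
class Count:
    """Row count of the child; the result does not depend on row order."""
    child: object


def prune_sorts_under_count(plan, under_count=False):
    """Drop Sort nodes whose ordering cannot affect the final result.

    A Sort is removed only when it sits below a Count and was not
    marked as forced. A Sort at the root of the plan is kept, because
    its ordering is visible in the output.
    """
    if isinstance(plan, Count):
        return Count(prune_sorts_under_count(plan.child, under_count=True))
    if isinstance(plan, Sort):
        child = prune_sorts_under_count(plan.child, under_count)
        if under_count and not plan.forced:
            return child  # ordering is irrelevant to the count
        return Sort(child, plan.keys, plan.forced)
    return plan  # Scan: nothing to rewrite


if __name__ == "__main__":
    # Toy equivalent of df.sort("a").count()
    print(prune_sorts_under_count(Count(Sort(Scan("t"), ("a",)))))
```

In this toy model the plan for df.sort(...).count() collapses to a count over
the scan, while a sort that is the final operator, or one explicitly marked as
forced, is left alone. A real rule would also have to reason about operators
between the sort and the aggregate (limits, for example) that do depend on
ordering.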
> Execution Plan Optimizer: avoid sort or shuffle when it does not change end
> result such as df.sort(...).count()
> ---------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-19503
> URL: https://issues.apache.org/jira/browse/SPARK-19503
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 2.1.0
> Environment: Perhaps only a pyspark or databricks AWS issue
> Reporter: R
> Priority: Minor
> Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs a shuffle and a sort before counting. This is wasteful, since the
> sort is not required here, and it makes me wonder how smart the algebraic
> optimiser really is. The data may be partitioned with known counts (such as
> parquet files), and we should not shuffle just to perform a count.
> This may look trivial, but if the optimiser fails to recognise this, I wonder
> what else it is missing, especially in more complex operations.