[ https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553434#comment-16553434 ]
Apache Spark commented on SPARK-23957:
--------------------------------------
User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/21853
> Sorts in subqueries are redundant and can be removed
> ----------------------------------------------------
>
> Key: SPARK-23957
> URL: https://issues.apache.org/jira/browse/SPARK-23957
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Henry Robinson
> Priority: Major
>
> Unless combined with a {{LIMIT}}, there's no correctness reason that planned
> and optimized subqueries should have any sort operators (since the result of
> the subquery is an unordered collection of tuples).
> For example:
> {{SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>    +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>       +- *(2) Project
>          +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
>             +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>                +- *(1) Project [id#0L]
>                   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered
> subquery:
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
> has plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>    +- *(1) Project [id#0L]
>       +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is
> unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a
> subquery. SPARK-23375 is related, but removes sorts that are functionally
> redundant because they perform the same ordering.
>
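The rule proposed in the description can be sketched on a toy plan tree. This is only an illustration under stated assumptions: the `Node` type, the operator names, and `remove_subquery_sorts` are hypothetical, they are not Spark's Catalyst classes, and nothing here is taken from the pull request. The sketch assumes single-child plans and treats `Project` as the only operator that passes an ordering requirement down from a `Limit`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical single-child plan node; `op` is the operator name."""
    op: str
    child: Optional["Node"] = None


def remove_subquery_sorts(node, in_subquery=False, order_needed=False):
    """Drop Sort nodes inside a subquery unless a Limit above depends on them."""
    if node is None:
        return None
    if node.op == "Subquery":
        # A subquery yields an unordered bag, so no ordering is required yet.
        return Node("Subquery", remove_subquery_sorts(node.child, True, False))
    if node.op == "Sort":
        if in_subquery and not order_needed:
            # Nothing above depends on row order: skip the Sort, keep its input.
            return remove_subquery_sorts(node.child, in_subquery, False)
        # The Sort is kept; the order of its own input does not matter.
        return Node("Sort", remove_subquery_sorts(node.child, in_subquery, False))
    if node.op == "Limit":
        child_needs_order = True          # LIMIT picks rows based on the order
    elif node.op == "Project":
        child_needs_order = order_needed  # assumed order-preserving
    else:
        child_needs_order = False         # e.g. Aggregate ignores input order
    return Node(node.op, remove_subquery_sorts(node.child, in_subquery, child_needs_order))


def ops(node) -> List[str]:
    """Operator names from the root down, for easy comparison."""
    names = []
    while node is not None:
        names.append(node.op)
        node = node.child
    return names


# SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)
counted = Node("Aggregate", Node("Subquery", Node("Sort", Node("Project", Node("Scan")))))
# SELECT count(1) FROM (SELECT id FROM dft ORDER BY id LIMIT 10)
limited = Node("Aggregate", Node("Subquery", Node("Limit", Node("Sort", Node("Scan")))))

print(ops(remove_subquery_sorts(counted)))  # ['Aggregate', 'Subquery', 'Project', 'Scan']
print(ops(remove_subquery_sorts(limited)))  # ['Aggregate', 'Subquery', 'Limit', 'Sort', 'Scan']
```

The first plan mirrors the count example from the description and loses its Sort. The second keeps it, because the LIMIT makes the result depend on the ordering.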