[jira] [Commented] (SPARK-23957) Sorts in subqueries are redundant and can be removed

Apache Spark (JIRA) Mon, 23 Jul 2018 14:43:05 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553434#comment-16553434
 ]


Apache Spark commented on SPARK-23957:
--------------------------------------

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/21853

> Sorts in subqueries are redundant and can be removed
> ----------------------------------------------------
>
>                 Key: SPARK-23957
>                 URL: https://issues.apache.org/jira/browse/SPARK-23957
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Henry Robinson
>            Priority: Major
>
> Unless combined with a {{LIMIT}}, there is no correctness reason for planned 
> and optimized subqueries to contain sort operators, because the result of 
> a subquery is an unordered collection of tuples. 
> For example:
> {{SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>    +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>       +- *(2) Project
>          +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
>             +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>                +- *(1) Project [id#0L]
>                   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered 
> subquery:
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
> has plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>    +- *(1) Project [id#0L]
>       +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is 
> unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a 
> subquery. SPARK-23375 is related, but it removes sorts that are functionally 
> redundant because they repeat an ordering that is already in place.
>   
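
The rule proposed in the description can be illustrated with a toy model. The sketch below is not Catalyst code and not the implementation in the linked pull request: Plan, remove_subquery_sorts, ops, and the operator names are all hypothetical and chosen only for illustration. It drops a sort inside a subquery unless a limit sits somewhere above it, which is a deliberately conservative reading of the "unless combined with a LIMIT" condition.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Plan:
    """Toy single-child logical plan node (hypothetical, not a Spark class)."""
    op: str                          # "Scan", "Project", "Sort", "Limit", ...
    detail: str = ""
    child: Optional["Plan"] = None


def remove_subquery_sorts(plan, in_subquery=False, under_limit=False):
    """Return a copy of the plan without sorts that sit inside a subquery.

    Assumption: a sort is kept whenever any Limit appears above it, because
    the limit's result may depend on the ordering.
    """
    if plan is None:
        return None
    if plan.op == "Sort" and in_subquery and not under_limit:
        # The subquery yields a bag of tuples, so skip the sort node.
        return remove_subquery_sorts(plan.child, in_subquery, under_limit)
    if plan.op == "Subquery":
        in_subquery = True
    elif plan.op == "Limit":
        under_limit = True
    return replace(plan, child=remove_subquery_sorts(plan.child, in_subquery, under_limit))


def ops(plan) -> List[str]:
    """List operator names from the root of the plan down to the leaf."""
    names = []
    while plan is not None:
        names.append(plan.op)
        plan = plan.child
    return names


scan = Plan("Scan", "dft")

# SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)
count_query = Plan("Aggregate", "count(1)",
                   Plan("Subquery", "",
                        Plan("Sort", "id", Plan("Project", "id", scan))))

# SELECT * FROM (SELECT id FROM dft ORDER BY id LIMIT 10)
limited_query = Plan("Project", "*",
                     Plan("Subquery", "",
                          Plan("Limit", "10", Plan("Sort", "id", scan))))

# SELECT id FROM dft ORDER BY id (no subquery, so the sort is user-visible)
top_level_sort = Plan("Sort", "id", scan)

print(ops(remove_subquery_sorts(count_query)))
print(ops(remove_subquery_sorts(limited_query)))
print(ops(remove_subquery_sorts(top_level_sort)))
```

In this model the sort disappears from the first plan and is kept in the other two. A real rule would also need to decide how to treat operators whose results depend on input order; the sketch ignores that question.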



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

