[GitHub] [spark] ankurdave opened a new pull request #30160: [SPARK-33260][SQL] Fix incorrect results from SortExec when sortOrder is Stream

GitBox Tue, 27 Oct 2020 04:30:02 -0700


ankurdave opened a new pull request #30160:
URL: https://github.com/apache/spark/pull/30160



   ### What changes were proposed in this pull request?
   
   The following query produces incorrect results. The query has two essential 
features: (1) it contains a string aggregate, resulting in a `SortExec` node, 
and (2) it contains a duplicate grouping key, causing 
`RemoveRepetitionFromGroupExpressions` to produce a sort order stored as a 
`Stream`.
   
   ```sql
   SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
   FROM table_4
   GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
   ```
   
   When the sort order is stored as a `Stream`, the line 
`ordering.map(_.child.genCode(ctx))` in `GenerateOrdering#createOrderKeys()` 
has unpredictable side effects on `ctx`, because `genCode(ctx)` mutates `ctx`. 
When `ordering` is a `Stream`, the modifications do not happen immediately as 
intended, but instead occur lazily when the returned `Stream` is traversed 
later.
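
   The laziness described above can be reproduced outside Spark. The following 
is a minimal sketch, not Spark code: `FakeCtx` and this `genCode` are 
hypothetical stand-ins for the codegen context and `Expression#genCode`, and 
only mimic "return a value and mutate the context". It assumes a Scala version 
where `Stream` is available (it is deprecated in favor of `LazyList` from 
Scala 2.13 onward).

```scala
import scala.collection.mutable.ArrayBuffer

object LazyStreamDemo {
  // Hypothetical stand-in for the codegen context: it only records mutations.
  final class FakeCtx {
    val registered: ArrayBuffer[String] = ArrayBuffer.empty[String]
  }

  // Hypothetical stand-in for genCode(ctx): returns a value and mutates ctx.
  def genCode(name: String, ctx: FakeCtx): String = {
    ctx.registered += name
    s"code_$name"
  }

  // Returns the number of recorded mutations
  // (after List.map, after Stream.map, after traversing the mapped Stream).
  def run(): (Int, Int, Int) = {
    // A strict collection applies every side effect during map.
    val listCtx = new FakeCtx
    List("a", "b", "c").map(genCode(_, listCtx))
    val afterListMap = listCtx.registered.size

    // A Stream defers the side effects for its tail until it is traversed.
    val streamCtx = new FakeCtx
    val codes = Stream("a", "b", "c").map(genCode(_, streamCtx))
    val afterStreamMap = streamCtx.registered.size

    // Traversing the Stream finally triggers the remaining mutations.
    codes.foreach(_ => ())
    val afterTraversal = streamCtx.registered.size

    (afterListMap, afterStreamMap, afterTraversal)
  }
}
```

   In this sketch the `List` case records all three mutations during `map`, 
while the `Stream` case records fewer until the result is traversed. Any code 
that reads or changes the context between those two points sees an incomplete 
state.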
   
   Similar bugs have occurred at least three times in the past: 
https://issues.apache.org/jira/browse/SPARK-24500, 
https://issues.apache.org/jira/browse/SPARK-25767, 
https://issues.apache.org/jira/browse/SPARK-26680.
   
   The fix is to check if `ordering` is a `Stream` and force the modifications 
to happen immediately if so.
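
   One way to express "force if it is a `Stream`" is sketched below. This is 
an illustration under assumptions, not the code in this patch: `forceIfStream`, 
`FakeCtx`, and this `genCode` are hypothetical names, and the real change lives 
in `GenerateOrdering#createOrderKeys()`. The only library call relied on is 
`Stream#force`, which evaluates every element of the stream.

```scala
import scala.collection.mutable.ArrayBuffer

object ForceStreamDemo {
  // Hypothetical stand-in for the codegen context: it only records mutations.
  final class FakeCtx {
    val registered: ArrayBuffer[String] = ArrayBuffer.empty[String]
  }

  // Hypothetical stand-in for genCode(ctx): returns a value and mutates ctx.
  def genCode(name: String, ctx: FakeCtx): String = {
    ctx.registered += name
    s"code_$name"
  }

  // Hypothetical helper: evaluate every element if the Seq is a lazy Stream,
  // otherwise return the Seq unchanged.
  def forceIfStream[A](xs: Seq[A]): Seq[A] = xs match {
    case s: Stream[A] => s.force
    case other        => other
  }

  // Returns the number of mutations recorded right after the forced map.
  def run(): Int = {
    val ctx = new FakeCtx
    // Statically a Seq, dynamically a Stream, as in the scenario described.
    val ordering: Seq[String] = Stream("a", "b", "c")
    forceIfStream(ordering.map(genCode(_, ctx)))
    ctx.registered.size
  }
}
```

   Forcing right after `map` means all mutations to the context are complete 
before any later code observes it, which matches the behavior of strict 
collections such as `List`.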
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Added a unit test for `SortExec` where `sortOrder` is a `Stream`. The test 
previously failed and now passes.


