Re: [PR] [SPARK-50469][SQL] V1Writes should respect the output ordering [spark]

via GitHub Fri, 06 Dec 2024 13:06:16 -0800


wecharyu commented on PR #49027:
URL: https://github.com/apache/spark/pull/49027#issuecomment-2524174137


   > Why do you think it's bad to eliminate the user sort in the middle?
   
   We add a local sort on specific columns when writing data, which reduces 
file size because sorted data achieves a higher compression ratio. So the user 
sort is meaningful for the write.
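
A rough illustration of the compression point, as a sketch only: it uses `zlib` from the Python standard library as a stand-in for the file format's real codec, and the column values are made up for the example.

```python
import random
import zlib

def compressed_size(values):
    """Return the zlib-compressed size in bytes of the newline-joined values."""
    payload = "\n".join(values).encode("utf-8")
    return len(zlib.compress(payload))

# Hypothetical low-cardinality column: 20,000 rows drawn from 50 distinct keys.
rng = random.Random(42)
column = [f"user_{rng.randrange(50):03d}" for _ in range(20_000)]

unsorted_size = compressed_size(column)
sorted_size = compressed_size(sorted(column))  # same values, locally sorted

print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

Sorting groups equal values into long runs, which general-purpose compressors encode far more compactly. The actual gain for a given table depends on the file format, codec, and data distribution.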
   
   > And shall we be dumb and just don't remove this sort to sort twice?
   
   We also considered this solution, but only the top local sort needs to be 
considered in `V1Writes`. For example, suppose a write query has two original 
local sorts:
   ```text
    - Project [col1, col2]
       - Sort [col2 ASC NULLS FIRST], false
          - Repartition 1, true
             - Sort [col1 ASC NULLS FIRST], false
                - ...
   ```
   After inserting a local sort for the dynamic partition `part1`, the query 
plan becomes:
   ```text
    - Sort [part1 ASC NULLS FIRST], false
       - Project [col1, col2]
          - Sort [col2 ASC NULLS FIRST], false
             - Repartition 1, true
                - Sort [col1 ASC NULLS FIRST], false
                   - ...
   ```
   If we skip sort elimination in this case, each task executes three sorts, 
and `col1` ends up unordered within each partition of the final result. So 
`Sort [col1 ASC NULLS FIRST], false` is unnecessary and can be eliminated. 
Even worse, if the sort algorithm is not stable, `col2` may also end up 
unordered within each partition. Concatenating the required ordering with the 
output ordering resolves this issue.
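
The argument above can be sketched with plain Python lists. This is only an illustration under stated assumptions: the rows and the helper `ordered_within_partitions` are hypothetical, and Python's `sorted` is always stable, so the sketch shows the stable case and the concatenated ordering but cannot reproduce the unstable failure itself.

```python
import random
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of (part1, col1, col2), shuffled to mimic unsorted input.
rows = [(p, c1, c2) for p in ("a", "b") for c1 in range(6) for c2 in range(3)]
random.Random(0).shuffle(rows)

# Three chained sorts, as in the plan when no sort is eliminated.
by_col1 = sorted(rows, key=itemgetter(1))
by_col2 = sorted(by_col1, key=itemgetter(2))
three_sorts = sorted(by_col2, key=itemgetter(0))  # stable, keeps col2 order

# One sort on the concatenated ordering [part1, col2].
concatenated = sorted(rows, key=lambda r: (r[0], r[2]))

def ordered_within_partitions(data, col):
    """True if column `col` is ascending inside every run of equal part1."""
    for _, group in groupby(data, key=itemgetter(0)):
        values = [row[col] for row in group]
        if values != sorted(values):
            return False
    return True

print(ordered_within_partitions(three_sorts, 1))   # col1 order is lost
print(ordered_within_partitions(three_sorts, 2))   # col2 survives (stable sort)
print(ordered_within_partitions(concatenated, 2))  # col2 holds by construction
```

With the concatenated key, the `col2` ordering within each partition no longer relies on the stability of the final sort, and the innermost sort on `col1` contributes nothing to the result.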
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

