wecharyu commented on PR #49027:
URL: https://github.com/apache/spark/pull/49027#issuecomment-2524174137
> Why do you think it's bad to eliminate the user sort in the middle?
We add a local sort on specific columns when writing data, which reduces
file size because sorted data achieves a higher compression ratio. So the
user sort is meaningful when writing data.
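
To illustrate the compression point, here is a minimal sketch. It is not taken from the PR: the column values are hypothetical, and `zlib` is only a stand-in for whatever codec the file format actually uses. It compares the compressed size of the same low-cardinality column before and after sorting.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical low-cardinality column, e.g. a country code per row.
codes = ["CN", "US", "DE", "IN", "BR", "JP", "FR", "GB"]
values = random.choices(codes, k=10_000)


def compressed_size(column):
    """Size in bytes of the column after joining rows and compressing with zlib."""
    return len(zlib.compress("\n".join(column).encode("utf-8")))


unsorted_size = compressed_size(values)
sorted_size = compressed_size(sorted(values))

print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

Sorting turns the column into long runs of identical values, which a general-purpose compressor encodes far more compactly than the shuffled version.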
> And shall we be dumb and just don't remove this sort to sort twice?
We have also considered this solution, but only the top local sort
needs to be considered in `V1Writes`. For example, suppose a write query has two
original local sorts:
```text
- Project [col1, col2]
- Sort [col2 ASC NULLS FIRST], false
- Repartition 1, true
- Sort [col1 ASC NULLS FIRST], false
- ...
```
After inserting a local sort for the dynamic partition column `part1`, the
query plan becomes:
```text
- Sort [part1 ASC NULLS FIRST], false
- Project [col1, col2]
- Sort [col2 ASC NULLS FIRST], false
- Repartition 1, true
- Sort [col1 ASC NULLS FIRST], false
- ...
```
If we skip sort elimination in this case, each task would execute three
sorts, and `col1` in each partition would still be unordered in the final result.
So `Sort [col1 ASC NULLS FIRST], false` is unnecessary and can be eliminated.
Even worse, if the sort algorithm is not stable, `col2` may also be
unordered within each partition. Concatenating the required ordering
with the output ordering resolves this issue.
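
The stability argument can be sketched in plain Python. This is an illustration only, not Spark code: the rows and the "unstable outcome" are hypothetical and hand-written. Python's own `sorted` is stable, so the unstable case is spelled out as one arrangement that a sort on `part1` alone is allowed to return.

```python
from itertools import groupby

# Hypothetical rows (part1, col2), already ordered by col2 as the
# user's top local sort would leave them within one task.
rows = [("b", 1), ("a", 2), ("b", 3), ("a", 4), ("b", 5), ("a", 6)]


def col2_sorted_within_partitions(sorted_rows):
    """True if col2 is ascending inside every run of equal part1 values."""
    for _, group in groupby(sorted_rows, key=lambda r: r[0]):
        col2 = [r[1] for r in group]
        if col2 != sorted(col2):
            return False
    return True


# Required ordering alone, [part1]: an unstable sort may return rows with
# equal part1 in any order. This hand-written list is one legal outcome.
one_unstable_outcome = [("a", 6), ("a", 2), ("a", 4), ("b", 5), ("b", 1), ("b", 3)]

# Concatenated ordering, [part1, col2]: the result no longer depends on
# whether the sort is stable.
concatenated = sorted(rows, key=lambda r: (r[0], r[1]))

print(col2_sorted_within_partitions(one_unstable_outcome))
print(col2_sorted_within_partitions(concatenated))
```

With the concatenated key, the `col2` order inside each `part1` group is part of the sort contract rather than a side effect of the algorithm.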