mustafasrepo commented on PR #5171: URL: https://github.com/apache/arrow-datafusion/pull/5171#issuecomment-1420617520
> > > @alamb @ozankabak @mustafasrepo > > > Regarding the global sort replaced to a parallel version(SortPreservingMergeExec + Local Sort) optimization, I think there is already a rule `GlobalSortSelection` for the exact purpose. I think we should not let the Sort Enforcement rule to handle this again. Implement/enhance such optimization in the `GlobalSortSelection` rule is more straightforward and do not need to care the positions of the `CoalescePartitionsExec`. > > > > > > I am not sure how we can do all the local sort + merge substitutions just with `GlobalSortSelection`, which doesn't track coalesce operations on partitions as you rightly point out. Note that we handle (and parallel-optimize) not just top level sorts, but sorts at any depth within the plan, even with intermediate executors in between the coalesce operation and the sort in question. > > We will take a deeper look today and see if we can move over the logic to `GlobalSortSelection` while still preserving the same functionality. If we can, great -- if not, we will share an example that blocks this. Thank you for the suggestion 👍 > > Yes, please take a look at the `GlobalSortSelection` rule. This rule does not need to care about the position of `CoalescePartitionsExec` because `CoalescePartitionsExec`s are added by `EnforceDistribution` rule which is triggered after the `GlobalSortSelection` rule. The physical Sort Selection should happen in a very early stage of the physical optimization phase. I guess why the current `GlobalSortSelection` does not optimize all the `Global Sort` is because it is not that aggressive and has an additional check. If you comment that check, all the `Global Sort` should be replaced. > > `&& sort_exec.fetch().is_some()` @mingmwang your suggestion works. This greatly simplifies the code. Thanks for the suggestion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
