wolffcm commented on issue #7077: URL: https://github.com/apache/arrow-datafusion/issues/7077#issuecomment-1652746677
Looking into this more, I can see that `pushdown_sorts` will not push a `SortExec` through a `RepartitionExec` node. This makes sense: in general it means that the sort is performed on parallel streams of data, which is good.

The problem is that in some cases (including the case I care about) we could push the sort further down, through a `UnionExec` node, and end up not needing to sort one or more of the union's inputs, which is advantageous. Incidentally, if no `SortExec` nodes were needed at all, for example if both sides of the `UnionExec` were already sorted, I think that `replace_with_order_preserving_variants` would catch this case.

So I am left wanting to push a `SortExec` node through `RepartitionExec`s, but only some of the time. I think the right heuristic is: push down the `SortExec` if doing so would result in sorting fewer tuples overall. This is a new bit of analysis that I think `pushdown_sorts` will need to perform.

@alamb @mustafasrepo @ozankabak Do you all think that this approach makes sense?
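
To make the proposed heuristic concrete, here is a minimal sketch in Rust. It does not use DataFusion's real types: `Plan`, `total_rows`, `rows_sorted_if_pushed`, and `should_push_down` are hypothetical names invented for illustration, and the row counts stand in for whatever statistics the optimizer has available. It also assumes that a repartition below the pushed-down sort can preserve the ordering, and it ignores the cost of merging sorted streams.

```rust
/// Toy physical plan; a hypothetical stand-in for a tree of
/// `ExecutionPlan` nodes, not DataFusion's actual API.
#[derive(Debug, Clone)]
enum Plan {
    /// Leaf with an estimated row count. `sorted` records whether its
    /// output already satisfies the required ordering.
    Source { rows: u64, sorted: bool },
    /// Models `RepartitionExec`: the row count is unchanged.
    /// Assumption: ordering can be preserved through this node.
    Repartition(Box<Plan>),
    /// Models `UnionExec`: concatenates the rows of all inputs.
    Union(Vec<Plan>),
}

/// Rows a `SortExec` placed directly above `plan` would have to sort.
fn total_rows(plan: &Plan) -> u64 {
    match plan {
        Plan::Source { rows, .. } => *rows,
        Plan::Repartition(input) => total_rows(input),
        Plan::Union(inputs) => inputs.iter().map(total_rows).sum(),
    }
}

/// Rows that still need sorting if the sort is pushed all the way down
/// to the leaves: inputs that are already sorted contribute nothing.
fn rows_sorted_if_pushed(plan: &Plan) -> u64 {
    match plan {
        Plan::Source { rows, sorted } => {
            if *sorted {
                0
            } else {
                *rows
            }
        }
        Plan::Repartition(input) => rows_sorted_if_pushed(input),
        Plan::Union(inputs) => inputs.iter().map(rows_sorted_if_pushed).sum(),
    }
}

/// The heuristic from the comment: push the sort down only if that
/// strictly reduces the number of tuples that must be sorted.
fn should_push_down(plan: &Plan) -> bool {
    rows_sorted_if_pushed(plan) < total_rows(plan)
}
```

With a repartition over a union of one large sorted input and one small unsorted input, `should_push_down` returns `true`, because only the small input needs sorting after the pushdown. If neither input is sorted, both counts are equal and the sort stays above the repartition, keeping the current parallel behavior.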
