my-vegetable-has-exploded commented on issue #8393: URL: https://github.com/apache/arrow-datafusion/issues/8393#issuecomment-2008567024
> So, if i'm not mistaken, this issue is mostly about covering NLJoin in [join_selection.rs](https://github.com/apache/arrow-datafusion/blob/abb0c1f62bf622bd0e40769560cf0804dac2ecbf/datafusion/core/src/physical_optimizer/join_selection.rs).

I think it is a good idea to improve performance in this scenario, and your PR looks good to me. But I think it is also fine to keep the old parallelism strategy.

In my opinion, the old parallelism strategy should work, but the check in `enforce_distribution.rs` blocks the repartition because it looks at the row count. In this query, the row count of `pricing_state` is less than `batch_size`, and `RepartitionExec` also only works on one batch at a time.

https://github.com/apache/arrow-datafusion/blob/ad8d552b9f150c3c066b0764e84f72b667a649ff/datafusion/core/src/physical_optimizer/enforce_distribution.rs#L1099-L1106

I think another way may be to write a new enforce_distribution strategy for `NestedLoopJoin` and `CrossJoin`. We can compute `repartition_beneficial_stats` from the left table size multiplied by the right partition size rather than from the right partition size alone (taking a right join as an example).
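To make the suggestion concrete, here is a rough Rust sketch of the two checks. It is not DataFusion's actual code: the function names, the `Option<usize>` row estimates, and the choice to treat unknown statistics as "beneficial" are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for the existing row-count check: repartitioning
/// is treated as worthwhile only when the child is known to produce more
/// rows than fit in one batch. Unknown row counts default to `true`
/// (an assumption of this sketch).
fn beneficial_by_rows(rows: Option<usize>, batch_size: usize) -> bool {
    rows.map(|n| n > batch_size).unwrap_or(true)
}

/// Hypothetical variant for nested-loop and cross joins. For a right join
/// the left side is the collected side and the right side is streamed, so
/// the work per right partition scales with `left_rows * right_rows`
/// rather than with `right_rows` alone.
fn beneficial_for_nested_loop(
    left_rows: Option<usize>,
    right_rows: Option<usize>,
    batch_size: usize,
) -> bool {
    match (left_rows, right_rows) {
        // saturating_mul avoids overflow when both estimates are large.
        (Some(l), Some(r)) => l.saturating_mul(r) > batch_size,
        // Missing statistics on either side: fall back to repartitioning.
        _ => true,
    }
}
```

With a batch size of 8192, a right side of 100 rows fails the row-only check, but paired with a left side of 1,000,000 rows it passes the product-based check, which is the case described for `pricing_state`.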
