my-vegetable-has-exploded commented on issue #8393:
URL: 
https://github.com/apache/arrow-datafusion/issues/8393#issuecomment-2008567024

   > So, if i'm not mistaken, this issue is mostly about covering NLJoin in 
[join_selection.rs](https://github.com/apache/arrow-datafusion/blob/abb0c1f62bf622bd0e40769560cf0804dac2ecbf/datafusion/core/src/physical_optimizer/join_selection.rs).
   > 
   
   I think it is a good idea to improve performance in this scenario. Your pr 
is also good for me. But I think it is also ok to keep old parallelism 
strategy. In my opinion,  the old paralleism strategy should works, but the 
check in `enforce_distribution.rs` block the reparition of it whick would check 
the row number. In this query, `pricing_state` 's row numbers is less than 
batch_size and the `RepartitionExec` also just works for a batch a time.
   
   
https://github.com/apache/arrow-datafusion/blob/ad8d552b9f150c3c066b0764e84f72b667a649ff/datafusion/core/src/physical_optimizer/enforce_distribution.rs#L1099-L1106
   
   I think it may another way to write a new enforce_distribution strategy for 
`NestLoopJoin` and `CrossJoin`. We can check `repartition_beneficial_stats` by 
the left table size multiply right partition size rather than just right 
partition size (take RIGHTJOIN for example).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to