crepererum commented on PR #6929:
URL: 
https://github.com/apache/arrow-datafusion/pull/6929#issuecomment-1637677431

   @YjyJeff Please read the reasoning here before experimenting with unbounded 
constructs:
   
   
https://github.com/apache/arrow-datafusion/blob/49583bd5010282ca126e75100dce958aa346e5ee/datafusion/core/src/physical_plan/repartition/distributor_channels.rs#L18-L39
   
   The main reason we are NOT using unbounded channels is that they buffer large (potentially unbounded) quantities of data in many scenarios. This mostly happens when the repartition producer side (e.g. trivial data reading, such as highly compressed Parquet files) is far faster than the consumer side (e.g. complicated transforms or costly aggregations).
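   To illustrate the backpressure argument, here is a minimal toy sketch (not DataFusion code; names like `run_pipeline` are made up) using a bounded `std::sync::mpsc::sync_channel`: a fast producer can never get more than `capacity` batches ahead of a slow consumer, whereas an unbounded channel would let the whole input pile up in memory.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Toy pipeline: a fast producer sends `n` 1 KiB "batches" through a bounded
/// channel to a slow consumer. Returns the total bytes received.
fn run_pipeline(capacity: usize, n: usize) -> usize {
    let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(capacity);

    let producer = thread::spawn(move || {
        for i in 0..n {
            // With a bounded channel this `send` blocks once `capacity`
            // batches are queued, so at most `capacity` batches sit in
            // memory no matter how fast the producer is. An unbounded
            // channel would happily queue all `n` batches at once.
            tx.send(vec![(i % 256) as u8; 1024]).unwrap();
        }
    });

    let mut received = 0;
    for batch in rx {
        // Simulate an expensive consumer (e.g. a costly aggregation).
        thread::sleep(Duration::from_millis(1));
        received += batch.len();
    }
    producer.join().unwrap();
    received
}

fn main() {
    println!("received {} bytes", run_pipeline(4, 100));
}
```

   The bounded variant trades a bit of producer throughput for a hard cap on in-flight memory, which is the trade-off at stake here.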
   
   Note that I'm NOT saying that the current impl. is good (let alone perfect); there are likely better options. But just slapping an unbounded channel on this problem is not going to solve it. Sure, that wins some micro-benchmarks, but it fails to provide a robust foundation for query execution.[^1]
   
   I agree w/ @ozankabak & Co though that SOME buffering is OK. So I think the following config option would be robust, reasonably fast, and unsurprising to the user: an option "repartition buffer bytes OR messages" (or a similar name) that limits bytes or messages per channel (not per repartition, otherwise the cross-comm overhead is too high); only once this limit is hit do we fall back to the cross-channel gating behavior described in the code comment linked above (namely: we let data flow freely as long as there's at least 1 empty channel).
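   The proposed admission rule could be sketched roughly like this (all names here — `ChannelState`, `RepartitionLimits`, `can_send` — are hypothetical, not existing DataFusion types):

```rust
/// Per-channel queue state (hypothetical sketch, not DataFusion code).
struct ChannelState {
    queued_bytes: usize,
    queued_messages: usize,
}

/// Hypothetical config: per-channel limits on buffered bytes OR messages.
struct RepartitionLimits {
    max_bytes_per_channel: usize,
    max_messages_per_channel: usize,
}

/// A send to `target` is allowed while that channel is under both limits.
/// Once a limit is hit, fall back to the cross-channel gating described in
/// the linked code comment: data still flows freely as long as at least one
/// channel is empty (i.e. some consumer is starved).
fn can_send(target: usize, channels: &[ChannelState], limits: &RepartitionLimits) -> bool {
    let t = &channels[target];
    let under_limit = t.queued_bytes < limits.max_bytes_per_channel
        && t.queued_messages < limits.max_messages_per_channel;
    let any_empty = channels.iter().any(|c| c.queued_messages == 0);
    under_limit || any_empty
}

fn main() {
    let limits = RepartitionLimits {
        max_bytes_per_channel: 1024,
        max_messages_per_channel: 2,
    };
    let channels = [
        ChannelState { queued_bytes: 2048, queued_messages: 2 }, // over both limits
        ChannelState { queued_bytes: 512, queued_messages: 1 },  // under both limits
    ];
    // Channel 0 is over its limits and no channel is empty: blocked.
    println!("send to 0 allowed: {}", can_send(0, &channels, &limits));
    // Channel 1 is under its limits: allowed.
    println!("send to 1 allowed: {}", can_send(1, &channels, &limits));
}
```

   The key property is that a single slow consumer can only stall producers once every channel has something queued, so the common fast path stays uncontended.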
   
   Ref #4867 and #4865.
   
   [^1]: If someone finds a good metric, benchmark, or test for robustness, 
please open a ticket / PR. I am quite unhappy that this is currently mostly 
based on engineering intuition.

