Re: [PR] Feat: Revive to use upstream arrow coalesce [datafusion]

via GitHub Wed, 13 Aug 2025 01:18:45 -0700


2010YOUY01 commented on PR #17105:
URL: https://github.com/apache/datafusion/pull/17105#issuecomment-3182704380


   For the `tpch_mem` slowdown, another possible cause is unnecessary copying 
of batches that already contain exactly `batch_size` rows.
   
   Some operators may already have an internal mechanism that ensures their 
output is exactly `batch_size` rows. From a quick look at the implementation, 
the old version could pass such batches through directly, whereas this PR 
forces them to be copied.
   
   Another potential improvement: could we make the pass-through threshold 
more lenient? For example, if the coalescer receives a batch with at least 
`batch_size / 2` rows, it could pass it through without coalescing. A batch 
of that size is already large enough to benefit from vectorization, so the 
extra concatenation might not add much value.
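
To make the idea concrete, here is a minimal, standalone sketch of a coalescer with a configurable pass-through threshold. It is not the arrow or DataFusion implementation: `Batch`, `Coalescer`, `passthrough_min`, and `rows_copied` are hypothetical names, and a batch is modeled as a plain `Vec<i64>` rather than a `RecordBatch`. The sketch also assumes pass-through is only allowed when nothing is buffered, so that row order is preserved.

```rust
/// Simplified stand-in for a record batch: just a vector of row values.
type Batch = Vec<i64>;

/// Hypothetical coalescer with a configurable pass-through threshold.
struct Coalescer {
    /// Target number of rows per output batch.
    batch_size: usize,
    /// Batches with at least this many rows (and at most `batch_size`)
    /// are forwarded without being copied into the buffer.
    passthrough_min: usize,
    /// Rows accumulated from small input batches.
    buffer: Batch,
    /// Counts rows that were copied into the buffer (for illustration).
    rows_copied: usize,
}

impl Coalescer {
    fn new(batch_size: usize, passthrough_min: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self { batch_size, passthrough_min, buffer: Vec::new(), rows_copied: 0 }
    }

    /// Push one input batch and return any output batches that are ready.
    fn push(&mut self, batch: Batch) -> Vec<Batch> {
        let mut out = Vec::new();
        // Assumption: only pass through when the buffer is empty, otherwise
        // the forwarded batch would overtake rows that arrived earlier.
        let large_enough = batch.len() >= self.passthrough_min;
        if self.buffer.is_empty() && large_enough && batch.len() <= self.batch_size {
            out.push(batch); // moved, not copied
            return out;
        }
        // Otherwise copy the rows into the buffer and emit full batches.
        self.rows_copied += batch.len();
        self.buffer.extend(batch);
        while self.buffer.len() >= self.batch_size {
            let rest = self.buffer.split_off(self.batch_size);
            out.push(std::mem::replace(&mut self.buffer, rest));
        }
        out
    }

    /// Flush whatever is left in the buffer at end of input.
    fn finish(&mut self) -> Option<Batch> {
        if self.buffer.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buffer))
        }
    }
}
```

With `passthrough_min == batch_size` this models the exact-size pass-through described above, and with `passthrough_min == batch_size / 2` it models the more lenient variant. An alternative design would flush the partially filled buffer first and then forward the large batch, at the cost of emitting one smaller batch.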


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


