zhuqi-lucas commented on PR #8146:
URL: https://github.com/apache/arrow-rs/pull/8146#issuecomment-3194399065

   Optimized behavior (biggest_coalesce_batch_size = Some(limit)) — three cases
   
   - Case 1 — Empty buffer + large incoming batch (Direct bypass)
   
   Condition: incoming.size > limit and buffered_rows == 0.
   
   Action: Bypass coalescing; output the incoming batch unchanged.
   
   Example: limit=500, incoming 600 → output [600].
   
   - Case 2 — Buffer already large + large incoming batch (Flush then bypass)
   
   Condition: incoming.size > limit and buffered_rows > limit.
   
   Action: First flush the buffered rows as one output, then bypass and output the incoming batch unchanged.
   
   Purpose: Avoid merging an already large buffer with a large incoming batch into a single oversized batch.
   
   Example: limit=400, buffer 350+200=550, incoming 800 → outputs [550], [800].
   
   - Case 3 — Small buffer + large incoming batch (Normal coalesce/split)
   
   Condition: incoming.size > limit and 0 < buffered_rows <= limit.
   
   Action: Follow normal merging and splitting rules (merge buffer + incoming, then split by target_batch_size).
   
   Example: limit=500, target_batch_size=1000 (implied by the split), buffer 300, incoming 1200 → merge to 1500, output [1000], and keep 500 rows buffered.
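
The three cases above can be summarized as a small decision function. This is only a sketch of the rules as described in this comment, not the code from the PR: `Action` and `choose_action` are hypothetical names, and the fall-through to normal coalescing when no limit is set or the incoming batch is not larger than the limit is an assumption.

```rust
/// Hypothetical outcome type for illustration; not part of arrow-rs.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    /// Case 1: output the incoming batch unchanged.
    Bypass,
    /// Case 2: flush buffered rows as one batch, then output incoming unchanged.
    FlushThenBypass,
    /// Case 3 (and, by assumption, every other input): merge, then split
    /// by target_batch_size.
    Coalesce,
}

/// Picks the action for an incoming batch, following the three cases above.
/// `limit` mirrors `biggest_coalesce_batch_size`.
fn choose_action(limit: Option<usize>, buffered_rows: usize, incoming_rows: usize) -> Action {
    match limit {
        Some(limit) if incoming_rows > limit => {
            if buffered_rows == 0 {
                Action::Bypass // Case 1: empty buffer
            } else if buffered_rows > limit {
                Action::FlushThenBypass // Case 2: buffer already large
            } else {
                Action::Coalesce // Case 3: small, non-empty buffer
            }
        }
        // Assumption: without a limit, or for a small incoming batch,
        // the normal coalescing path applies.
        _ => Action::Coalesce,
    }
}
```

With the examples above, `choose_action(Some(500), 0, 600)` selects the bypass, `choose_action(Some(400), 550, 800)` selects flush-then-bypass, and `choose_action(Some(500), 300, 1200)` selects the normal coalesce path.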

