zhuqi-lucas commented on PR #8146: URL: https://github.com/apache/arrow-rs/pull/8146#issuecomment-3194399065
Optimized behavior (biggest_coalesce_batch_size = Some(limit)): three cases

- Case 1: Empty buffer + large incoming batch (direct bypass)
  - Condition: incoming.size > limit and buffered_rows == 0.
  - Action: Bypass coalescing; output the incoming batch unchanged.
  - Example: limit=500, incoming 600 → output [600].
- Case 2: Buffer already large + large incoming batch (flush, then bypass)
  - Condition: incoming.size > limit and buffered_rows > limit.
  - Action: First flush the buffered rows as one output batch, then bypass coalescing and output the incoming batch unchanged.
  - Purpose: Avoid merging two already-large inputs into one batch far bigger than callers expect.
  - Example: limit=400, buffer 350+200=550, incoming 800 → outputs [550], [800].
- Case 3: Small buffer + large incoming batch (normal coalesce/split)
  - Condition: incoming.size > limit and 0 < buffered_rows <= limit.
  - Action: Follow the normal merging and splitting rules (merge buffer + incoming, then split by target_batch_size).
  - Example (assuming target_batch_size=1000): limit=500, buffer 300, incoming 1200 → merge to 1500, output [1000], and keep the remaining 500 buffered.
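The three cases can be sketched as a small row-count simulation. This is only an illustration of the decision logic described above, not the code in the PR: `SketchCoalescer`, `push`, `flush`, and `run` are hypothetical names, batches are modeled as plain row counts, and `target_batch_size=1000` is assumed for the examples.

```rust
/// Hypothetical stand-in for a coalescer; tracks row counts only, no real arrays.
struct SketchCoalescer {
    target_batch_size: usize,
    /// Plays the role of biggest_coalesce_batch_size in the description above.
    limit: Option<usize>,
    buffered_rows: usize,
    /// Row counts of the batches emitted so far.
    completed: Vec<usize>,
}

impl SketchCoalescer {
    fn new(target_batch_size: usize, limit: Option<usize>) -> Self {
        Self { target_batch_size, limit, buffered_rows: 0, completed: Vec::new() }
    }

    /// Emit whatever is buffered as a single batch.
    fn flush(&mut self) {
        if self.buffered_rows > 0 {
            self.completed.push(self.buffered_rows);
            self.buffered_rows = 0;
        }
    }

    fn push(&mut self, incoming: usize) {
        if let Some(limit) = self.limit {
            if incoming > limit {
                if self.buffered_rows == 0 {
                    // Case 1: empty buffer, pass the large batch through unchanged.
                    self.completed.push(incoming);
                    return;
                }
                if self.buffered_rows > limit {
                    // Case 2: flush the large buffer first, then bypass.
                    self.flush();
                    self.completed.push(incoming);
                    return;
                }
                // Case 3: small buffer, fall through to normal coalescing.
            }
        }
        // Normal path: merge, then split off full target-sized batches.
        self.buffered_rows += incoming;
        while self.buffered_rows >= self.target_batch_size {
            self.completed.push(self.target_batch_size);
            self.buffered_rows -= self.target_batch_size;
        }
    }
}

/// Feed a sequence of incoming row counts; return (emitted batches, rows still buffered).
fn run(target_batch_size: usize, limit: Option<usize>, incoming: &[usize]) -> (Vec<usize>, usize) {
    let mut c = SketchCoalescer::new(target_batch_size, limit);
    for &rows in incoming {
        c.push(rows);
    }
    (c.completed, c.buffered_rows)
}
```

Under these assumptions, `run(1000, Some(500), &[600])` reproduces Case 1, `run(1000, Some(400), &[350, 200, 800])` reproduces Case 2, and `run(1000, Some(500), &[300, 1200])` reproduces Case 3, leaving 500 rows buffered.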