fresh-borzoni opened a new pull request, #443:
URL: https://github.com/apache/fluss-rust/pull/443

   ## Summary
   closes #431
   
   Arrow log batches were previously capped at a hard 256-record limit 
regardless of actual data size, meaning a batch of 256 tiny ints (~1KB) was 
treated the same as 256 large rows (~10MB). This PR replaces that fixed cap 
with a byte-size-based fullness check matching Java's ArrowWriter, so batches 
now fill to the configured writer_batch_size (default 2MB).
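   A minimal sketch of the byte-size-based fullness idea (type and field 
names here are hypothetical, not the actual fluss-rust API): a batch is 
"full" when its estimated serialized size reaches the byte budget, with no 
fixed record count.

```rust
/// Hypothetical sketch: batch fullness driven by estimated bytes, not a
/// record cap. Names and structure are illustrative only.
const DEFAULT_WRITER_BATCH_SIZE: usize = 2 * 1024 * 1024; // 2 MiB default

struct ArrowLogBatch {
    estimated_bytes: usize,
    batch_size_limit: usize,
}

impl ArrowLogBatch {
    fn new(batch_size_limit: usize) -> Self {
        Self { estimated_bytes: 0, batch_size_limit }
    }

    /// Track the estimated serialized size of one appended row.
    fn append(&mut self, row_bytes: usize) {
        self.estimated_bytes += row_bytes;
    }

    /// Full once the byte budget is spent -- no record-count limit.
    fn is_full(&self) -> bool {
        self.estimated_bytes >= self.batch_size_limit
    }
}

fn main() {
    let mut batch = ArrowLogBatch::new(DEFAULT_WRITER_BATCH_SIZE);
    // 256 tiny rows (~4 bytes each) no longer fill the batch...
    for _ in 0..256 {
        batch.append(4);
    }
    assert!(!batch.is_full());
    // ...while large rows fill it as soon as the byte budget is reached.
    for _ in 0..512 {
        batch.append(4096);
    }
    assert!(batch.is_full());
}
```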
   
   The fullness check uses a threshold-based optimization to avoid computing 
sizes on every append: it estimates how many records should fit, skips checks 
until that count is reached, then recalculates. Size estimation reads buffer 
lengths directly from the Arrow builders (O(num_columns), zero allocation), 
plus a pre-computed IPC overhead constant that captures the metadata and body 
framing for the schema.
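   The threshold optimization described above can be sketched roughly like 
this (hypothetical names; the real implementation reads buffer lengths from 
Arrow builders, which is simplified away here):

```rust
/// Hypothetical sketch of the threshold-based size check: instead of
/// estimating the batch size on every append, project how many more records
/// should fit at the current average row size and skip size computations
/// until that many records have been appended.
struct SizeCheckThreshold {
    records_until_check: usize,
}

impl SizeCheckThreshold {
    fn new() -> Self {
        Self { records_until_check: 0 }
    }

    /// Returns true when a real size estimate should be computed now.
    fn should_check(&mut self) -> bool {
        if self.records_until_check == 0 {
            true
        } else {
            self.records_until_check -= 1;
            false
        }
    }

    /// After measuring `current_bytes` across `num_records`, project how
    /// many more records fit before `limit_bytes` and skip checks until then.
    fn recalculate(&mut self, current_bytes: usize, num_records: usize, limit_bytes: usize) {
        if num_records == 0 || current_bytes == 0 {
            return;
        }
        let avg_row_bytes = (current_bytes / num_records).max(1);
        let remaining = limit_bytes.saturating_sub(current_bytes);
        self.records_until_check = remaining / avg_row_bytes;
    }
}

fn main() {
    let mut t = SizeCheckThreshold::new();
    assert!(t.should_check()); // nothing projected yet: check immediately
    // 100 records measured at 1,000 bytes against a 2,000-byte limit:
    // roughly 100 more records should fit, so checks are skipped until then.
    t.recalculate(1_000, 100, 2_000);
    assert!(!t.should_check());
}
```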
   
   Compression is accounted for via an adaptive ArrowCompressionRatioEstimator 
shared across batches for the same table. It starts at 1.0 (assume no 
compression) and adjusts asymmetrically after each batch is serialized: quick 
to increase when compression worsens, slow to decrease when it improves. This 
matches Java's conservative approach and avoids underestimating batch sizes.
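   The asymmetric update can be sketched as follows. The step constants below 
are illustrative assumptions, not the PR's actual values; only the shape of 
the update (fast up, slow down, start at 1.0) comes from the description 
above.

```rust
/// Hypothetical sketch of an adaptive compression ratio estimator.
/// ratio = compressed_size / uncompressed_size; 1.0 means "no compression".
/// The estimate rises quickly when observed compression worsens but falls
/// slowly when it improves, keeping size estimates conservative.
const IMPROVING_STEP: f64 = 0.005; // slow decrease (assumed value)
const DETERIORATING_STEP: f64 = 0.05; // quick increase (assumed value)

struct ArrowCompressionRatioEstimator {
    ratio: f64,
}

impl ArrowCompressionRatioEstimator {
    fn new() -> Self {
        Self { ratio: 1.0 } // start by assuming no compression
    }

    fn ratio(&self) -> f64 {
        self.ratio
    }

    /// Update with the ratio observed after serializing one batch.
    fn update(&mut self, observed: f64) {
        if observed > self.ratio {
            // Compression worsened: move up quickly (capped at observed).
            self.ratio = (self.ratio + DETERIORATING_STEP).min(observed);
        } else {
            // Compression improved: move down slowly (floored at observed).
            self.ratio = (self.ratio - IMPROVING_STEP).max(observed);
        }
    }
}

fn main() {
    let mut est = ArrowCompressionRatioEstimator::new();
    est.update(0.4); // much better compression, but ratio only inches down
    assert!((est.ratio() - 0.995).abs() < 1e-9);
    est.update(1.2); // compression worsened: ratio climbs ten times faster
    assert!(est.ratio() > 1.0);
}
```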
   
   Also aligns VARIABLE_WIDTH_AVG_BYTES (64 -> 8) and INITIAL_ROW_CAPACITY (256 
-> 1024) with Java Arrow defaults.
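   In Rust terms the aligned defaults amount to something like this (constant 
names are from the PR text; the declarations themselves are illustrative):

```rust
// Defaults aligned with Java Arrow, per the PR description:
// average estimated bytes per variable-width value, and the initial
// row capacity pre-allocated in builders.
const VARIABLE_WIDTH_AVG_BYTES: usize = 8; // was 64
const INITIAL_ROW_CAPACITY: usize = 1024; // was 256

fn main() {
    assert_eq!(VARIABLE_WIDTH_AVG_BYTES, 8);
    assert_eq!(INITIAL_ROW_CAPACITY, 1024);
}
```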
   
   Writer pooling (ArrowWriterPool) and DynamicWriteBatchSizeEstimator are out 
of scope -> TODOs added.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
