pchintar opened a new issue, #9762:
URL: https://github.com/apache/arrow-rs/issues/9762

   **Description:**
   
   In `arrow-ipc/src/writer.rs`, `flush()` is called unconditionally in:
   
   * `write_body_buffers`
   * `write_continuation`
   
   Both functions are part of `write_message`, which is executed for every 
record batch and dictionary batch. As a result, each batch write triggers 
multiple `flush()` calls.
   
   This places `flush()` directly in the hot path of IPC writing.
   
   **Problem:**
   
   `flush()` is not a cheap operation. For buffered writers (e.g., 
`BufWriter`), it:
   
   * forces buffered data to be written immediately
   * breaks write coalescing
   * increases the number of write syscalls when the underlying writer is a 
file or socket
   * adds a user/kernel transition for each of those extra syscalls
   
   This leads to a flow like:
   
   ```text
   batch 1: write → flush
   batch 2: write → flush
   batch 3: write → flush
   ...
   ```
   
   instead of allowing batches to be written continuously and flushed only at 
an explicit boundary:
   
   ```text
   batch 1: write
   batch 2: write
   batch 3: write
   ...
   final boundary: flush
   ```
   
   The current path performs flush work on every batch write. The proposed 
change removes that cost from the per-message path and flushes only at the 
writer boundary.
   
   For workloads with many batches (e.g., streaming or high-partition data), 
the current behavior introduces unnecessary I/O overhead and reduces throughput.
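
The effect on write coalescing can be reproduced with the standard library alone. The sketch below is not arrow-rs code: `CountingSink` and `count_sink_writes` are hypothetical names, and the 64-byte message and 8 KiB buffer are assumed sizes chosen so that all messages fit in the buffer. It counts how many `write` calls reach the underlying sink with and without a flush after every message.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical sink that counts how many `write` calls reach it.
/// For a real file or socket, each such call would be a write syscall.
struct CountingSink {
    writes: usize,
    bytes: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.writes += 1;
        self.bytes += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Writes `batches` small messages through a `BufWriter`, optionally
/// flushing after each one, and returns the number of sink writes.
fn count_sink_writes(batches: usize, flush_each: bool) -> io::Result<usize> {
    let sink = CountingSink { writes: 0, bytes: 0 };
    let mut writer = BufWriter::with_capacity(8 * 1024, sink);
    let message = [0u8; 64]; // assumed message size, smaller than the buffer

    for _ in 0..batches {
        writer.write_all(&message)?;
        if flush_each {
            // Mirrors the per-message flush described in this issue.
            writer.flush()?;
        }
    }
    // Single flush at the writer boundary.
    writer.flush()?;

    // Both modes deliver the same bytes; only the number of writes differs.
    assert_eq!(writer.get_ref().bytes, batches * message.len());
    Ok(writer.get_ref().writes)
}
```

Under these assumptions, `count_sink_writes(100, true)` reports 100 sink writes and `count_sink_writes(100, false)` reports 1, because the 6400 bytes stay in the buffer until the final flush.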
   
   **Key point:**
   
   These `flush()` calls are not required for correctness:
   
   * IPC message boundaries are already explicitly defined
   * the writers provide no durability guarantee either way (they do not call 
`fsync`/`sync_all`)
   * explicit flush control already exists via:
   
     * `FileWriter::finish()`
     * public `flush()` methods on both writers
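
To illustrate why explicit message boundaries make per-message flushing unnecessary for correctness, here is a simplified framing sketch. It is not the real Arrow IPC encoding (which stores the body length inside flatbuffer metadata and pads to 8 bytes); it only assumes a continuation marker followed by a little-endian length prefix. `write_framed`, `read_framed`, and `round_trip` are hypothetical helpers. Several messages are written back to back with no flush in between, and the reader still recovers each one from the length prefix.

```rust
use std::io::{self, Cursor, Read, Write};

/// Continuation marker placed in front of every message in this sketch.
const CONTINUATION: [u8; 4] = [0xFF; 4];

/// Writes one message: marker, little-endian i32 length, then the payload.
fn write_framed<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    w.write_all(&CONTINUATION)?;
    w.write_all(&(payload.len() as i32).to_le_bytes())?;
    w.write_all(payload)
}

/// Reads one message, or returns `None` when the input ends before a marker.
fn read_framed<R: Read>(r: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut marker = [0u8; 4];
    match r.read_exact(&mut marker) {
        Ok(()) => {}
        // Simplification: any short read here is treated as end of stream.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    if marker != CONTINUATION {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "missing continuation marker",
        ));
    }
    let mut len_bytes = [0u8; 4];
    r.read_exact(&mut len_bytes)?;
    let len = i32::from_le_bytes(len_bytes);
    if len < 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "negative length"));
    }
    let mut payload = vec![0u8; len as usize];
    r.read_exact(&mut payload)?;
    Ok(Some(payload))
}

/// Writes all messages into one contiguous buffer, then parses them back.
fn round_trip(messages: &[&[u8]]) -> io::Result<Vec<Vec<u8>>> {
    let mut bytes = Vec::new();
    for m in messages {
        write_framed(&mut bytes, m)?; // no flush between messages
    }
    let mut cursor = Cursor::new(bytes);
    let mut out = Vec::new();
    while let Some(payload) = read_framed(&mut cursor)? {
        out.push(payload);
    }
    Ok(out)
}
```

The reader depends only on the bytes it receives, not on how the writer grouped them into underlying writes, which is the property the first bullet relies on.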
   
   **Proposed change:**
   
   * Remove `flush()` from:
   
     * `write_body_buffers`
     * `write_continuation`
   * Keep flushing at appropriate boundaries:
   
     * retain `FileWriter::finish()`
     * add `self.writer.flush()?` to `StreamWriter::finish()`
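
The shape of the proposed change can be sketched as follows. This is a minimal stand-in under stated assumptions, not the arrow-ipc implementation: `SketchWriter` and `FlushCounter` are hypothetical types, and `write_message` here takes raw byte slices rather than encoded IPC data. The per-message path only writes, and `finish()` writes an end-of-stream marker and flushes once.

```rust
use std::io::{self, Write};

/// Hypothetical sink that records bytes and counts `flush` calls.
#[derive(Default)]
struct FlushCounter {
    data: Vec<u8>,
    flushes: usize,
}

impl Write for FlushCounter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.data.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.flushes += 1;
        Ok(())
    }
}

/// End-of-stream marker used by this sketch: continuation bytes plus a zero length.
const END_OF_STREAM: [u8; 8] = [0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0, 0];

/// Hypothetical stand-in for a stream writer; not the arrow-ipc type.
struct SketchWriter<W: Write> {
    writer: W,
    finished: bool,
}

impl<W: Write> SketchWriter<W> {
    fn new(writer: W) -> Self {
        Self { writer, finished: false }
    }

    /// Per-message path: writes bytes only and never flushes.
    fn write_message(&mut self, metadata: &[u8], body: &[u8]) -> io::Result<()> {
        self.writer.write_all(metadata)?;
        self.writer.write_all(body)
    }

    /// Explicit flush for callers that need data pushed out mid-stream.
    fn flush(&mut self) -> io::Result<()> {
        self.writer.flush()
    }

    /// Writer boundary: emit the end-of-stream marker, then flush once.
    fn finish(&mut self) -> io::Result<()> {
        if self.finished {
            return Ok(());
        }
        self.writer.write_all(&END_OF_STREAM)?;
        self.writer.flush()?;
        self.finished = true;
        Ok(())
    }
}
```

With this layout, writing any number of messages triggers no flush on the underlying writer, and a caller that needs lower latency can still call `flush()` at the points it chooses.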
   
   **Benefit:**
   
   * eliminates unnecessary per-message flushes in the hot path
   * reduces syscall and kernel overhead
   * improves effective buffering and write coalescing without affecting 
correctness

