pchintar opened a new issue, #9762:
URL: https://github.com/apache/arrow-rs/issues/9762
**Description:**
In `arrow-ipc/src/writer.rs`, `flush()` is called unconditionally in:
* `write_body_buffers`
* `write_continuation`
Both functions are part of `write_message`, which is executed for every
record batch and dictionary batch. As a result, each batch write triggers
multiple `flush()` calls.
This places `flush()` directly in the hot path of IPC writing.
**Problem:**
`flush()` is not a cheap operation. For buffered writers (e.g.,
`BufWriter`), it:
* forces buffered data to be written immediately
* breaks write coalescing
* increases the number of write syscalls
* adds a user/kernel transition for each of those syscalls
This leads to a flow like:
```text
batch 1: write → flush
batch 2: write → flush
batch 3: write → flush
...
```
instead of allowing batches to be written continuously and flushed only at
an explicit boundary:
```text
batch 1: write
batch 2: write
batch 3: write
...
final boundary: flush
```
That difference is important: the current path forces flush work into every
batch write, while the proposed change removes that cost from the per-message
path and keeps flushing at the writer boundary where it belongs.
For workloads with many batches (e.g., streaming or high-partition data),
the current behavior introduces unnecessary I/O overhead and reduces throughput.
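The two flows above can be compared with a small, standard-library-only sketch. It is not taken from `arrow-ipc`: `CountingSink` and `count_sink_writes` are hypothetical names, and the 64-byte payload and 8 KiB buffer are arbitrary assumptions chosen so that all batches fit in the buffer.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical sink that counts how many `write` calls reach it.
/// Each call stands in for one write syscall on a real file or socket.
struct CountingSink {
    writes: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.writes += 1;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Writes `batches` small payloads through a `BufWriter`, optionally
/// flushing after each one, and returns how many writes reached the sink.
fn count_sink_writes(batches: usize, flush_each: bool) -> io::Result<usize> {
    let mut w = BufWriter::with_capacity(8 * 1024, CountingSink { writes: 0 });
    let payload = [0u8; 64]; // stand-in for one small encoded batch

    for _ in 0..batches {
        w.write_all(&payload)?;
        if flush_each {
            // Per-batch flush: pushes the buffered bytes to the sink now.
            w.flush()?;
        }
    }

    // Single flush at the final boundary.
    w.flush()?;
    Ok(w.get_ref().writes)
}
```

Under these assumptions, flushing after every batch produces one sink write per batch, while flushing only at the end lets the `BufWriter` coalesce all batches into a single write. With larger batches the buffer would spill on its own, so the gap depends on batch size relative to buffer capacity.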
**Key point:**
These `flush()` calls are not required for correctness:
* IPC message boundaries are already explicitly defined
* no durability guarantees are provided (no `fsync`/`sync_all`)
* explicit flush control already exists via:
  * `FileWriter::finish()`
  * public `flush()` methods on both writers
**Proposed change:**
* Remove `flush()` from:
  * `write_body_buffers`
  * `write_continuation`
* Keep flushing at appropriate boundaries:
  * retain `FileWriter::finish()`
  * add `self.writer.flush()?` to `StreamWriter::finish()`
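A minimal sketch of the shape this change aims for, assuming a simplified stand-in rather than the real `arrow-ipc` types: `SketchStreamWriter`, `FlushCounter`, and `run_sketch` are hypothetical, and the length-prefixed framing is a placeholder, not the Arrow IPC encoding.

```rust
use std::io::{self, Write};

/// Hypothetical test sink that records bytes and counts `flush` calls.
struct FlushCounter {
    bytes: Vec<u8>,
    flushes: usize,
}

impl Write for FlushCounter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.bytes.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.flushes += 1;
        Ok(())
    }
}

/// Simplified stand-in for a stream writer; not the arrow-ipc `StreamWriter`.
struct SketchStreamWriter<W: Write> {
    writer: W,
    finished: bool,
}

impl<W: Write> SketchStreamWriter<W> {
    fn new(writer: W) -> Self {
        Self { writer, finished: false }
    }

    /// Writes one message with placeholder framing and does NOT flush.
    fn write_message(&mut self, body: &[u8]) -> io::Result<()> {
        let len = body.len() as u32;
        self.writer.write_all(&len.to_le_bytes())?;
        self.writer.write_all(body)
    }

    /// Writes a placeholder end marker, then flushes once at the boundary.
    fn finish(&mut self) -> io::Result<()> {
        if self.finished {
            return Ok(());
        }
        self.writer.write_all(&0u32.to_le_bytes())?;
        self.writer.flush()?;
        self.finished = true;
        Ok(())
    }
}

/// Writes `batches` 16-byte messages, finishes, and returns
/// (number of flushes, total bytes written).
fn run_sketch(batches: usize) -> io::Result<(usize, usize)> {
    let sink = FlushCounter { bytes: Vec::new(), flushes: 0 };
    let mut w = SketchStreamWriter::new(sink);
    for _ in 0..batches {
        w.write_message(&[0u8; 16])?;
    }
    w.finish()?;
    Ok((w.writer.flushes, w.writer.bytes.len()))
}
```

In this sketch the flush count stays at one regardless of how many messages are written, while the bytes produced are unchanged. A caller that needs data pushed out mid-stream would still call an explicit flush between messages.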
**Benefit:**
* eliminates unnecessary per-message flushes in the hot path
* reduces syscall and kernel overhead
* improves effective buffering and write coalescing without affecting
correctness