ding-young commented on issue #16367: URL: https://github.com/apache/datafusion/issues/16367#issuecomment-3000103132
### Need for a Custom Batch Writer?

#### 1. `concat_batches` before writing?

I ran a quick local test where, instead of writing one batch at a time with the current `IPCStreamWriter`, I concatenated multiple batches with `concat_batches` before writing them. In my local environment, this made no noticeable difference in compression ratio. Maybe that's because compression happens at the buffer level for each column (i.e., values of the same column are already grouped together), or perhaps because each record batch already holds 8192 rows, so a single batch may already be large relative to the compression window.

Still, although concatenating batches adds some memory-copy overhead, it might affect I/O bandwidth or reduce the number of system calls, so I think it's worth investigating further.

#### 2. Comet's implementation

I looked into why Comet introduced a custom batch writer and reviewed the related PR. The main reasons their implementation improved performance were:

(a) Their previous approach duplicated the schema for each batch, which the new implementation avoids.

(b) They don't use FlatBuffer encoding, so there is no alignment or metadata overhead.

In our case, though, `IPCStreamWriter` already writes the schema only once, when the writer is created, so we probably won't see the same benefit from (a).

I haven't had a chance to look closely at the Vortex side yet. If I come across any interesting experimental results or ideas worth sharing, I'll follow up later.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
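As a rough illustration of point (a) in the comment above, the sketch below compares the bytes produced when a schema is written before every batch with the bytes produced when it is written once per stream. It is a toy model only: the length-prefixed framing, the function names (`write_message`, `schema_per_batch`, `schema_once`), and the 256-byte schema and 1024-byte batch sizes are all assumptions made for the example, not the Arrow IPC format and not Comet's or DataFusion's actual code.

```rust
/// Toy framing (hypothetical, NOT the Arrow IPC format): each message is a
/// 4-byte little-endian length prefix followed by the raw payload bytes.
fn write_message(out: &mut Vec<u8>, payload: &[u8]) {
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
}

/// Writes the schema in front of every batch (the duplicated-schema case).
fn schema_per_batch(schema: &[u8], batches: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    for batch in batches {
        write_message(&mut out, schema);
        write_message(&mut out, batch);
    }
    out
}

/// Writes the schema once at the start of the stream, then only batches.
fn schema_once(schema: &[u8], batches: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    write_message(&mut out, schema);
    for batch in batches {
        write_message(&mut out, batch);
    }
    out
}

fn main() {
    // Assumed sizes, chosen only to make the arithmetic easy to follow.
    let schema = vec![0u8; 256];
    let batches: Vec<Vec<u8>> = (0..100).map(|_| vec![1u8; 1024]).collect();

    let duplicated = schema_per_batch(&schema, &batches);
    let once = schema_once(&schema, &batches);

    println!("schema per batch: {} bytes", duplicated.len());
    println!("schema once:      {} bytes", once.len());
    println!("saved:            {} bytes", duplicated.len() - once.len());
}
```

In this model the saving is one framed schema per batch after the first, so it grows with the number of batches and with the schema size relative to the batch payload. A writer that already emits the schema once has nothing left to gain from this particular change.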
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org