(I'm not sure why I can see only David's reply and not the original post.) I agree with David that this is not a Java issue. We might need more evidence to support new compression strategies.
Yunhong, do you have any experiment results to support your statement? IMHO, from the perspective of entropy, different buffers may have distinct redundancy, so compressing them all together may not be that effective. Simply increasing the number of rows per RecordBatch may help more. (For reference, a sketch of how per-buffer compression is currently wired up in VectorUnloader is appended at the bottom of this message.)

Best,
Gang

On Fri, Feb 21, 2025 at 8:11 AM David Li <lidav...@apache.org> wrote:
> Hi Yunhong,
>
> This isn't a Java issue. The spec for Arrow IPC only supports per-buffer
> compression [1]. It does mention other designs as a potential future
> improvement there. If you think it might be useful, it could be helpful to
> sketch a proposal and/or bring some benchmarks?
>
> Note that most vectors/arrays are only going to have the data buffer and
> maybe a validity buffer, so I'm not sure bundling them together will matter
> too much? Are there more details about the overhead you're seeing/your use
> case?
>
> [1]: https://github.com/apache/arrow/blob/20d8acd89f5ebf87295e08ed10e2f94cb03d57d0/format/Message.fbs#L55-L67
>
> Thanks,
> David
>
> On Wed, Feb 19, 2025, at 14:54, yh z wrote:
> > Hi all. Currently, in arrow-java, compressing an ArrowRecordBatch in
> > VectorUnloader compresses each ArrowBuf within the FieldVector
> > separately instead of compressing at the FieldVector level. From the
> > compression-ratio perspective, larger inputs generally compress better.
> > Additionally, calling compress(BufferAllocator allocator, ArrowBuf
> > uncompressedBuffer) multiple times may consume more CPU than calling it
> > once.
> > Therefore, I would like to ask whether there will be support for
> > compression at the FieldVector level, which could improve the
> > compression ratio without affecting the ability to read individual
> > columns.
> >
> > Many thanks,
> > Yunhong Zheng
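
For reference, a minimal sketch of how per-buffer compression is wired up today. It assumes the VectorUnloader constructor that takes a CompressionCodec and the CommonsCompressionFactory from the arrow-compression module, both present in recent arrow-java releases; the wrapper class name is just for illustration. A benchmark comparing per-buffer compression against larger batches (or a coarser-grained scheme) could start from something like this:

    import org.apache.arrow.compression.CommonsCompressionFactory; // arrow-compression module
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.VectorUnloader;
    import org.apache.arrow.vector.compression.CompressionCodec;
    import org.apache.arrow.vector.compression.CompressionUtil;
    import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

    public class PerBufferCompressionSketch {

      // Unloads a populated VectorSchemaRoot into an ArrowRecordBatch whose
      // buffers are each compressed individually with ZSTD. VectorUnloader
      // invokes codec.compress(allocator, buffer) once per ArrowBuf (validity,
      // offsets, data, ...), which is the granularity the IPC spec defines.
      static ArrowRecordBatch unloadCompressed(VectorSchemaRoot root) {
        CompressionCodec codec =
            CommonsCompressionFactory.INSTANCE.createCodec(CompressionUtil.CodecType.ZSTD);
        VectorUnloader unloader =
            new VectorUnloader(root, /*includeNullCount=*/ true, codec, /*alignBuffers=*/ true);
        return unloader.getRecordBatch();
      }
    }

Running this over the same data with different row counts per batch, and recording the compressed buffer sizes, would give concrete numbers on how much batch size alone helps versus changing the compression granularity.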