On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Le 01/03/2020 à 22:01, Wes McKinney a écrit :
> > In the context of a "next version of the Feather format" ARROW-5510
> > (which is consumed only by Python and R at the moment), I have been
> > looking at compressing buffers using fast compressors like ZSTD when
> > writing the RecordBatch bodies. This could be handled privately as an
> > implementation detail of the Feather file, but since ZSTD compression
> > could improve throughput in Flight, for example, I thought I would
> > bring it up for discussion.
> >
> > I can see two simple compression strategies:
> >
> > * Compress the entire message body in one-shot, writing the result out
> > with an 8-byte int64 prefix indicating the uncompressed size
> > * Compress each non-zero-length constituent Buffer prior to writing to
> > the body (and using the same uncompressed-length-prefix when writing
> > the compressed buffer)
> >
> > The latter strategy is preferable for scenarios where we may project
> > out only a few fields from a larger record batch (such as reading from
> > a memory-mapped file).
>
> Agreed. It may also allow using different compression strategies for
> different kinds of buffers (for example a bytestream splitting strategy
> for floats and doubles, or a delta encoding strategy for integers).

If we wanted to allow for different compression to apply to different
buffers, I think we will need a new Message type because this would
inflate metadata sizes in a way that is not likely to be acceptable
for the current uncompressed use case.

Here is my strawman proposal

https://github.com/apache/arrow/compare/master...wesm:compression-strawman

> > Implementation could be accomplished by one of the following methods:
> >
> > * Setting a field in Message.custom_metadata
> > * Adding a new field to Message
>
> I think it has to be a new field in Message. Making it an ignorable
> metadata field means non-supporting receivers will decode and interpret
> the data wrongly.
>
> Regards
>
> Antoine.
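
For illustration, the per-buffer strategy quoted above (each
non-zero-length buffer compressed on its own, with an 8-byte int64
uncompressed-length prefix) could be sketched roughly as follows. This
is not the strawman implementation: zlib stands in for ZSTD so that the
example needs only the Python standard library, the little-endian
prefix is an assumption, Arrow's buffer padding and alignment are
omitted, and compress_buffer, decompress_buffer, write_body,
read_buffer and the layout list are hypothetical names. The last lines
show what a receiver unaware of the compression would read.

```python
import struct
import zlib  # stand-in codec (assumption); the thread discusses ZSTD

# 8-byte int64 uncompressed-length prefix; little-endian is an assumption
PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Compress one buffer and prepend its uncompressed size.

    Zero-length buffers are written as-is, with no prefix.
    """
    if len(buf) == 0:
        return buf
    return PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(blob: bytes) -> bytes:
    """Inverse of compress_buffer; checks the result against the prefix."""
    if len(blob) == 0:
        return blob
    (size,) = PREFIX.unpack_from(blob, 0)
    out = zlib.decompress(blob[PREFIX.size:])
    if len(out) != size:
        raise ValueError("uncompressed length does not match prefix")
    return out


def write_body(buffers):
    """Compress each buffer separately; return the body and (offset, length) pairs."""
    body = bytearray()
    layout = []
    for buf in buffers:
        blob = compress_buffer(buf)
        layout.append((len(body), len(blob)))
        body += blob
    return bytes(body), layout


def read_buffer(body, layout, index):
    """Decompress a single buffer without touching the others."""
    offset, length = layout[index]
    return decompress_buffer(body[offset:offset + length])


# Hypothetical record batch body: an empty validity buffer, 1000 int64
# values, and some string data.
values = struct.pack("<1000q", *range(1000))
text = b"feather" * 500
body, layout = write_body([b"", values, text])

# Projection: only the third buffer is decompressed.
projected = read_buffer(body, layout, 2)

# A receiver that ignores the compression and reads buffer 1 as raw int64
# data sees the length prefix where element 0 should be.
offset, length = layout[1]
first_value = struct.unpack_from("<q", body, offset)[0]
```

Because each buffer is compressed independently, reading one field
only requires decompressing that field's bytes. The naive read at the
end yields 8000 (the length prefix) where the real first value is 0,
which is the kind of silent misinterpretation that an ignorable
metadata field would permit.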