Hi Wes,

A few thoughts on this. In general, I think it is a good idea. But before proceeding, I think the following points are worth discussing:

1. Does this actually improve throughput/latency for Flight? (I think you mentioned you would follow up with benchmarks.)
2. I think we should limit the number of supported compression schemes to only 1 or 2. I think the criteria for selection should be speed and the availability of native implementations across the widest possible range of languages. As far as I can tell, zstd only has bindings in Java via JNI, but my understanding is that it is probably the right type of compression for our use cases. So I think zstd + potentially 1 more.
3. Commitment from someone on the Java side to implement this.
4. This doesn't need to be coupled with this change per se, but for something like Flight it would be good to have a standard mechanism for negotiating server/client capabilities (e.g. the client doesn't support compression or only supports a subset).
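
To make point 4 a bit more concrete, here is a rough sketch of what capability negotiation could look like. Everything in it is hypothetical: the function name, the codec strings, and the idea of a server-side preference list are assumptions for illustration, not anything that exists in Flight today. "lz4" simply stands in for the "potentially 1 more" codec.

```python
from typing import Iterable, Optional, Sequence

# Hypothetical codec names in server preference order; these are
# illustrative and not taken from any Arrow or Flight specification.
SERVER_PREFERENCE: Sequence[str] = ("zstd", "lz4")


def negotiate_codec(
    client_supported: Iterable[str],
    server_preference: Sequence[str] = SERVER_PREFERENCE,
) -> Optional[str]:
    """Return the first server-preferred codec the client also supports.

    None means "send uncompressed", which covers clients that do not
    support compression at all.
    """
    client = set(client_supported)
    for codec in server_preference:
        if codec in client:
            return codec
    return None
```

A client that advertises nothing falls back to uncompressed data, and a client that supports only a subset gets the best codec both sides share.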

Thanks,
Micah

On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > On 01/03/2020 at 22:01, Wes McKinney wrote:
> > > In the context of a "next version of the Feather format" ARROW-5510
> > > (which is consumed only by Python and R at the moment), I have been
> > > looking at compressing buffers using fast compressors like ZSTD when
> > > writing the RecordBatch bodies. This could be handled privately as an
> > > implementation detail of the Feather file, but since ZSTD compression
> > > could improve throughput in Flight, for example, I thought I would
> > > bring it up for discussion.
> > >
> > > I can see two simple compression strategies:
> > >
> > > * Compress the entire message body in one-shot, writing the result out
> > > with an 8-byte int64 prefix indicating the uncompressed size
> > > * Compress each non-zero-length constituent Buffer prior to writing to
> > > the body (and using the same uncompressed-length-prefix when writing
> > > the compressed buffer)
> > >
> > > The latter strategy is preferable for scenarios where we may project
> > > out only a few fields from a larger record batch (such as reading from
> > > a memory-mapped file).
> >
> > Agreed. It may also allow using different compression strategies for
> > different kinds of buffers (for example a bytestream splitting strategy
> > for floats and doubles, or a delta encoding strategy for integers).
>
> If we wanted to allow for different compression to apply to different
> buffers, I think we will need a new Message type because this would
> inflate metadata sizes in a way that is not likely to be acceptable
> for the current uncompressed use case.
>
> Here is my strawman proposal
>
> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
>
> > > Implementation could be accomplished by one of the following methods:
> > >
> > > * Setting a field in Message.custom_metadata
> > > * Adding a new field to Message
> >
> > I think it has to be a new field in Message. Making it an ignorable
> > metadata field means non-supporting receivers will decode and interpret
> > the data wrongly.
> >
> > Regards
> >
> > Antoine.
>
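
For reference, the second strategy in the quoted thread (compress each non-zero-length buffer and prefix it with its uncompressed length as an 8-byte int64) might look roughly like the sketch below. Several things here are assumptions: `zlib` from the standard library stands in for ZSTD so the example runs without extra dependencies, the prefix is taken to be little-endian, zero-length buffers are written without any prefix, and all function names are hypothetical.

```python
import struct
import zlib
from typing import List

# 8-byte signed int64 length prefix; little-endian is an assumption.
PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Compress one buffer, prefixed with its uncompressed length."""
    if len(buf) == 0:
        # Zero-length buffers are left alone (assumed: no prefix either).
        return b""
    # zlib is only a stand-in for ZSTD in this sketch.
    return PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(data: bytes) -> bytes:
    """Reverse compress_buffer, checking the recorded length."""
    if len(data) == 0:
        return b""
    (size,) = PREFIX.unpack_from(data, 0)
    out = zlib.decompress(data[PREFIX.size:])
    if len(out) != size:
        raise ValueError("uncompressed length does not match prefix")
    return out


def compress_body(buffers: List[bytes]) -> List[bytes]:
    """Compress every constituent buffer of a record batch body independently."""
    return [compress_buffer(b) for b in buffers]
```

Because each buffer is compressed on its own, a reader that projects out a single field only has to call `decompress_buffer` on that field's buffers, which is the advantage the thread attributes to this strategy.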