On Sun, Mar 1, 2020 at 3:01 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> In the context of a "next version of the Feather format" (ARROW-5510,
> which is consumed only by Python and R at the moment), I have been
> looking at compressing buffers with fast compressors like ZSTD when
> writing the RecordBatch bodies. This could be handled as a private
> implementation detail of the Feather file, but since ZSTD compression
> could also improve throughput in Flight, for example, I thought I
> would bring it up for discussion.

I should also add that I'm nearly done implementing this for
experimentation purposes, which will let us collect benchmark data on
how compression affects Flight throughput for data with good
compression ratios.

> I can see two simple compression strategies:
>
> * Compress the entire message body in one shot, writing the result
> out with an 8-byte int64 prefix indicating the uncompressed size
> * Compress each non-zero-length constituent Buffer prior to writing
> it to the body, using the same uncompressed-length prefix when
> writing each compressed buffer
>
> The latter strategy is preferable for scenarios where we may project
> out only a few fields from a larger record batch (such as when
> reading from a memory-mapped file), since only the buffers that are
> actually read need to be decompressed.
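
For concreteness, here is a minimal sketch of the per-buffer framing,
using the zstandard Python package (the names are illustrative, not
what the implementation actually uses):

    import struct

    import zstandard  # pip install zstandard

    PREFIX = struct.Struct("<q")  # 8-byte little-endian int64

    def compress_buffer(buf: bytes) -> bytes:
        # Write the uncompressed length first so a reader can allocate
        # the output in one shot before decompressing.
        return PREFIX.pack(len(buf)) + zstandard.ZstdCompressor().compress(buf)

    def decompress_buffer(framed: bytes) -> bytes:
        (uncompressed_len,) = PREFIX.unpack_from(framed)
        return zstandard.ZstdDecompressor().decompress(
            framed[PREFIX.size:], max_output_size=uncompressed_len)

The one-shot variant would apply the same framing once to the entire
message body; the per-buffer variant lets a projecting reader run
decompress_buffer only on the buffers it actually touches.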
>
> Implementation could be accomplished by one of the following methods:
>
> * Setting a field in Message.custom_metadata
> * Adding a new field to Message
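
If we went the custom_metadata route, the writer side could be as
small as attaching a single key/value pair; the key name here is
purely hypothetical:

    # Hypothetical metadata entry (the key name is not a spec
    # proposal); readers would check for this key to learn that the
    # body buffers are length-prefixed and ZSTD-compressed.
    custom_metadata = {b"ARROW:body_compression": b"zstd"}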
>
> There have been past discussions about standardizing encodings and
> allowing for sparse data representations, and compression could get
> rolled into that effort. Still, I think there would be value in a
> very simple one-shot compression option for record batch bodies, so
> I don't see the two initiatives as being in conflict.
>
> If this is of interest, it would be important to add it to the
> columnar specification ASAP for forward compatibility reasons: any
> implementation that does not want to implement decompression right
> away can at least raise an error saying "this isn't supported".
>
> thanks
> Wes
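
On the forward-compatibility point above, the reader-side dispatch
could look something like this sketch (it reuses decompress_buffer
from earlier; the registry and function names are illustrative):

    DECOMPRESSORS = {
        b"zstd": decompress_buffer,
    }

    def read_body_buffer(codec: bytes, framed: bytes) -> bytes:
        decompress = DECOMPRESSORS.get(codec)
        if decompress is None:
            # An implementation without decompression support fails
            # loudly instead of misreading compressed bytes as raw data.
            raise NotImplementedError(
                "body compression %r is not supported" % codec)
        return decompress(framed)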
