[DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Wes McKinney Sun, 01 Mar 2020 13:02:11 -0800

In the context of a "next version of the Feather format" (ARROW-5510),
which is consumed only by Python and R at the moment, I have been
looking at compressing buffers using fast compressors like ZSTD when
writing the RecordBatch bodies. This could be handled privately as an
implementation detail of the Feather file, but since ZSTD compression
could also improve throughput in Flight, for example, I thought I would
bring it up for discussion.


I can see two simple compression strategies:

* Compress the entire message body in one shot, writing the result out
with an 8-byte int64 prefix indicating the uncompressed size
* Compress each non-zero-length constituent Buffer prior to writing it
to the body, using the same uncompressed-length prefix when writing
each compressed buffer

The latter strategy is preferable for scenarios where we may project
out only a few fields from a larger record batch (such as reading from
a memory-mapped file).
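
To make the per-buffer strategy concrete, here is a rough sketch in
Python. It is only an illustration under several assumptions: zlib from
the standard library stands in for ZSTD, the int64 prefix is written
little-endian, buffer alignment and padding are ignored, and the helper
names (compress_buffer, write_body, read_buffer) and the
(offset, length) layout list are made up for this example rather than
taken from any Arrow implementation.

```python
import struct
import zlib  # stand-in codec for this sketch; the proposal is about ZSTD

# Assumption: the 8-byte uncompressed-size prefix is a little-endian int64.
PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Compress one buffer, prefixed with its uncompressed length."""
    if len(buf) == 0:
        # Zero-length buffers are written as-is, with no prefix.
        return b""
    return PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(data: bytes) -> bytes:
    """Inverse of compress_buffer."""
    if len(data) == 0:
        return b""
    (uncompressed_size,) = PREFIX.unpack_from(data, 0)
    out = zlib.decompress(data[PREFIX.size:])
    if len(out) != uncompressed_size:
        raise ValueError("uncompressed size does not match the prefix")
    return out


def write_body(buffers):
    """Concatenate compressed buffers; return the body and (offset, length) pairs."""
    body = bytearray()
    layout = []
    for buf in buffers:
        chunk = compress_buffer(buf)
        layout.append((len(body), len(chunk)))
        body += chunk
    return bytes(body), layout


def read_buffer(body: bytes, layout, index: int) -> bytes:
    """Decompress a single buffer without touching the others."""
    offset, length = layout[index]
    return decompress_buffer(body[offset:offset + length])
```

Because each buffer is compressed independently, read_buffer(body,
layout, 2) only decompresses the bytes for that one buffer, which is
the property that matters when projecting a few fields out of a larger
record batch.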

Implementation could be accomplished by one of the following methods:

* Setting a field in Message.custom_metadata
* Adding a new field to Message

There have been past discussions about standardizing encodings and
allowing for sparse data representations, so compression could get
rolled up in that, but I still think there would be value in having a
very simple one-shot compression option for record batch bodies, so I
don't think the initiatives are in conflict with each other.

If this were of interest, it would be important to add this to the
columnar specification ASAP for forward compatibility reasons, and any
implementation that does not want to implement decompression right
away can at least raise an error to say "this isn't supported".
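
As a sketch of what "raise an error" could look like on the reader
side, assuming the compression setting ends up as a key in
Message.custom_metadata: the key name "compression", the codec strings,
and the body_codec helper below are all hypothetical and only meant to
show the shape of the check.

```python
# Hypothetical metadata key; the actual name (or a dedicated Message
# field) would be whatever the columnar specification ends up defining.
COMPRESSION_KEY = "compression"

# An implementation that has not added decompression yet supports nothing.
SUPPORTED_CODECS = frozenset()


def body_codec(custom_metadata, supported=SUPPORTED_CODECS):
    """Return the codec named in the metadata, or None if the body is uncompressed.

    Raises NotImplementedError for a codec this reader cannot decode, so
    that compressed bodies are never misread as plain buffers.
    """
    codec = custom_metadata.get(COMPRESSION_KEY)
    if codec is None:
        return None
    if codec not in supported:
        raise NotImplementedError(
            f"record batch body compression {codec!r} isn't supported"
        )
    return codec
```

A reader would call this before touching the message body, so an older
implementation fails loudly instead of interpreting compressed bytes as
raw buffers.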

thanks
Wes
