Hi Wes,
A few thoughts on this.  In general, I think it is a good idea.  But before
proceeding, I think the following points are worth discussing:
1.  Does this actually improve throughput/latency for Flight?  (I think you
mentioned you would follow up with benchmarks.)
2.  I think we should limit the number of supported compression schemes to
only 1 or 2.  The criteria for selection should be speed and the
availability of native implementations across the widest possible range of
languages.  As far as I can tell zstd only has bindings in Java via JNI,
but my understanding is it is probably the right type of compression for
our use-cases.  So I think zstd + potentially 1 more.
3.  Commitment from someone on the Java side to implement this.
4.  This doesn't need to be coupled with this change per se, but for
something like Flight it would be good to have a standard mechanism for
negotiating server/client capabilities (e.g. the client doesn't support
compression or only supports a subset).  A toy sketch of what that
negotiation could look like is included below this list.
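
To make point 4 concrete, here is a purely illustrative sketch, in Java, of
the shape the negotiation could take.  None of these types or names come
from Arrow or Flight; the idea is simply that the client advertises the
codecs it can decode and the server picks from its own preference order,
with uncompressed as the guaranteed fallback.

import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class CompressionNegotiationSketch {
  // Hypothetical codec identifiers; not part of any Arrow/Flight API.
  enum Codec { UNCOMPRESSED, ZSTD, LZ4 }

  // Server-side selection: walk the server's preference order and return
  // the first codec the client advertised; uncompressed is the safe
  // fallback if there is no overlap.
  static Codec chooseCodec(Set<Codec> clientSupports,
                           List<Codec> serverPreference) {
    for (Codec candidate : serverPreference) {
      if (clientSupports.contains(candidate)) {
        return candidate;
      }
    }
    return Codec.UNCOMPRESSED;
  }

  public static void main(String[] args) {
    Set<Codec> clientSupports = EnumSet.of(Codec.UNCOMPRESSED, Codec.ZSTD);
    List<Codec> serverPreference = Arrays.asList(Codec.LZ4, Codec.ZSTD);
    // Prints ZSTD: the server prefers LZ4 but the client cannot decode it.
    System.out.println(chooseCodec(clientSupports, serverPreference));
  }
}

However the capability exchange is actually surfaced (handshake, headers,
etc.), the useful property is that an uncompressed fallback always exists,
so clients that don't implement compression keep working.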


Thanks,
Micah

On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > > On 01/03/2020 at 22:01, Wes McKinney wrote:
> > > In the context of a "next version of the Feather format" ARROW-5510
> > > (which is consumed only by Python and R at the moment), I have been
> > > looking at compressing buffers using fast compressors like ZSTD when
> > > writing the RecordBatch bodies. This could be handled privately as an
> > > implementation detail of the Feather file, but since ZSTD compression
> > > could improve throughput in Flight, for example, I thought I would
> > > bring it up for discussion.
> > >
> > > I can see two simple compression strategies:
> > >
> > > * Compress the entire message body in one-shot, writing the result out
> > > with an 8-byte int64 prefix indicating the uncompressed size
> > > * Compress each non-zero-length constituent Buffer prior to writing to
> > > the body (and using the same uncompressed-length-prefix when writing
> > > the compressed buffer)
> > >
> > > The latter strategy is preferable for scenarios where we may project
> > > out only a few fields from a larger record batch (such as reading from
> > > a memory-mapped file).
> >
> > Agreed.  It may also allow using different compression strategies for
> > different kinds of buffers (for example a bytestream splitting strategy
> > for floats and doubles, or a delta encoding strategy for integers).
>
> If we wanted to allow for different compression to apply to different
> buffers, I think we will need a new Message type because this would
> inflate metadata sizes in a way that is not likely to be acceptable
> for the current uncompressed use case.
>
> Here is my strawman proposal
>
> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
>
> > > Implementation could be accomplished by one of the following methods:
> > >
> > > * Setting a field in Message.custom_metadata
> > > * Adding a new field to Message
> >
> > I think it has to be a new field in Message.  Making it an ignorable
> > metadata field means non-supporting receivers will decode and interpret
> > the data wrongly.
> >
> > Regards
> >
> > Antoine.
>
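
For concreteness, here is a minimal sketch (Java, using the zstd-jni
bindings) of the per-buffer strategy Wes describes above: each non-empty
buffer is written as an 8-byte little-endian uncompressed-length prefix
followed by the ZSTD-compressed bytes.  Everything outside the Zstd calls
is invented for illustration, and alignment/padding of the body is
ignored here.

import com.github.luben.zstd.Zstd;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BufferCompressionSketch {
  // Frame one buffer as [int64 uncompressed length][zstd-compressed bytes].
  static byte[] frameBuffer(byte[] uncompressed) {
    byte[] compressed = Zstd.compress(uncompressed);
    ByteBuffer out = ByteBuffer.allocate(8 + compressed.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    out.putLong(uncompressed.length);
    out.put(compressed);
    return out.array();
  }

  // Inverse: read the length prefix, then decompress the rest of the frame.
  static byte[] unframeBuffer(byte[] framed) {
    ByteBuffer in = ByteBuffer.wrap(framed).order(ByteOrder.LITTLE_ENDIAN);
    long uncompressedLength = in.getLong();
    byte[] compressed = new byte[framed.length - 8];
    in.get(compressed);
    return Zstd.decompress(compressed, (int) uncompressedLength);
  }
}

Keeping the length prefix per buffer is what preserves the projection
benefit mentioned above: buffers for fields that aren't read never have to
be decompressed.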
