Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Wes McKinney Tue, 03 Mar 2020 05:47:56 -0800

On Tue, Mar 3, 2020, 7:36 AM Fan Liya <liya.fa...@gmail.com> wrote:

> I am so glad to see this discussion, and I am willing to provide help from
> the Java side.
>
> In the proposal, I see the support for basic compression strategies
> (e.g. gzip, snappy).
> IMO, applying a single basic strategy is not likely to achieve performance
> improvement for most scenarios.
> The optimal compression strategy is often obtained by composing basic
> strategies and tuning parameters.
>
> I hope we can support such highly customized compression strategies.
>


I think anything much beyond trivial one-shot buffer-level compression is
probably out of the question for addition to the current "RecordBatch"
Flatbuffers type, because the additional metadata would add undesirable
bloat (which I would be against). If people have other ideas it would be
great to see exactly what you are thinking as far as changes to the
protocol files.

I'll try to assemble some examples to show the before/after results of
applying the simple strategy.


>
> Best,
> Liya Fan
>
>
>
> On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > If we want to use an HTTP header, it would be more of an Accept-Encoding
> > header, no?
> >
> > In any case, we would have to put non-standard values there (e.g. lz4),
> > so I'm not sure how desirable it is to repurpose HTTP headers for that,
> > rather than add some dedicated field to the Flight messages.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 03/03/2020 à 12:52, David Li a écrit :
> > > gRPC supports headers so for Flight, we could send essentially an Accept
> > > header and perhaps a Content-Type header.
> > >
> > > David
> > >
> > > On Mon, Mar 2, 2020, 23:15 Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > >> Hi Wes,
> > >> A few thoughts on this.  In general, I think it is a good idea.  But before
> > >> proceeding, I think the following points are worth discussing:
> > >> 1.  Does this actually improve throughput/latency for Flight? (I think you
> > >> mentioned you would follow up with benchmarks).
> > >> 2.  I think we should limit the number of supported compression schemes to
> > >> only 1 or 2.  I think the criteria for selection should be speed and native
> > >> implementations available across the widest possible set of languages.  As far
> > >> as I can tell, zstd only has bindings in Java via JNI, but my understanding is
> > >> it is probably the right type of compression for our use cases.  So I think
> > >> zstd + potentially 1 more.
> > >> 3.  Commitment from someone on the Java side to implement this.
> > >> 4.  This doesn't need to be coupled with this change per se, but for
> > >> something like Flight it would be good to have a standard mechanism for
> > >> negotiating server/client capabilities (e.g. client doesn't support
> > >> compression or only supports a subset).
> > >>
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>
> > >>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <anto...@python.org> wrote:
> > >>>>
> > >>>>
> > >>>> Le 01/03/2020 à 22:01, Wes McKinney a écrit :
> > >>>>> In the context of a "next version of the Feather format" ARROW-5510
> > >>>>> (which is consumed only by Python and R at the moment), I have been
> > >>>>> looking at compressing buffers using fast compressors like ZSTD when
> > >>>>> writing the RecordBatch bodies. This could be handled privately as an
> > >>>>> implementation detail of the Feather file, but since ZSTD compression
> > >>>>> could improve throughput in Flight, for example, I thought I would
> > >>>>> bring it up for discussion.
> > >>>>>
> > >>>>> I can see two simple compression strategies:
> > >>>>>
> > >>>>> * Compress the entire message body in one-shot, writing the result out
> > >>>>> with an 8-byte int64 prefix indicating the uncompressed size
> > >>>>> * Compress each non-zero-length constituent Buffer prior to writing to
> > >>>>> the body (and using the same uncompressed-length-prefix when writing
> > >>>>> the compressed buffer)
> > >>>>>
> > >>>>> The latter strategy is preferable for scenarios where we may project
> > >>>>> out only a few fields from a larger record batch (such as reading from
> > >>>>> a memory-mapped file).
> > >>>>
> > >>>> Agreed.  It may also allow using different compression strategies for
> > >>>> different kinds of buffers (for example a bytestream splitting strategy
> > >>>> for floats and doubles, or a delta encoding strategy for integers).
> > >>>
> > >>> If we wanted to allow for different compression to apply to different
> > >>> buffers, I think we will need a new Message type because this would
> > >>> inflate metadata sizes in a way that is not likely to be acceptable
> > >>> for the current uncompressed use case.
> > >>>
> > >>> Here is my strawman proposal
> > >>>
> > >>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
> > >>>
> > >>>>> Implementation could be accomplished by one of the following methods:
> > >>>>>
> > >>>>> * Setting a field in Message.custom_metadata
> > >>>>> * Adding a new field to Message
> > >>>>
> > >>>> I think it has to be a new field in Message.  Making it an ignorable
> > >>>> metadata field means non-supporting receivers will decode and interpret
> > >>>> the data wrongly.
> > >>>>
> > >>>> Regards
> > >>>>
> > >>>> Antoine.
> > >>>
> > >>
> > >
> >
>
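
As a rough illustration of the second strategy quoted above (compress each
non-zero-length buffer and prefix it with its uncompressed size as an 8-byte
int64), here is a minimal Python sketch. It is not taken from the strawman
branch. zlib stands in for ZSTD so that the example runs with only the
standard library, the little-endian byte order of the prefix is an assumption
(the thread does not specify it), and the function names are hypothetical.

```python
import struct
import zlib

# Assumption: the 8-byte int64 length prefix is little-endian.
PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Compress one buffer, prepending its uncompressed length.

    Zero-length buffers are passed through untouched, matching the
    "non-zero-length constituent Buffer" wording in the proposal.
    """
    if len(buf) == 0:
        return b""
    # zlib is only a stand-in here; the thread discusses ZSTD.
    return PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(data: bytes) -> bytes:
    """Invert compress_buffer, checking the recorded uncompressed length."""
    if len(data) == 0:
        return b""
    (size,) = PREFIX.unpack_from(data, 0)
    out = zlib.decompress(data[PREFIX.size:])
    if len(out) != size:
        raise ValueError("uncompressed size does not match the length prefix")
    return out


def compress_body(buffers):
    """Compress every buffer of a record batch body independently."""
    return [compress_buffer(b) for b in buffers]
```

Because each buffer is compressed on its own, a reader that projects out a
single field only needs to call decompress_buffer on that field's buffers and
can leave the rest of the body untouched.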
