Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Fan Liya Tue, 03 Mar 2020 18:12:25 -0800

Sure. I agree with you that we should not overdo this.
I am wondering if we should provide an option to allow users to plug in
their customized compression strategies.


Best,
Liya Fan

On Tue, Mar 3, 2020 at 9:47 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Tue, Mar 3, 2020, 7:36 AM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > I am so glad to see this discussion, and I am willing to provide help from
> > the Java side.
> >
> > In the proposal, I see the support for basic compression strategies
> > (e.g. gzip, snappy).
> > IMO, applying a single basic strategy is not likely to achieve a
> > performance improvement for most scenarios.
> > The optimal compression strategy is often obtained by composing basic
> > strategies and tuning parameters.
> >
> > I hope we can support such highly customized compression strategies.
> >
>
> I think anything much beyond trivial one-shot buffer-level compression is
> probably out of the question for addition to the current "RecordBatch"
> Flatbuffers type, because the additional metadata would add undesirable
> bloat (which I would be against). If people have other ideas it would be
> great to see exactly what you are thinking as far as changes to the
> protocol files.
>
> I'll try to assemble some examples to show the before/after results of
> applying the simple strategy.
>
>
> >
> > Best,
> > Liya Fan
> >
> >
> >
> > On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > If we want to use an HTTP header, it would be more of an Accept-Encoding
> > > header, no?
> > >
> > > In any case, we would have to put non-standard values there (e.g. lz4),
> > > so I'm not sure how desirable it is to repurpose HTTP headers for that,
> > > rather than add some dedicated field to the Flight messages.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 03/03/2020 at 12:52, David Li wrote:
> > > > gRPC supports headers so for Flight, we could send essentially an Accept
> > > > header and perhaps a Content-Type header.
> > > >
> > > > David
> > > >
> > > > On Mon, Mar 2, 2020, 23:15 Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >
> > > >> Hi Wes,
> > > >> A few thoughts on this.  In general, I think it is a good idea.  But before
> > > >> proceeding, I think the following points are worth discussing:
> > > >> 1.  Does this actually improve throughput/latency for Flight? (I think you
> > > >> mentioned you would follow up with benchmarks).
> > > >> 2.  I think we should limit the number of supported compression schemes to
> > > >> only 1 or 2.  I think the criteria for selection should be speed and native
> > > >> implementations available across the widest possible range of languages.  As
> > > >> far as I can tell, zstd only has bindings in Java via JNI, but my
> > > >> understanding is it is probably the type of compression for our use-cases.
> > > >> So I think zstd + potentially 1 more.
> > > >> 3.  Commitment from someone on the Java side to implement this.
> > > >> 4.  This doesn't need to be coupled with this change per se, but for
> > > >> something like Flight it would be good to have a standard mechanism for
> > > >> negotiating server/client capabilities (e.g. client doesn't support
> > > >> compression or only supports a subset).
> > > >>
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >>
> > > >>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <anto...@python.org> wrote:
> > > >>>>
> > > >>>>
> > > >>>> On 01/03/2020 at 22:01, Wes McKinney wrote:
> > > >>>>> In the context of a "next version of the Feather format" ARROW-5510
> > > >>>>> (which is consumed only by Python and R at the moment), I have been
> > > >>>>> looking at compressing buffers using fast compressors like ZSTD when
> > > >>>>> writing the RecordBatch bodies. This could be handled privately as an
> > > >>>>> implementation detail of the Feather file, but since ZSTD compression
> > > >>>>> could improve throughput in Flight, for example, I thought I would
> > > >>>>> bring it up for discussion.
> > > >>>>>
> > > >>>>> I can see two simple compression strategies:
> > > >>>>>
> > > >>>>> * Compress the entire message body in one shot, writing the result out
> > > >>>>> with an 8-byte int64 prefix indicating the uncompressed size
> > > >>>>> * Compress each non-zero-length constituent Buffer prior to writing to
> > > >>>>> the body (and using the same uncompressed-length-prefix when writing
> > > >>>>> the compressed buffer)
> > > >>>>>
> > > >>>>> The latter strategy is preferable for scenarios where we may project
> > > >>>>> out only a few fields from a larger record batch (such as reading from
> > > >>>>> a memory-mapped file).
> > > >>>>
> > > >>>> Agreed.  It may also allow using different compression strategies for
> > > >>>> different kinds of buffers (for example a bytestream splitting strategy
> > > >>>> for floats and doubles, or a delta encoding strategy for integers).
> > > >>>
> > > >>> If we wanted to allow for different compression to apply to different
> > > >>> buffers, I think we will need a new Message type because this would
> > > >>> inflate metadata sizes in a way that is not likely to be acceptable
> > > >>> for the current uncompressed use case.
> > > >>>
> > > >>> Here is my strawman proposal
> > > >>>
> > > >>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
> > > >>>
> > > >>>>> Implementation could be accomplished by one of the following methods:
> > > >>>>>
> > > >>>>> * Setting a field in Message.custom_metadata
> > > >>>>> * Adding a new field to Message
> > > >>>>
> > > >>>> I think it has to be a new field in Message.  Making it an ignorable
> > > >>>> metadata field means non-supporting receivers will decode and interpret
> > > >>>> the data wrongly.
> > > >>>>
> > > >>>> Regards
> > > >>>>
> > > >>>> Antoine.
> > > >>>
> > > >>
> > > >
> > >
> >
>
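
As an illustration of the second strategy described in the thread (compress
each non-zero-length buffer and prefix it with an 8-byte int64 giving the
uncompressed size), here is a minimal Python sketch. It is not the
implementation from the strawman proposal: zlib stands in for ZSTD so that
only the standard library is needed, the little-endian byte order of the
prefix is an assumption, and the names compress_buffer, decompress_buffer
and compress_body are hypothetical.

```python
import struct
import zlib

# Assumption: the prefix is a little-endian signed 64-bit integer.
# The thread only says "8-byte int64"; the byte order is a guess.
PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Compress one buffer; zero-length buffers are written unchanged."""
    if len(buf) == 0:
        return buf
    # zlib is a stand-in for ZSTD here (assumption, not the proposal's codec).
    return PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(data: bytes) -> bytes:
    """Invert compress_buffer, checking the size recorded in the prefix."""
    if len(data) == 0:
        return data
    (size,) = PREFIX.unpack_from(data, 0)
    out = zlib.decompress(data[PREFIX.size:])
    if len(out) != size:
        raise ValueError("uncompressed size does not match the prefix")
    return out


def compress_body(buffers):
    """Compress every constituent buffer of a record batch body independently."""
    return [compress_buffer(b) for b in buffers]
```

Because each buffer is compressed on its own, a reader that projects a single
field only has to call decompress_buffer on that field's buffers and can skip
the rest of the body.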
