Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Fan Liya Fri, 06 Mar 2020 06:43:45 -0800

Hi Wes,

Thanks a lot for the additional information.
Looking forward to seeing the good results from your experiments.


Best,
Liya Fan

On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney <[email protected]> wrote:

> I see, thank you.
>
> For such a scenario, implementations would need to define a
> "UserDefinedCodec" interface to enable codecs to be registered from
> third party code, similar to what is done for extension types [1]
>
> I'll update this thread when I get my experimental C++ patch up to see
> what I'm thinking at least for the built-in codecs we have like ZSTD.
>
>
> [1] https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types
>
> On Thu, Mar 5, 2020 at 7:56 AM Fan Liya <[email protected]> wrote:
> >
> > Hi Wes,
> >
> > Thanks a lot for your further clarification.
> >
> > Some of my preliminary thoughts:
> >
> > 1. We assign a unique GUID to each pair of compression/decompression
> > strategies. The GUID is stored as part of the Message.custom_metadata.
> > When receiving the GUID, the receiver knows which decompression strategy
> > to use.
> >
> > 2. We serialize the decompression strategy, and store it into the
> > Message.custom_metadata. The receiver can decompress data after
> > deserializing the strategy.
> >
> > Method 1 is generally used in static strategy scenarios while method 2 is
> > generally used in dynamic strategy scenarios.
> >
> > Best,
> > Liya Fan
> >
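A minimal sketch of method 1 above, under stated assumptions: the registry, the `compression_guid` metadata key, and the function names are hypothetical and are not part of any Arrow API. The standard-library zlib codec stands in for a user-defined strategy, and a plain dict stands in for Message.custom_metadata.

```python
import uuid
import zlib
from typing import Callable, Dict

# Hypothetical registry mapping a GUID string to a decompression function.
_DECOMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {}


def register_decompressor(guid: str, fn: Callable[[bytes], bytes]) -> None:
    """Associate a GUID with the function that reverses the compression."""
    _DECOMPRESSORS[guid] = fn


def decompress_with_metadata(custom_metadata: Dict[str, str], body: bytes) -> bytes:
    """Pick the decompressor named by the (assumed) 'compression_guid' key."""
    guid = custom_metadata.get("compression_guid")
    if guid is None:
        # No GUID present: treat the body as uncompressed.
        return body
    if guid not in _DECOMPRESSORS:
        raise LookupError(f"no decompressor registered for GUID {guid}")
    return _DECOMPRESSORS[guid](body)


# Sender and receiver must agree on the GUID ahead of time (static strategy).
ZLIB_GUID = str(uuid.uuid5(uuid.NAMESPACE_DNS, "zlib.example.invalid"))
register_decompressor(ZLIB_GUID, zlib.decompress)

message_metadata = {"compression_guid": ZLIB_GUID}
restored = decompress_with_metadata(message_metadata, zlib.compress(b"arrow" * 100))
```

A receiver that has not registered the GUID fails with an explicit error instead of misreading the body, which is the failure mode a static registry has to handle.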
> > On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney <[email protected]> wrote:
> >
> > > Okay, I guess my question is how the receiver is going to be able to
> > > determine how to "rehydrate" the record batch buffers:
> > >
> > > What I've proposed amounts to the following:
> > >
> > > * UNCOMPRESSED: the current behavior
> > > * ZSTD/LZ4/...: each buffer is compressed and written with an int64
> > > length prefix
> > >
> > > (I'm close to putting up a PR implementing an experimental version of
> > > this that uses Message.custom_metadata to transmit the codec, so this
> > > will make the implementation details more concrete)
> > >
> > > So in the USER_DEFINED case, how will the library know how to obtain
> > > the uncompressed buffer? Is some additional metadata structure
> > > required to provide instructions?
> > >
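A minimal sketch of the length-prefixed buffer layout described above. It is illustrative only: zlib stands in for ZSTD/LZ4 because it ships with Python, the little-endian byte order is an assumption, and the prefix is taken to hold the uncompressed size, as stated in the original proposal quoted later in this thread. The function names are hypothetical.

```python
import struct
import zlib


def write_compressed_buffer(buf: bytes) -> bytes:
    """Return an 8-byte int64 uncompressed length followed by the compressed bytes."""
    # "<q" = little-endian signed 64-bit integer (byte order is an assumption).
    return struct.pack("<q", len(buf)) + zlib.compress(buf)


def read_compressed_buffer(data: bytes) -> bytes:
    """Read the length prefix, decompress the rest, and verify the size."""
    (uncompressed_len,) = struct.unpack_from("<q", data, 0)
    out = zlib.decompress(data[8:])
    if len(out) != uncompressed_len:
        raise ValueError("length prefix does not match decompressed size")
    return out


payload = bytes(1000)  # 1000 zero bytes, e.g. a sparse validity bitmap
framed = write_compressed_buffer(payload)
roundtrip = read_compressed_buffer(framed)
```

Knowing the uncompressed size up front lets a reader allocate the destination buffer before decompressing.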
> > > On Wed, Mar 4, 2020 at 8:05 AM Fan Liya <[email protected]> wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > I am thinking of adding an option named "USER_DEFINED" (or something
> > > > similar) to enum CompressionType in your proposal.
> > > > IMO, this option should be used primarily in Flight.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Wed, Mar 4, 2020 at 11:12 AM Wes McKinney <[email protected]> wrote:
> > > >
> > > > > On Tue, Mar 3, 2020, 8:11 PM Fan Liya <[email protected]> wrote:
> > > > >
> > > > > > Sure. I agree with you that we should not overdo this.
> > > > > > I am wondering if we should provide an option to allow users to
> > > > > > plug in their customized compression strategies.
> > > > > >
> > > > >
> > > > > Can you provide a patch showing changes to Message.fbs (or
> > > > > Schema.fbs) that make this idea more concrete?
> > > > >
> > > > >
> > > > > > Best,
> > > > > > Liya Fan
> > > > > >
> > > > > > On Tue, Mar 3, 2020 at 9:47 PM Wes McKinney <[email protected]> wrote:
> > > > > >
> > > > > > > On Tue, Mar 3, 2020, 7:36 AM Fan Liya <[email protected]> wrote:
> > > > > > >
> > > > > > > > I am so glad to see this discussion, and I am willing to
> > > > > > > > provide help from the Java side.
> > > > > > > >
> > > > > > > > In the proposal, I see the support for basic compression
> > > > > > > > strategies (e.g. gzip, snappy).
> > > > > > > > IMO, applying a single basic strategy is not likely to achieve
> > > > > > > > performance improvement for most scenarios.
> > > > > > > > The optimal compression strategy is often obtained by composing
> > > > > > > > basic strategies and tuning parameters.
> > > > > > > >
> > > > > > > > I hope we can support such highly customized compression
> > > > > > > > strategies.
> > > > > > > >
> > > > > > >
> > > > > > > I think anything much beyond trivial one-shot buffer-level
> > > > > > > compression is probably out of the question for addition to the
> > > > > > > current "RecordBatch" Flatbuffers type, because the additional
> > > > > > > metadata would add undesirable bloat (which I would be against).
> > > > > > > If people have other ideas it would be great to see exactly what
> > > > > > > you are thinking as far as changes to the protocol files.
> > > > > > >
> > > > > > > I'll try to assemble some examples to show the before/after
> > > > > > > results of applying the simple strategy.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Liya Fan
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <[email protected]> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > If we want to use an HTTP header, it would be more of an
> > > > > > > > > Accept-Encoding header, no?
> > > > > > > > >
> > > > > > > > > In any case, we would have to put non-standard values there
> > > > > > > > > (e.g. lz4), so I'm not sure how desirable it is to repurpose
> > > > > > > > > HTTP headers for that, rather than add some dedicated field to
> > > > > > > > > the Flight messages.
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > >
> > > > > > > > > Antoine.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 03/03/2020 at 12:52, David Li wrote:
> > > > > > > > > > gRPC supports headers so for Flight, we could send essentially
> > > > > > > > > > an Accept header and perhaps a Content-Type header.
> > > > > > > > > >
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > > On Mon, Mar 2, 2020, 23:15 Micah Kornfield <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > >> Hi Wes,
> > > > > > > > > >> A few thoughts on this.  In general, I think it is a good idea.  But
> > > > > > > > > >> before proceeding, I think the following points are worth discussing:
> > > > > > > > > >> 1.  Does this actually improve throughput/latency for Flight? (I think
> > > > > > > > > >> you mentioned you would follow up with benchmarks).
> > > > > > > > > >> 2.  I think we should limit the number of supported compression schemes
> > > > > > > > > >> to only 1 or 2.  I think the criteria for selection are speed and native
> > > > > > > > > >> implementations available across the widest possible languages.  As far
> > > > > > > > > >> as I can tell zstd only has bindings in Java via JNI, but my
> > > > > > > > > >> understanding is it is probably the type of compression for our
> > > > > > > > > >> use-cases.  So I think zstd + potentially 1 more.
> > > > > > > > > >> 3.  Commitment from someone on the Java side to implement this.
> > > > > > > > > >> 4.  This doesn't need to be coupled with this change per se, but for
> > > > > > > > > >> something like Flight it would be good to have a standard mechanism for
> > > > > > > > > >> negotiating server/client capabilities (e.g. client doesn't support
> > > > > > > > > >> compression or only supports a subset).
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> Thanks,
> > > > > > > > > >> Micah
> > > > > > > > > >>
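A minimal sketch of the capability negotiation raised in point 4, in the spirit of an Accept-Encoding exchange. The function name and the convention that `None` means "send uncompressed" are assumptions for illustration; nothing here is an existing Flight or gRPC API.

```python
from typing import Iterable, Optional


def negotiate_codec(client_accepts: Iterable[str],
                    server_supports: Iterable[str]) -> Optional[str]:
    """Return the first codec in the client's preference order that the
    server also supports, or None to fall back to uncompressed buffers."""
    supported = set(server_supports)
    for codec in client_accepts:
        if codec in supported:
            return codec
    return None


# Client prefers lz4 but the server only offers zstd.
chosen = negotiate_codec(["lz4", "zstd"], ["zstd"])
# Client that does not support compression at all.
fallback = negotiate_codec([], ["zstd", "lz4"])
```

Falling back to uncompressed keeps clients without compression support working, which is the case the point above calls out.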
> > > > > > > > > >> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <[email protected]> wrote:
> > > > > > > > > >>
> > > > > > > > > >>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <[email protected]> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>> On 01/03/2020 at 22:01, Wes McKinney wrote:
> > > > > > > > > >>>>> In the context of a "next version of the Feather format" ARROW-5510
> > > > > > > > > >>>>> (which is consumed only by Python and R at the moment), I have been
> > > > > > > > > >>>>> looking at compressing buffers using fast compressors like ZSTD when
> > > > > > > > > >>>>> writing the RecordBatch bodies. This could be handled privately as an
> > > > > > > > > >>>>> implementation detail of the Feather file, but since ZSTD compression
> > > > > > > > > >>>>> could improve throughput in Flight, for example, I thought I would
> > > > > > > > > >>>>> bring it up for discussion.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> I can see two simple compression strategies:
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> * Compress the entire message body in one-shot, writing the result out
> > > > > > > > > >>>>> with an 8-byte int64 prefix indicating the uncompressed size
> > > > > > > > > >>>>> * Compress each non-zero-length constituent Buffer prior to writing to
> > > > > > > > > >>>>> the body (and using the same uncompressed-length-prefix when writing
> > > > > > > > > >>>>> the compressed buffer)
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> The latter strategy is preferable for scenarios where we may project
> > > > > > > > > >>>>> out only a few fields from a larger record batch (such as reading from
> > > > > > > > > >>>>> a memory-mapped file).
> > > > > > > > > >>>>
> > > > > > > > > >>>> Agreed.  It may also allow using different compression strategies for
> > > > > > > > > >>>> different kinds of buffers (for example a bytestream splitting strategy
> > > > > > > > > >>>> for floats and doubles, or a delta encoding strategy for integers).
> > > > > > > > > >>>
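A minimal sketch of why a delta encoding step can help an integer buffer before general-purpose compression. The data is synthetic (evenly spaced int64 values), zlib stands in for the codecs discussed in this thread, and the helper names are hypothetical. It illustrates the idea only and is not a benchmark.

```python
import struct
import zlib
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Replace each value with its difference from the previous one."""
    prev = 0
    out = []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode with a running sum."""
    total = 0
    out = []
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack_int64(values: List[int]) -> bytes:
    """Serialize as little-endian int64, like a fixed-width integer buffer."""
    return struct.pack(f"<{len(values)}q", *values)


# Synthetic, evenly spaced values (e.g. timestamps sampled every 10 units).
timestamps = [1_583_000_000 + 10 * i for i in range(10_000)]
plain_size = len(zlib.compress(pack_int64(timestamps)))
delta_size = len(zlib.compress(pack_int64(delta_encode(timestamps))))
```

On this input the deltas are almost all identical, so `delta_size` comes out smaller than `plain_size`; real columns will vary.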
> > > > > > > > > >>> If we wanted to allow for different compression to apply to different
> > > > > > > > > >>> buffers, I think we will need a new Message type because this would
> > > > > > > > > >>> inflate metadata sizes in a way that is not likely to be acceptable
> > > > > > > > > >>> for the current uncompressed use case.
> > > > > > > > > >>>
> > > > > > > > > >>> Here is my strawman proposal
> > > > > > > > > >>>
> > > > > > > > > >>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
> > > > > > > > > >>>
> > > > > > > > > >>>>> Implementation could be accomplished by one of the following methods:
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> * Setting a field in Message.custom_metadata
> > > > > > > > > >>>>> * Adding a new field to Message
> > > > > > > > > >>>>
> > > > > > > > > >>>> I think it has to be a new field in Message.  Making it an ignorable
> > > > > > > > > >>>> metadata field means non-supporting receivers will decode and
> > > > > > > > > >>>> interpret the data wrongly.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Regards
> > > > > > > > > >>>>
> > > > > > > > > >>>> Antoine.
> > > > > > > > > >>>
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
>
