Here are the results:

File size: https://ibb.co/71sBsg3
Read time: https://ibb.co/4ZncdF8
Write time: https://ibb.co/xhNkRS2

Code: 
https://github.com/wesm/notebooks/blob/master/20190919file_benchmarks/FeatherCompression.ipynb
(based on https://github.com/apache/arrow/pull/6694)

High level summary:

* Chunksize (1,024 versus 64K rows) has a relatively limited impact on
file sizes (a rough pyarrow sketch of the benchmark grid follows this
list)

* Wall-clock read time is affected by chunksize, with maybe a 30-40%
difference between 1K-row chunks and 16K-row chunks. One notable
thing is that you can clearly see the overhead associated with IPC
reconstruction even when the data is memory-mapped. For example, in
the Fannie Mae dataset there are 21,661 batches (each batch has 31
fields) when the chunksize is 1024, so a read time of 1.3 seconds
works out to ~60 microseconds of overhead per record batch. When you
consider the amount of business logic involved in reconstructing a
record batch, 60 microseconds is pretty good. This also shows that
every microsecond counts and that we need to carefully track
microperformance in this critical operation.

* A small chunksize results in higher write times for "expensive" codecs
like ZSTD that achieve a high compression ratio. For "cheap" codecs like
LZ4 it doesn't make as much of a difference

* Note that the LZ4 codec results in faster wall-clock time to disk
than writing uncompressed, presumably because LZ4 compresses faster
than my SSD can write the extra bytes
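
To make the grid above concrete, here is a rough pyarrow sketch (not
the linked notebook itself). It assumes the Feather V2 API from the PR
above, i.e. write_feather with `compression` and `chunksize` arguments;
the table contents and file paths are placeholders, and timings will of
course differ from the results linked above.

import os
import time
import pyarrow as pa
import pyarrow.feather as feather

# Stand-in table; the real runs used the Taxi and Fannie Mae datasets
table = pa.table({"x": list(range(1_000_000))})

for codec in ["uncompressed", "lz4", "zstd"]:
    for chunksize in [1024, 4096, 16384, 65536]:
        path = "bench_{}_{}.feather".format(codec, chunksize)

        t0 = time.perf_counter()
        feather.write_feather(table, path, compression=codec,
                              chunksize=chunksize)
        write_s = time.perf_counter() - t0

        t0 = time.perf_counter()
        feather.read_table(path)  # memory-mapped read
        read_s = time.perf_counter() - t0

        print(codec, chunksize, os.path.getsize(path),
              "write={:.3f}s read={:.3f}s".format(write_s, read_s))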

Implementation notes:
* There is no parallelization or pipelining of reads or writes. For
example, on write, all of the buffers are compressed on a single
thread, and compression stops until the write to disk completes.
On read, buffers are decompressed serially (a sketch of what
overlapping compression and I/O could look like follows)
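
For illustration only, here is a minimal Python sketch (not the Arrow
C++ code) of what overlapping buffer compression with the disk write
could look like, using the third-party zstandard package; the buffers,
output path, worker count, and compression level are all placeholders.
Something analogous could pipeline decompression on the read path.

from concurrent.futures import ThreadPoolExecutor
import zstandard

def compress(buf):
    # One compression context per call to avoid sharing state across threads
    return zstandard.ZstdCompressor(level=1).compress(buf)

# Stand-ins for the record batch body buffers
buffers = [bytes(1 << 20) for _ in range(32)]

with ThreadPoolExecutor(max_workers=4) as pool, open("body.bin", "wb") as f:
    futures = [pool.submit(compress, b) for b in buffers]
    for fut in futures:
        # Write each buffer as soon as it is ready so compression of later
        # buffers overlaps with the I/O for earlier ones
        f.write(fut.result())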


On Thu, Mar 26, 2020 at 12:24 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I'll run a grid of batch sizes (from 1024 to 64K or 128K) and let you
> know the read/write times and compression ratios. Shouldn't take too
> long
>
> On Wed, Mar 25, 2020 at 10:37 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > Thanks a lot for sharing the good results.
> >
> > As investigated by Wes, we have existing zstd library for Java (zstd-jni) 
> > [1], and lz4 library for Java (lz4-java) [2].
> > +1 for the 1024 batch size, as it represents an important scenario where 
> > the batch fits into the L1 cache (IMO).
> >
> > Best,
> > Liya Fan
> >
> > [1] https://github.com/luben/zstd-jni
> > [2] https://github.com/lz4/lz4-java
> >
> > On Thu, Mar 26, 2020 at 2:38 AM Micah Kornfield <emkornfi...@gmail.com> 
> > wrote:
> >>
> >> If it isn't hard could you run with batch sizes of 1024 or 2048 records?  I
> >> think there was a question previously raised if there was benefit for
> >> smaller sizes buffers.
> >>
> >> Thanks,
> >> Micah
> >>
> >>
> >> On Wed, Mar 25, 2020 at 8:59 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> > On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield <emkornfi...@gmail.com>
> >> > wrote:
> >> > >
> >> > > >
> >> > > > Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> >> > > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie 
> >> > > > Mae
> >> > > > dataset. So that's a huge space savings
> >> > >
> >> > > One more question on this.  What was the average row-batch size used?  
> >> > > I
> >> > > see in the proposal some buffers might not be compressed, did you this
> >> > > feature in the test?
> >> >
> >> > I used 64K row batch size. I haven't implemented the optional
> >> > non-compressed buffers (for cases where there is little space savings)
> >> > so everything is compressed. I can check different batch sizes if you
> >> > like
> >> >
> >> >
> >> > > On Mon, Mar 23, 2020 at 4:40 PM Wes McKinney <wesmck...@gmail.com>
> >> > wrote:
> >> > >
> >> > > > hi folks,
> >> > > >
> >> > > > Sorry it's taken me a little while to produce supporting benchmarks.
> >> > > >
> >> > > > * I implemented experimental trivial body buffer compression in
> >> > > > https://github.com/apache/arrow/pull/6638
> >> > > > * I hooked up the Arrow IPC file format with compression as the new
> >> > > > Feather V2 format in
> >> > > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476
> >> > > >
> >> > > > I tested a couple of real-world datasets from a prior blog post
> >> > > > https://ursalabs.org/blog/2019-10-columnar-perf/ with ZSTD and LZ4
> >> > > > codecs
> >> > > >
> >> > > > The complete results are here
> >> > > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476
> >> > > >
> >> > > > Summary:
> >> > > >
> >> > > > * Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> >> > > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie 
> >> > > > Mae
> >> > > > dataset. So that's a huge space savings
> >> > > > * Single-threaded decompression times exceeding 2-4GByte/s with LZ4
> >> > > > and 1.2-3GByte/s with ZSTD
> >> > > >
> >> > > > I would have to do some more engineering to test throughput changes
> >> > > > with Flight, but given these results on slower networking (e.g. 1
> >> > > > Gigabit) my guess is that the compression and decompression overhead
> >> > > > is little compared with the time savings due to high compression
> >> > > > ratios. If people would like to see these numbers to help make a
> >> > > > decision I can take a closer look
> >> > > >
> >> > > > As far as what Micah said about having a limited number of
> >> > > > compressors: I would be in favor of having just LZ4 and ZSTD. It 
> >> > > > seems
> >> > > > anecdotally that these outperform Snappy in most real world scenarios
> >> > > > and generally have > 1 GB/s decompression performance. Some Linux
> >> > > > distributions (Arch at least) have already started adopting ZSTD over
> >> > > > LZMA or GZIP [1]
> >> > > >
> >> > > > - Wes
> >> > > >
> >> > > > [1]:
> >> > > >
> >> > https://www.archlinux.org/news/now-using-zstandard-instead-of-xz-for-package-compression/
> >> > > >
> >> > > > On Fri, Mar 6, 2020 at 8:42 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >> > > > >
> >> > > > > Hi Wes,
> >> > > > >
> >> > > > > Thanks a lot for the additional information.
> >> > > > > Looking forward to see the good results from your experiments.
> >> > > > >
> >> > > > > Best,
> >> > > > > Liya Fan
> >> > > > >
> >> > > > > On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney <wesmck...@gmail.com>
> >> > > > wrote:
> >> > > > >
> >> > > > > > I see, thank you.
> >> > > > > >
> >> > > > > > For such a scenario, implementations would need to define a
> >> > > > > > "UserDefinedCodec" interface to enable codecs to be registered 
> >> > > > > > from
> >> > > > > > third party code, similar to what is done for extension types [1]
> >> > > > > >
> >> > > > > > I'll update this thread when I get my experimental C++ patch up 
> >> > > > > > to
> >> > see
> >> > > > > > what I'm thinking at least for the built-in codecs we have like
> >> > ZSTD.
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > >
> >> > https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types
> >> > > > > >
> >> > > > > > On Thu, Mar 5, 2020 at 7:56 AM Fan Liya <liya.fa...@gmail.com>
> >> > wrote:
> >> > > > > > >
> >> > > > > > > Hi Wes,
> >> > > > > > >
> >> > > > > > > Thanks a lot for your further clarification.
> >> > > > > > >
> >> > > > > > > Some of my prelimiary thoughts:
> >> > > > > > >
> >> > > > > > > 1. We assign a unique GUID to each pair of
> >> > compression/decompression
> >> > > > > > > strategies. The GUID is stored as part of the
> >> > > > Message.custom_metadata.
> >> > > > > > When
> >> > > > > > > receiving the GUID, the receiver knows which decompression
> >> > strategy
> >> > > > to
> >> > > > > > use.
> >> > > > > > >
> >> > > > > > > 2. We serialize the decompression strategy, and store it into 
> >> > > > > > > the
> >> > > > > > > Message.custom_metadata. The receiver can decompress data after
> >> > > > > > > deserializing the strategy.
> >> > > > > > >
> >> > > > > > > Method 1 is generally used in static strategy scenarios while
> >> > method
> >> > > > 2 is
> >> > > > > > > generally used in dynamic strategy scenarios.
> >> > > > > > >
> >> > > > > > > Best,
> >> > > > > > > Liya Fan
> >> > > > > > >
> >> > > > > > > On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney <
> >> > wesmck...@gmail.com>
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Okay, I guess my question is how the receiver is going to be
> >> > able
> >> > > > to
> >> > > > > > > > determine how to "rehydrate" the record batch buffers:
> >> > > > > > > >
> >> > > > > > > > What I've proposed amounts to the following:
> >> > > > > > > >
> >> > > > > > > > * UNCOMPRESSED: the current behavior
> >> > > > > > > > * ZSTD/LZ4/...: each buffer is compressed and written with an
> >> > int64
> >> > > > > > > > length prefix
> >> > > > > > > >
> >> > > > > > > > (I'm close to putting up a PR implementing an experimental
> >> > version
> >> > > > of
> >> > > > > > > > this that uses Message.custom_metadata to transmit the codec,
> >> > so
> >> > > > this
> >> > > > > > > > will make the implementation details more concrete)
> >> > > > > > > >
> >> > > > > > > > So in the USER_DEFINED case, how will the library know how to
> >> > > > obtain
> >> > > > > > > > the uncompressed buffer? Is some additional metadata 
> >> > > > > > > > structure
> >> > > > > > > > required to provide instructions?
> >> > > > > > > >
> >> > > > > > > > On Wed, Mar 4, 2020 at 8:05 AM Fan Liya 
> >> > > > > > > > <liya.fa...@gmail.com>
> >> > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > Hi Wes,
> >> > > > > > > > >
> >> > > > > > > > > I am thinking of adding an option named "USER_DEFINED" (or
> >> > > > something
> >> > > > > > > > > similar) to enum CompressionType in your proposal.
> >> > > > > > > > > IMO, this option should be used primarily in Flight.
> >> > > > > > > > >
> >> > > > > > > > > Best,
> >> > > > > > > > > Liya Fan
> >> > > > > > > > >
> >> > > > > > > > > On Wed, Mar 4, 2020 at 11:12 AM Wes McKinney <
> >> > > > wesmck...@gmail.com>
> >> > > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > On Tue, Mar 3, 2020, 8:11 PM Fan Liya <
> >> > liya.fa...@gmail.com>
> >> > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Sure. I agree with you that we should not overdo this.
> >> > > > > > > > > > > I am wondering if we should provide an option to allow
> >> > users
> >> > > > to
> >> > > > > > > > plugin
> >> > > > > > > > > > > their customized compression strategies.
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > Can you provide a patch showing changes to Message.fbs 
> >> > > > > > > > > > (or
> >> > > > > > Schema.fbs)
> >> > > > > > > > that
> >> > > > > > > > > > make this idea more concrete?
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > > Best,
> >> > > > > > > > > > > Liya Fan
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Tue, Mar 3, 2020 at 9:47 PM Wes McKinney <
> >> > > > wesmck...@gmail.com
> >> > > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > On Tue, Mar 3, 2020, 7:36 AM Fan Liya <
> >> > > > liya.fa...@gmail.com>
> >> > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > I am so glad to see this discussion, and I am
> >> > willing to
> >> > > > > > provide
> >> > > > > > > > help
> >> > > > > > > > > > > > from
> >> > > > > > > > > > > > > the Java side.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > In the proposal, I see the support for basic
> >> > compression
> >> > > > > > > > strategies
> >> > > > > > > > > > > > > (e.g.gzip, snappy).
> >> > > > > > > > > > > > > IMO, applying a single basic strategy is not likely
> >> > to
> >> > > > > > achieve
> >> > > > > > > > > > > > performance
> >> > > > > > > > > > > > > improvement for most scenarios.
> >> > > > > > > > > > > > > The optimal compression strategy is often obtained 
> >> > > > > > > > > > > > > by
> >> > > > > > composing
> >> > > > > > > > basic
> >> > > > > > > > > > > > > strategies and tuning parameters.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I hope we can support such highly customized
> >> > compression
> >> > > > > > > > strategies.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > I think very much beyond trivial one-shot buffer 
> >> > > > > > > > > > > > level
> >> > > > > > compression
> >> > > > > > > > is
> >> > > > > > > > > > > > probably out of the question for addition to the
> >> > current
> >> > > > > > > > "RecordBatch"
> >> > > > > > > > > > > > Flatbuffers type, because the additional metadata
> >> > would add
> >> > > > > > > > undesirable
> >> > > > > > > > > > > > bloat (which I would be against). If people have 
> >> > > > > > > > > > > > other
> >> > > > ideas it
> >> > > > > > > > would
> >> > > > > > > > > > be
> >> > > > > > > > > > > > great to see exactly what you are thinking as far as
> >> > > > changes
> >> > > > > > to the
> >> > > > > > > > > > > > protocol files.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > I'll try to assemble some examples to show the
> >> > before/after
> >> > > > > > > > results of
> >> > > > > > > > > > > > applying the simple strategy.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Best,
> >> > > > > > > > > > > > > Liya Fan
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <
> >> > > > > > > > anto...@python.org>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > If we want to use a HTTP header, it would be more
> >> > of a
> >> > > > > > > > > > > Accept-Encoding
> >> > > > > > > > > > > > > > header, no?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > In any case, we would have to put non-standard
> >> > values
> >> > > > there
> >> > > > > > > > (e.g.
> >> > > > > > > > > > > lz4),
> >> > > > > > > > > > > > > > so I'm not sure how desirable it is to repurpose
> >> > HTTP
> >> > > > > > headers
> >> > > > > > > > for
> >> > > > > > > > > > > that,
> >> > > > > > > > > > > > > > rather than add some dedicated field to the 
> >> > > > > > > > > > > > > > Flight
> >> > > > > > messages.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Regards
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Antoine.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Le 03/03/2020 à 12:52, David Li a écrit :
> >> > > > > > > > > > > > > > > gRPC supports headers so for Flight, we could
> >> > send
> >> > > > > > > > essentially an
> >> > > > > > > > > > > > > Accept
> >> > > > > > > > > > > > > > > header and perhaps a Content-Type header.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > David
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Mon, Mar 2, 2020, 23:15 Micah Kornfield <
> >> > > > > > > > > > emkornfi...@gmail.com>
> >> > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >> Hi Wes,
> >> > > > > > > > > > > > > > >> A few thoughts on this.  In general, I think 
> >> > > > > > > > > > > > > > >> it
> >> > is a
> >> > > > > > good
> >> > > > > > > > idea.
> >> > > > > > > > > > > But
> >> > > > > > > > > > > > > > before
> >> > > > > > > > > > > > > > >> proceeding, I think the following points are
> >> > worth
> >> > > > > > > > discussing:
> >> > > > > > > > > > > > > > >> 1.  Does this actually improve
> >> > throughput/latency
> >> > > > for
> >> > > > > > > > Flight? (I
> >> > > > > > > > > > > > think
> >> > > > > > > > > > > > > > you
> >> > > > > > > > > > > > > > >> mentioned you would follow-up with 
> >> > > > > > > > > > > > > > >> benchmarks).
> >> > > > > > > > > > > > > > >> 2.  I think we should limit the number of
> >> > supported
> >> > > > > > > > compression
> >> > > > > > > > > > > > > schemes
> >> > > > > > > > > > > > > > to
> >> > > > > > > > > > > > > > >> only 1 or 2.  I think the criteria for 
> >> > > > > > > > > > > > > > >> selection
> >> > > > speed
> >> > > > > > and
> >> > > > > > > > > > native
> >> > > > > > > > > > > > > > >> implementations available across the widest
> >> > possible
> >> > > > > > > > languages.
> >> > > > > > > > > > > As
> >> > > > > > > > > > > > > far
> >> > > > > > > > > > > > > > as
> >> > > > > > > > > > > > > > >> i can tell zstd only have bindings in java via
> >> > JNI,
> >> > > > but
> >> > > > > > my
> >> > > > > > > > > > > > > > understanding is
> >> > > > > > > > > > > > > > >> it is probably the type of compression for our
> >> > > > > > use-cases.
> >> > > > > > > > So I
> >> > > > > > > > > > > > think
> >> > > > > > > > > > > > > > >> zstd + potentially 1 more.
> >> > > > > > > > > > > > > > >> 3.  Commitment from someone on the Java side 
> >> > > > > > > > > > > > > > >> to
> >> > > > > > implement
> >> > > > > > > > this.
> >> > > > > > > > > > > > > > >> 4.  This doesn't need to be coupled with this
> >> > change
> >> > > > > > per-se
> >> > > > > > > > but
> >> > > > > > > > > > > for
> >> > > > > > > > > > > > > > >> something like flight it would be good to 
> >> > > > > > > > > > > > > > >> have a
> >> > > > > > standard
> >> > > > > > > > > > > mechanism
> >> > > > > > > > > > > > > for
> >> > > > > > > > > > > > > > >> negotiating server/client capabilities (e.g.
> >> > client
> >> > > > > > doesn't
> >> > > > > > > > > > > support
> >> > > > > > > > > > > > > > >> compression or only supports a subset).
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > > >> Thanks,
> >> > > > > > > > > > > > > > >> Micah
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > > >> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <
> >> > > > > > > > > > wesmck...@gmail.com>
> >> > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > > >>> On Sun, Mar 1, 2020 at 3:14 PM Antoine 
> >> > > > > > > > > > > > > > >>> Pitrou <
> >> > > > > > > > > > > anto...@python.org>
> >> > > > > > > > > > > > > > >> wrote:
> >> > > > > > > > > > > > > > >>>>
> >> > > > > > > > > > > > > > >>>>
> >> > > > > > > > > > > > > > >>>> Le 01/03/2020 à 22:01, Wes McKinney a écrit 
> >> > > > > > > > > > > > > > >>>> :
> >> > > > > > > > > > > > > > >>>>> In the context of a "next version of the
> >> > Feather
> >> > > > > > format"
> >> > > > > > > > > > > > ARROW-5510
> >> > > > > > > > > > > > > > >>>>> (which is consumed only by Python and R at
> >> > the
> >> > > > > > moment), I
> >> > > > > > > > > > have
> >> > > > > > > > > > > > been
> >> > > > > > > > > > > > > > >>>>> looking at compressing buffers using fast
> >> > > > compressors
> >> > > > > > > > like
> >> > > > > > > > > > ZSTD
> >> > > > > > > > > > > > > when
> >> > > > > > > > > > > > > > >>>>> writing the RecordBatch bodies. This could 
> >> > > > > > > > > > > > > > >>>>> be
> >> > > > handled
> >> > > > > > > > > > privately
> >> > > > > > > > > > > > as
> >> > > > > > > > > > > > > an
> >> > > > > > > > > > > > > > >>>>> implementation detail of the Feather file,
> >> > but
> >> > > > since
> >> > > > > > ZSTD
> >> > > > > > > > > > > > > compression
> >> > > > > > > > > > > > > > >>>>> could improve throughput in Flight, for
> >> > example,
> >> > > > I
> >> > > > > > > > thought I
> >> > > > > > > > > > > > would
> >> > > > > > > > > > > > > > >>>>> bring it up for discussion.
> >> > > > > > > > > > > > > > >>>>>
> >> > > > > > > > > > > > > > >>>>> I can see two simple compression 
> >> > > > > > > > > > > > > > >>>>> strategies:
> >> > > > > > > > > > > > > > >>>>>
> >> > > > > > > > > > > > > > >>>>> * Compress the entire message body in
> >> > one-shot,
> >> > > > > > writing
> >> > > > > > > > the
> >> > > > > > > > > > > > result
> >> > > > > > > > > > > > > > >> out
> >> > > > > > > > > > > > > > >>>>> with an 8-byte int64 prefix indicating the
> >> > > > > > uncompressed
> >> > > > > > > > size
> >> > > > > > > > > > > > > > >>>>> * Compress each non-zero-length constituent
> >> > > > Buffer
> >> > > > > > prior
> >> > > > > > > > to
> >> > > > > > > > > > > > writing
> >> > > > > > > > > > > > > > >> to
> >> > > > > > > > > > > > > > >>>>> the body (and using the same
> >> > > > > > uncompressed-length-prefix
> >> > > > > > > > when
> >> > > > > > > > > > > > > writing
> >> > > > > > > > > > > > > > >>>>> the compressed buffer)
> >> > > > > > > > > > > > > > >>>>>
> >> > > > > > > > > > > > > > >>>>> The latter strategy is preferable for
> >> > scenarios
> >> > > > > > where we
> >> > > > > > > > may
> >> > > > > > > > > > > > > project
> >> > > > > > > > > > > > > > >>>>> out only a few fields from a larger record
> >> > batch
> >> > > > > > (such as
> >> > > > > > > > > > > reading
> >> > > > > > > > > > > > > > >> from
> >> > > > > > > > > > > > > > >>>>> a memory-mapped file).
> >> > > > > > > > > > > > > > >>>>
> >> > > > > > > > > > > > > > >>>> Agreed.  It may also allow using different
> >> > > > compression
> >> > > > > > > > > > > strategies
> >> > > > > > > > > > > > > for
> >> > > > > > > > > > > > > > >>>> different kinds of buffers (for example a
> >> > > > bytestream
> >> > > > > > > > splitting
> >> > > > > > > > > > > > > > strategy
> >> > > > > > > > > > > > > > >>>> for floats and doubles, or a delta encoding
> >> > > > strategy
> >> > > > > > for
> >> > > > > > > > > > > > integers).
> >> > > > > > > > > > > > > > >>>
> >> > > > > > > > > > > > > > >>> If we wanted to allow for different
> >> > compression to
> >> > > > > > apply to
> >> > > > > > > > > > > > different
> >> > > > > > > > > > > > > > >>> buffers, I think we will need a new Message
> >> > type
> >> > > > > > because
> >> > > > > > > > this
> >> > > > > > > > > > > would
> >> > > > > > > > > > > > > > >>> inflate metadata sizes in a way that is not
> >> > likely
> >> > > > to
> >> > > > > > be
> >> > > > > > > > > > > acceptable
> >> > > > > > > > > > > > > > >>> for the current uncompressed use case.
> >> > > > > > > > > > > > > > >>>
> >> > > > > > > > > > > > > > >>> Here is my strawman proposal
> >> > > > > > > > > > > > > > >>>
> >> > > > > > > > > > > > > > >>>
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > >
> >> > > > > >
> >> > > >
> >> > https://github.com/apache/arrow/compare/master...wesm:compression-strawman
> >> > > > > > > > > > > > > > >>>
> >> > > > > > > > > > > > > > >>>>> Implementation could be accomplished by one
> >> > of
> >> > > > the
> >> > > > > > > > following
> >> > > > > > > > > > > > > methods:
> >> > > > > > > > > > > > > > >>>>>
> >> > > > > > > > > > > > > > >>>>> * Setting a field in 
> >> > > > > > > > > > > > > > >>>>> Message.custom_metadata
> >> > > > > > > > > > > > > > >>>>> * Adding a new field to Message
> >> > > > > > > > > > > > > > >>>>
> >> > > > > > > > > > > > > > >>>> I think it has to be a new field in Message.
> >> > > > Making
> >> > > > > > it an
> >> > > > > > > > > > > > ignorable
> >> > > > > > > > > > > > > > >>>> metadata field means non-supporting 
> >> > > > > > > > > > > > > > >>>> receivers
> >> > will
> >> > > > > > decode
> >> > > > > > > > and
> >> > > > > > > > > > > > > > interpret
> >> > > > > > > > > > > > > > >>>> the data wrongly.
> >> > > > > > > > > > > > > > >>>>
> >> > > > > > > > > > > > > > >>>> Regards
> >> > > > > > > > > > > > > > >>>>
> >> > > > > > > > > > > > > > >>>> Antoine.
> >> > > > > > > > > > > > > > >>>
> >> > > > > > > > > > > > > > >>
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > >
> >> > > > > >
> >> > > >
> >> >
