> > I executed some of the benchmarks in the airlift/aircompressor project. I
> > found that aircompressor achieves on average only about 72% throughput
> > compared to the current version of the lz4-java JNI bindings when
> > compressing. When decompressing the gap is even bigger, with around 56%
> > throughput. See the following Google sheet for the benchmark results:
> >
> > https://docs.google.com/spreadsheets/d/1mT1qmpvV25YcRmPz4IYxXyPSzUsMsdovN7gc_vmac5U/edit?usp=sharing
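[Editor's note: Micah's question below about JIT warm-up matters for numbers like these. A minimal sketch of the discard-then-measure pattern a fair throughput benchmark needs — this is a hypothetical harness, not the aircompressor JMH suite, and it uses stdlib zlib as a stand-in codec since Python has no LZ4 in the standard library:]

```python
import time
import zlib

def throughput_mb_s(fn, payload: bytes, warmup: int = 50, iters: int = 200) -> float:
    # Warm-up phase: discarded runs. On the JVM this is what gives the
    # JIT time to compile the hot path before anything is measured; the
    # same steady-state pattern applies to any throughput benchmark.
    for _ in range(warmup):
        fn(payload)
    # Measured phase.
    start = time.perf_counter()
    for _ in range(iters):
        fn(payload)
    elapsed = time.perf_counter() - start
    return (len(payload) * iters) / elapsed / 1e6  # MB/s

payload = b"abcdefgh" * 10_000
rate = throughput_mb_s(zlib.compress, payload)
```

[Without the warm-up loop, JVM benchmarks mostly measure interpreter and compilation overhead, which is why JMH bakes warm-up iterations into its protocol.]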
I think it is fine if we want to use lz4-java bindings for compression, as
this will be a proper subset of LZ4 compression. Could you share the
benchmark code / how the benchmark was run (does this account for JIT
warm-up time)?

-Micah

On Mon, Mar 22, 2021 at 7:46 AM Antoine Pitrou <anto...@python.org> wrote:

> On 22/03/2021 at 15:29, Benjamin Wilhelm wrote:
> > Also, I would like to resume the discussion about the Frame format vs the
> > Block format. There were 3 points for the Frame format by Antoine:
> >
> >> - it allows streaming compression and decompression (meaning you can
> >> avoid loading a huge compressed buffer at once)
> >
> > It seems like this is not used anywhere. Doesn't it make more sense to
> > use more record batches if one buffer in a record batch gets too big?
>
> It does. But that depends who emitted the record batches. Perhaps you're
> receiving data written out by a large machine and trying to process it
> on a small embedded client? I'm not sure this example makes sense or is
> interesting at all.
>
> >> - it embeds the decompressed size, allowing exact allocation of the
> >> decompressed buffer
> >
> > Micah pointed out that this is already part of the IPC specification.
>
> Ah, indeed. Then this point is moot.
>
> >> - it has an optional checksum
> >
> > Wouldn't it make sense to have a higher-level checksum (as already
> > mentioned by Antoine) if we want to have checksums at all? Just having a
> > checksum in case of one specific compression does not make a lot of
> > sense to me.
>
> Yes, it would certainly make sense.
>
> > Given these points, I think that the Frame format is not more useful for
> > compressing Arrow Buffers than the Block format. It only adds
> > unnecessary overhead in metadata.
> > Are there any other points for the Frame format that I missed? If not,
> > what would it mean to switch to the Block format? (Or add the Block
> > format as an option?)
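[Editor's note: the "decompressed size is already part of the IPC specification" point refers to Arrow's compressed-buffer layout, where each buffer body carries its uncompressed length as a little-endian signed 64-bit prefix. A minimal round-trip sketch of that layout — zlib stands in for the LZ4 raw-block codec here, since the Python standard library has no LZ4:]

```python
import struct
import zlib

def compress_buffer(data: bytes) -> bytes:
    # Arrow IPC prefixes each compressed buffer with the uncompressed
    # length as a little-endian int64, then the compressed body.
    return struct.pack("<q", len(data)) + zlib.compress(data)

def decompress_buffer(buf: bytes) -> bytes:
    (uncompressed_len,) = struct.unpack_from("<q", buf, 0)
    # The embedded length lets the reader allocate the output buffer
    # exactly once -- which is why the frame format's own content-size
    # field is redundant here.
    out = zlib.decompress(buf[8:])
    assert len(out) == uncompressed_len
    return out

original = b"some repetitive arrow buffer content " * 64
roundtrip = decompress_buffer(compress_buffer(original))
```

[Since the length prefix lives at the IPC layer, it is available regardless of which codec — frame or raw block — produced the compressed body.]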
>
> Well, the problem is that by now this spec is public and has been
> implemented since Arrow C++ 2.0.
>
> Fortunately, we named the compression method "LZ4_FRAME", so we could add
> "LZ4_RAW". But it seems a bit wasteful and confusing to allow both even
> though we agree there is no advantage to either one.
>
> Regards
>
> Antoine.
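[Editor's note: the "LZ4_FRAME" naming Antoine refers to lives in the Flatbuffers schema for IPC messages. Roughly, the relevant definitions in Arrow's `Message.fbs` look like the following — paraphrased from memory, so check the current schema before relying on field details:]

```flatbuffers
// Sketch of the body-compression metadata in Message.fbs (paraphrased).
enum CompressionType : byte {
  LZ4_FRAME,   // LZ4 frame format; an LZ4_RAW value could be added later
  ZSTD
}

enum BodyCompressionMethod : byte {
  BUFFER       // each buffer compressed independently
}

table BodyCompression {
  codec: CompressionType = LZ4_FRAME;
  method: BodyCompressionMethod = BUFFER;
}
```

[Because `CompressionType` is an open-ended enum, adding a raw-block variant is backward compatible, but every reader would then have to support both codecs — the wastefulness Antoine is pointing at.]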