> > I executed some of the benchmarks in the airlift/aircompressor project. I
> > found that aircompressor achieves on average only about 72% throughput
> > compared to the current version of the lz4-java JNI bindings when
> > compressing. When decompressing the gap is even bigger, with around 56%
> > throughput. See the following Google sheet for the benchmark results:
> >
> > https://docs.google.com/spreadsheets/d/1mT1qmpvV25YcRmPz4IYxXyPSzUsMsdovN7gc_vmac5U/edit?usp=sharing
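[Editor's note: Micah's question below about JIT warm-up matters for numbers like these. A minimal sketch of the discard-then-measure pattern a fair throughput benchmark needs — this is a hypothetical harness, not the aircompressor JMH suite, and it uses stdlib zlib as a stand-in codec since Python has no LZ4 in the standard library:]

```python
import time
import zlib

def throughput_mb_s(fn, payload: bytes, warmup: int = 50, iters: int = 200) -> float:
    # Warm-up phase: discarded runs. On the JVM this is what gives the
    # JIT time to compile the hot path before anything is measured; the
    # same steady-state pattern applies to any throughput benchmark.
    for _ in range(warmup):
        fn(payload)
    # Measured phase.
    start = time.perf_counter()
    for _ in range(iters):
        fn(payload)
    elapsed = time.perf_counter() - start
    return (len(payload) * iters) / elapsed / 1e6  # MB/s

payload = b"abcdefgh" * 10_000
rate = throughput_mb_s(zlib.compress, payload)
```

[Without the warm-up loop, JVM benchmarks mostly measure interpreter and compilation overhead, which is why JMH bakes warm-up iterations into its protocol.]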
I think it is fine if we want to use lz4-java bindings for compression, as
this will be a proper subset of LZ4 compression. Could you share the
benchmark code / how the benchmark was run (does this account for JIT
warm-up time)?

-Micah

On Mon, Mar 22, 2021 at 7:46 AM Antoine Pitrou <anto...@python.org> wrote:

> On 22/03/2021 at 15:29, Benjamin Wilhelm wrote:
> > Also, I would like to resume the discussion about the Frame format vs the
> > Block format. There were 3 points for the Frame format by Antoine:
> >
> >> - it allows streaming compression and decompression (meaning you can
> >> avoid loading a huge compressed buffer at once)
> >
> > It seems like this is not used anywhere. Doesn't it make more sense to
> > use more record batches if one buffer in a record batch gets too big?
>
> It does. But that depends who emitted the record batches. Perhaps you're
> receiving data written out by a large machine and trying to process it
> on a small embedded client? I'm not sure this example makes sense or is
> interesting at all.
>
> >> - it embeds the decompressed size, allowing exact allocation of the
> >> decompressed buffer
> >
> > Micah pointed out that this is already part of the IPC specification.
>
> Ah, indeed. Then this point is moot.
>
> >> - it has an optional checksum
> >
> > Wouldn't it make sense to have a higher-level checksum (as already
> > mentioned by Antoine) if we want to have checksums at all? Just having a
> > checksum in case of one specific compression does not make a lot of
> > sense to me.
>
> Yes, it would certainly make sense.
>
> > Given these points, I think that the Frame format is not more useful for
> > compressing Arrow Buffers than the Block format. It only adds
> > unnecessary overhead in metadata.
> > Are there any other points for the Frame format that I missed? If not,
> > what would it mean to switch to the Block format? (Or add the Block
> > format as an option?)
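[Editor's note: the "decompressed size is already part of the IPC specification" point refers to Arrow's compressed-buffer layout, where each buffer body carries its uncompressed length as a little-endian signed 64-bit prefix. A minimal round-trip sketch of that layout — zlib stands in for the LZ4 raw-block codec here, since the Python standard library has no LZ4:]

```python
import struct
import zlib

def compress_buffer(data: bytes) -> bytes:
    # Arrow IPC prefixes each compressed buffer with the uncompressed
    # length as a little-endian int64, then the compressed body.
    return struct.pack("<q", len(data)) + zlib.compress(data)

def decompress_buffer(buf: bytes) -> bytes:
    (uncompressed_len,) = struct.unpack_from("<q", buf, 0)
    # The embedded length lets the reader allocate the output buffer
    # exactly once -- which is why the frame format's own content-size
    # field is redundant here.
    out = zlib.decompress(buf[8:])
    assert len(out) == uncompressed_len
    return out

original = b"some repetitive arrow buffer content " * 64
roundtrip = decompress_buffer(compress_buffer(original))
```

[Since the length prefix lives at the IPC layer, it is available regardless of which codec — frame or raw block — produced the compressed body.]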
>
> Well, the problem is that by now this spec is public and has been
> implemented since Arrow C++ 2.0.
>
> Fortunately, we named the compression method "LZ4_FRAME", so we could add
> "LZ4_RAW". But it seems a bit wasteful and confusing to allow both even
> though we agree there is no advantage to either one.
>
> Regards
>
> Antoine.
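[Editor's note: the "LZ4_FRAME" naming Antoine refers to lives in the Flatbuffers schema for IPC messages. Roughly, the relevant definitions in Arrow's `Message.fbs` look like the following — paraphrased from memory, so check the current schema before relying on field details:]

```flatbuffers
// Sketch of the body-compression metadata in Message.fbs (paraphrased).
enum CompressionType : byte {
  LZ4_FRAME,   // LZ4 frame format; an LZ4_RAW value could be added later
  ZSTD
}

enum BodyCompressionMethod : byte {
  BUFFER       // each buffer compressed independently
}

table BodyCompression {
  codec: CompressionType = LZ4_FRAME;
  method: BodyCompressionMethod = BUFFER;
}
```

[Because `CompressionType` is an open-ended enum, adding a raw-block variant is backward compatible, but every reader would then have to support both codecs — the wastefulness Antoine is pointing at.]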