On 22/03/2021 at 15:29, Benjamin Wilhelm wrote:
> Also, I would like to resume the discussion about the Frame format vs the
> Block format. There were three points in favor of the Frame format from
> Antoine:
> - it allows streaming compression and decompression (meaning you can
> avoid loading a huge compressed buffer at once)
> It seems like this is not used anywhere. Doesn't it make more sense to use
> more record batches if one buffer in a record batch gets too big?
It does. But that depends on who emitted the record batches. Perhaps you're
receiving data written out by a large machine and trying to process it
on a small embedded client? I'm not sure this example makes sense or is
interesting at all.
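
To illustrate what the frame format enables here, a rough sketch of
streaming decompression using the lz4frame.h C API. The helper name, the
FILE*-based I/O and the chunk sizes are illustrative assumptions, not
anything the Arrow implementation actually does:

    #include <lz4frame.h>
    #include <cstdio>
    #include <vector>

    bool decompress_stream(std::FILE* in, std::FILE* out) {
      LZ4F_dctx* dctx = nullptr;
      if (LZ4F_isError(LZ4F_createDecompressionContext(&dctx, LZ4F_VERSION)))
        return false;
      std::vector<char> src(64 * 1024), dst(256 * 1024);
      size_t ret = 1;                      // nonzero: frame not finished yet
      while (ret != 0) {
        size_t srcRead = std::fread(src.data(), 1, src.size(), in);
        if (srcRead == 0) break;           // EOF before the end of the frame
        const char* p = src.data();
        const char* end = p + srcRead;
        while (p < end && ret != 0) {      // one read may take several calls
          size_t dstSize = dst.size();
          size_t srcSize = static_cast<size_t>(end - p);
          ret = LZ4F_decompress(dctx, dst.data(), &dstSize, p, &srcSize,
                                nullptr);
          if (LZ4F_isError(ret)) {
            LZ4F_freeDecompressionContext(dctx);
            return false;
          }
          std::fwrite(dst.data(), 1, dstSize, out);
          p += srcSize;                    // srcSize now holds bytes consumed
        }
      }
      LZ4F_freeDecompressionContext(dctx);
      return ret == 0;                     // 0: frame ended cleanly
    }

The point being that at no time does the whole compressed buffer need to be
resident in memory.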
> - it embeds the decompressed size, allowing exact allocation of the
> decompressed buffer
> Micah pointed out that this is already part of the IPC specification.
Ah, indeed. Then this point is moot.
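
For reference, a sketch of how a reader can do the exact allocation from
the IPC layout alone: with BodyCompression, each compressed buffer body is
prefixed with a little-endian int64 giving the uncompressed length (-1
meaning the buffer is stored uncompressed). The raw-block decode shown
assumes the hypothetical "LZ4_RAW" codec discussed below, and a
little-endian host:

    #include <lz4.h>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // body points at one compressed buffer inside the IPC message body.
    std::vector<char> decode_buffer(const char* body, size_t bodyLen) {
      int64_t rawLen;
      std::memcpy(&rawLen, body, sizeof rawLen);  // assumes LE host
      if (rawLen == -1)                           // stored uncompressed
        return std::vector<char>(body + 8, body + bodyLen);
      std::vector<char> out(static_cast<size_t>(rawLen));  // exact size
      int n = LZ4_decompress_safe(body + 8, out.data(),
                                  static_cast<int>(bodyLen - 8),
                                  static_cast<int>(out.size()));
      if (n < 0)
        out.clear();                              // corrupt payload
      return out;
    }

So the content-size field in the frame header only duplicates information
the IPC metadata already carries.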
> - it has an optional checksum
> Wouldn't it make sense to have a higher-level checksum (as already
> mentioned by Antoine) if we want to have checksums at all? Having a
> checksum only in the case of one specific compression format does not make
> a lot of sense to me.
Yes, it would certainly make sense.
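As an illustration only (the IPC format has no such field today, and the
choice of CRC-32 via zlib is mine, not anything that was proposed), a
message-level checksum could be as simple as:

    #include <zlib.h>
    #include <cstdint>

    uint32_t message_checksum(const unsigned char* body, size_t bodyLen) {
      uLong crc = crc32(0L, Z_NULL, 0);    // initial CRC-32 value
      crc = crc32(crc, body, static_cast<uInt>(bodyLen));
      return static_cast<uint32_t>(crc);
    }

That would cover the entire encapsulated message body, compressed or not,
rather than only the payload of one particular codec.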
> Given these points, I think that the Frame format is not more useful for
> compressing Arrow Buffers than the Block format. It only adds unnecessary
> metadata overhead.
> Are there any other points for the Frame format that I missed? If not, what
> would it mean to switch to the Block format? (Or add the Block format as an
> option?)
Well, the problem is that by now this spec is public and has been
implemented since Arrow C++ 2.0.
Fortunately, we named the compression method "LZ4_FRAME", so we could add
"LZ4_RAW". But it seems a bit wasteful and confusing to allow both even
though we agree that neither has an advantage over the other.
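
For comparison, the two encodings those names would refer to, via the lz4 C
API (the helper names are placeholders): LZ4F_compressFrame() emits the full
frame container (magic number, header, optional content checksum), while
LZ4_compress_default() emits just the raw block payload:

    #include <lz4.h>
    #include <lz4frame.h>
    #include <vector>

    size_t compress_frame(const char* src, size_t n, std::vector<char>& dst) {
      dst.resize(LZ4F_compressFrameBound(n, nullptr));
      size_t r = LZ4F_compressFrame(dst.data(), dst.size(), src, n, nullptr);
      return LZ4F_isError(r) ? 0 : r;  // includes frame header/footer bytes
    }

    size_t compress_raw(const char* src, size_t n, std::vector<char>& dst) {
      dst.resize(LZ4_compressBound(static_cast<int>(n)));
      int r = LZ4_compress_default(src, dst.data(), static_cast<int>(n),
                                   static_cast<int>(dst.size()));
      return r > 0 ? static_cast<size_t>(r) : 0;  // payload only, no metadata
    }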
Regards
Antoine.