On 22/03/2021 at 15:29, Benjamin Wilhelm wrote:
> Also, I would like to resume the discussion about the Frame format vs the
> Block format. There were three points in favor of the Frame format from
> Antoine:
> - it allows streaming compression and decompression (meaning you can
> avoid loading a huge compressed buffer at once)
> It seems like this is not used anywhere. Doesn't it make more sense to use
> more record batches if one buffer in a record batch gets too big?
It does. But that depends on who emitted the record batches. Perhaps you're
receiving data written out by a large machine and trying to process it
on a small embedded client? I'm not sure this example makes sense or is
interesting at all.
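
To illustrate what the frame format enables here, a rough sketch of
streaming decompression using the lz4frame.h C API. The helper name, the
FILE*-based I/O and the chunk sizes are illustrative assumptions, not
anything the Arrow implementation actually does:

    #include <lz4frame.h>
    #include <cstdio>
    #include <vector>

    bool decompress_stream(std::FILE* in, std::FILE* out) {
      LZ4F_dctx* dctx = nullptr;
      if (LZ4F_isError(LZ4F_createDecompressionContext(&dctx, LZ4F_VERSION)))
        return false;
      std::vector<char> src(64 * 1024), dst(256 * 1024);
      size_t ret = 1;                      // nonzero: frame not finished yet
      while (ret != 0) {
        size_t srcRead = std::fread(src.data(), 1, src.size(), in);
        if (srcRead == 0) break;           // EOF before the end of the frame
        const char* p = src.data();
        const char* end = p + srcRead;
        while (p < end && ret != 0) {      // one read may take several calls
          size_t dstSize = dst.size();
          size_t srcSize = static_cast<size_t>(end - p);
          ret = LZ4F_decompress(dctx, dst.data(), &dstSize, p, &srcSize,
                                nullptr);
          if (LZ4F_isError(ret)) {
            LZ4F_freeDecompressionContext(dctx);
            return false;
          }
          std::fwrite(dst.data(), 1, dstSize, out);
          p += srcSize;                    // srcSize now holds bytes consumed
        }
      }
      LZ4F_freeDecompressionContext(dctx);
      return ret == 0;                     // 0: frame ended cleanly
    }

The point being that at no time does the whole compressed buffer need to be
resident in memory.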
> - it embeds the decompressed size, allowing exact allocation of the
> decompressed buffer
> Micah pointed out that this is already part of the IPC specification.
Ah, indeed. Then this point is moot.
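
For reference, a sketch of how a reader can do the exact allocation from
the IPC layout alone: with BodyCompression, each compressed buffer body is
prefixed with a little-endian int64 giving the uncompressed length (-1
meaning the buffer is stored uncompressed). The raw-block decode shown
assumes the hypothetical "LZ4_RAW" codec discussed below, and a
little-endian host:

    #include <lz4.h>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // body points at one compressed buffer inside the IPC message body.
    std::vector<char> decode_buffer(const char* body, size_t bodyLen) {
      int64_t rawLen;
      std::memcpy(&rawLen, body, sizeof rawLen);  // assumes LE host
      if (rawLen == -1)                           // stored uncompressed
        return std::vector<char>(body + 8, body + bodyLen);
      std::vector<char> out(static_cast<size_t>(rawLen));  // exact size
      int n = LZ4_decompress_safe(body + 8, out.data(),
                                  static_cast<int>(bodyLen - 8),
                                  static_cast<int>(out.size()));
      if (n < 0)
        out.clear();                              // corrupt payload
      return out;
    }

So the content-size field in the frame header only duplicates information
the IPC metadata already carries.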
> - it has an optional checksum
> Wouldn't it make sense to have a higher-level checksum (as already
> mentioned by Antoine) if we want to have checksums at all? Having a
> checksum only in the case of one specific compression format does not make
> a lot of sense to me.
Yes, it would certainly make sense.
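As an illustration only (the IPC format has no such field today, and the
choice of CRC-32 via zlib is mine, not anything that was proposed), a
message-level checksum could be as simple as:

    #include <zlib.h>
    #include <cstdint>

    uint32_t message_checksum(const unsigned char* body, size_t bodyLen) {
      uLong crc = crc32(0L, Z_NULL, 0);    // initial CRC-32 value
      crc = crc32(crc, body, static_cast<uInt>(bodyLen));
      return static_cast<uint32_t>(crc);
    }

That would cover the entire encapsulated message body, compressed or not,
rather than only the payload of one particular codec.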
> Given these points, I think that the Frame format is not more useful for
> compressing Arrow Buffers than the Block format. It only adds unnecessary
> metadata overhead.
> Are there any other points for the Frame format that I missed? If not, what
> would it mean to switch to the Block format? (Or add the Block format as an
> option?)
Well, the problem is that by now this spec is public and has been
implemented since Arrow C++ 2.0.
Fortunately, we named the compression method "LZ4_FRAME", so we could add
"LZ4_RAW". But it seems a bit wasteful and confusing to allow both even
though we agree that neither has an advantage over the other.
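
For comparison, the two encodings those names would refer to, via the lz4 C
API (the helper names are placeholders): LZ4F_compressFrame() emits the full
frame container (magic number, header, optional content checksum), while
LZ4_compress_default() emits just the raw block payload:

    #include <lz4.h>
    #include <lz4frame.h>
    #include <vector>

    size_t compress_frame(const char* src, size_t n, std::vector<char>& dst) {
      dst.resize(LZ4F_compressFrameBound(n, nullptr));
      size_t r = LZ4F_compressFrame(dst.data(), dst.size(), src, n, nullptr);
      return LZ4F_isError(r) ? 0 : r;  // includes frame header/footer bytes
    }

    size_t compress_raw(const char* src, size_t n, std::vector<char>& dst) {
      dst.resize(LZ4_compressBound(static_cast<int>(n)));
      int r = LZ4_compress_default(src, dst.data(), static_cast<int>(n),
                                   static_cast<int>(dst.size()));
      return r > 0 ? static_cast<size_t>(r) : 0;  // payload only, no metadata
    }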
Regards
Antoine.