Application-level compression support in Java is being worked on (I would need to double-check whether the PR has been merged), and I don't think it has been integration-tested against C++/Python. I would imagine it would run into a similar issue with not being able to decode linked blocks.
On Thu, Jan 28, 2021 at 10:19 AM Joris Peeters <joris.mg.peet...@gmail.com> wrote:

> To be fair, I'm happy to apply it at IPC level. Just didn't realise that
> was a thing. IIUC what Antoine suggests, though, then just (leaving Python
> as-is and) changing my Java to
>
>     var is = new FileInputStream(path.toFile());
>     var reader = new ArrowStreamReader(is, allocator);
>     var schema = reader.getVectorSchemaRoot().getSchema();
>
> (i.e. just get rid of the lz4 input stream) should work, i.e. let the
> reader figure it out? I see no option to specify the compression in the
> reader, so it might detect it? This, however, gives:
>
>     java.io.IOException: Unexpected end of stream trying to read message.
>         at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:700)
>         at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:57)
>         at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:164)
>         at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:170)
>         at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:161)
>         at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:63)
>
> FWIW - and this makes sense now that I understand there's a difference
> between IPC compression and full stream compression - writing it in
> Python à la
>
>     fh = io.BytesIO()
>     writer = pa.RecordBatchStreamWriter(fh, table.schema)
>     writer.write_table(table)
>     writer.close()
>     bytes_ = fh.getvalue()
>     compressed_bytes = lz4.frame.compress(bytes_, compression_level=3,
>                                           block_linked=False)
>     with open(path, 'wb') as fh:
>         fh.write(compressed_bytes)
>
> works fine with the Java from the original email.
>
> -J
>
> On Thu, Jan 28, 2021 at 6:06 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> It might be worth opening up an issue with the lz4-java library.
>> This seems like the Java implementation doesn't fully support the LZ4
>> stream protocol?
>>
>> Antoine, in this case it looks like Joris is applying the compression and
>> decompression at the file level, NOT the IPC level.
>>
>> On Thu, Jan 28, 2021 at 10:01 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>> > On 28/01/2021 at 17:59, Joris Peeters wrote:
>> > > From Python, I'm dumping an LZ4-compressed arrow stream to a file,
>> > > using
>> > >
>> > >     with pa.output_stream(path, compression = 'lz4') as fh:
>> > >         writer = pa.RecordBatchStreamWriter(fh, table.schema)
>> > >         writer.write_table(table)
>> > >         writer.close()
>> > >
>> > > I then try reading this file from Java, starting with
>> > >
>> > >     var is = new LZ4FrameInputStream(new FileInputStream(path.toFile()));
>> > >
>> > > using the lz4-java library. That fails, however, with
>> >
>> > Well, that sounds expected. LZ4 compression in the IPC format does not
>> > work by compressing the whole stream. Instead, buffers in the stream
>> > are compressed individually, while metadata is uncompressed.
>> >
>> > So, you needn't wrap the stream with LZ4 yourself. Instead, just let
>> > the Java implementation of Arrow handle compression. It *should* work.
>> >
>> > Regards
>> >
>> > Antoine.