Aha, OK! Thanks for the help, all. I'll keep an eye on the Java side for the IPC compression, but for my current purpose doing full stream compression is totally fine.
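For reference, the IPC-level route on the Python side would look roughly like the below - a minimal sketch, assuming a pyarrow version whose IpcWriteOptions accepts a compression argument (the table contents and file name are just placeholders):

import pyarrow as pa

table = pa.table({"x": [1, 2, 3]})  # placeholder data

# IPC-level compression: each buffer in the stream is compressed
# individually and the metadata stays uncompressed, so the reader
# needs no outer LZ4 frame wrapper.
options = pa.ipc.IpcWriteOptions(compression="lz4")
with pa.OSFile("data.arrows", "wb") as sink:
    with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
        writer.write_table(table)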
On Thu, Jan 28, 2021 at 6:22 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Java support for application-level compression is being worked on (I would
> need to double check if the PR has been merged) and I don't think it's
> been integration tested with C++/Python. I would imagine it would run into
> a similar issue with not being able to decode linked blocks.
>
> On Thu, Jan 28, 2021 at 10:19 AM Joris Peeters <joris.mg.peet...@gmail.com>
> wrote:
>
>> To be fair, I'm happy to apply it at IPC level. Just didn't realise that
>> was a thing. IIUC what Antoine suggests, though, then just (leaving Python
>> as-is and) changing my Java to
>>
>> var is = new FileInputStream(path.toFile());
>> var reader = new ArrowStreamReader(is, allocator);
>> var schema = reader.getVectorSchemaRoot().getSchema();
>>
>> (i.e. just get rid of the lz4 input stream) should work, i.e. let the
>> reader figure it out? I see no option to specify the compression in the
>> reader, so it might detect it? This, however, gives:
>>
>> java.io.IOException: Unexpected end of stream trying to read message.
>>   at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:700)
>>   at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:57)
>>   at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:164)
>>   at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:170)
>>   at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:161)
>>   at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:63)
>>
>> FWIW - and this makes sense now that I understand there's a difference
>> between IPC compression and full stream compression - writing it in
>> Python à la
>>
>> fh = io.BytesIO()
>> writer = pa.RecordBatchStreamWriter(fh, table.schema)
>> writer.write_table(table)
>> writer.close()
>> bytes_ = fh.getvalue()
>> compressed_bytes = lz4.frame.compress(bytes_, compression_level=3,
>>     block_linked=False)
>> with open(path, 'wb') as fh:
>>     fh.write(compressed_bytes)
>>
>> works fine with the Java from the original email.
>>
>> -J
>>
>> On Thu, Jan 28, 2021 at 6:06 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> It might be worth opening up an issue with the lz4-java library. It seems
>>> like the Java implementation doesn't fully support the LZ4 stream
>>> protocol?
>>>
>>> Antoine, in this case it looks like Joris is applying the compression and
>>> decompression at the file level, NOT the IPC level.
>>>
>>> On Thu, Jan 28, 2021 at 10:01 AM Antoine Pitrou <anto...@python.org>
>>> wrote:
>>>
>>> > On 28/01/2021 at 17:59, Joris Peeters wrote:
>>> > > From Python, I'm dumping an LZ4-compressed arrow stream to a file,
>>> > > using
>>> > >
>>> > > with pa.output_stream(path, compression='lz4') as fh:
>>> > >     writer = pa.RecordBatchStreamWriter(fh, table.schema)
>>> > >     writer.write_table(table)
>>> > >     writer.close()
>>> > >
>>> > > I then try reading this file from Java, starting with
>>> > >
>>> > > var is = new LZ4FrameInputStream(new FileInputStream(path.toFile()));
>>> > >
>>> > > using the lz4-java library. That fails, however, with
>>> >
>>> > Well, that sounds expected. LZ4 compression in the IPC format does not
>>> > work by compressing the whole stream. Instead, buffers in the stream
>>> > are compressed individually, while metadata is uncompressed.
>>> >
>>> > So, you needn't wrap the stream with LZ4 yourself. Instead, just let
>>> > the Java implementation of Arrow handle compression. It *should* work.
>>> >
>>> > Regards
>>> >
>>> > Antoine.
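P.S. For completeness, the read-back half of the full-stream workaround in Python is just the inverse - a minimal sketch, assuming the lz4 Python package and the same path as in the snippets above:

import io
import lz4.frame
import pyarrow as pa

# Undo the whole-stream LZ4 frame first (the analogue of Java's
# LZ4FrameInputStream), then hand the raw IPC bytes to pyarrow.
with open(path, 'rb') as fh:
    raw = lz4.frame.decompress(fh.read())
table = pa.ipc.open_stream(io.BytesIO(raw)).read_all()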