Application-level compression support in Java is being worked on (I would need to double-check whether the PR has been merged), and I don't think it has been integration-tested against C++/Python. I would imagine it would run into a similar issue with not being able to decode linked blocks.
On Thu, Jan 28, 2021 at 10:19 AM Joris Peeters <joris.mg.peet...@gmail.com> wrote:

> To be fair, I'm happy to apply it at IPC level. Just didn't realise that
> was a thing. IIUC what Antoine suggests, though, then just (leaving Python
> as-is and) changing my Java to
>
>     var is = new FileInputStream(path.toFile());
>     var reader = new ArrowStreamReader(is, allocator);
>     var schema = reader.getVectorSchemaRoot().getSchema();
>
> (i.e. just get rid of the lz4 input stream) should work, i.e. let the
> reader figure it out? I see no option to specify the compression in the
> reader, so it might detect it? This, however, gives:
>
>     java.io.IOException: Unexpected end of stream trying to read message.
>         at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:700)
>         at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:57)
>         at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:164)
>         at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:170)
>         at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:161)
>         at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:63)
>
> FWIW - and this makes sense now that I understand there's a difference
> between IPC compression and full stream compression - writing it in
> Python à la
>
>     fh = io.BytesIO()
>     writer = pa.RecordBatchStreamWriter(fh, table.schema)
>     writer.write_table(table)
>     writer.close()
>     bytes_ = fh.getvalue()
>     compressed_bytes = lz4.frame.compress(bytes_, compression_level=3,
>                                           block_linked=False)
>     with open(path, 'wb') as fh:
>         fh.write(compressed_bytes)
>
> works fine with the Java from the original email.
>
> -J
>
> On Thu, Jan 28, 2021 at 6:06 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> It might be worth opening up an issue with the lz4-java library.
>> This seems like the Java implementation doesn't fully support the LZ4
>> stream protocol?
>>
>> Antoine, in this case it looks like Joris is applying the compression and
>> decompression at the file level, NOT the IPC level.
>>
>> On Thu, Jan 28, 2021 at 10:01 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>> > On 28/01/2021 at 17:59, Joris Peeters wrote:
>> > > From Python, I'm dumping an LZ4-compressed arrow stream to a file,
>> > > using
>> > >
>> > >     with pa.output_stream(path, compression = 'lz4') as fh:
>> > >         writer = pa.RecordBatchStreamWriter(fh, table.schema)
>> > >         writer.write_table(table)
>> > >         writer.close()
>> > >
>> > > I then try reading this file from Java, starting with
>> > >
>> > >     var is = new LZ4FrameInputStream(new FileInputStream(path.toFile()));
>> > >
>> > > using the lz4-java library. That fails, however, with
>> >
>> > Well, that sounds expected. LZ4 compression in the IPC format does not
>> > work by compressing the whole stream. Instead, buffers in the stream
>> > are compressed individually, while metadata is uncompressed.
>> >
>> > So, you needn't wrap the stream with LZ4 yourself. Instead, just let
>> > the Java implementation of Arrow handle compression. It *should* work.
>> >
>> > Regards
>> >
>> > Antoine.