On Thu, 28 Jan 2021 18:19:00 +0000 Joris Peeters <joris.mg.peet...@gmail.com> wrote:
> To be fair, I'm happy to apply it at IPC level. Just didn't realise that
> was a thing. IIUC what Antoine suggests, though, then just (leaving Python
> as-is and) changing my Java to
>
>     var is = new FileInputStream(path.toFile());
>     var reader = new ArrowStreamReader(is, allocator);
>     var schema = reader.getVectorSchemaRoot().getSchema();
>
> (i.e. just get rid of the lz4 input stream) should work, i.e. let the
> reader figure it out? I see no option to specify the compression in the
> reader, so it might detect it?

You would specify the compression in the *writer* (on the Python side),
using the *options* argument here:
https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream
or here:
https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_file.html#pyarrow.ipc.new_file

(Unfortunately, it seems we didn't document IpcWriteOptions, but you can
inspect it at the Python prompt:

>>> pa.ipc.IpcWriteOptions?
Init signature: pa.ipc.IpcWriteOptions(self, /, *args, **kwargs)
Docstring:
IpcWriteOptions(metadata_version=MetadataVersion.V5, *,
                use_legacy_format=False, compression=None,
                bool use_threads=True, bool emit_dictionary_deltas=False)

Serialization options for the IPC format.

Parameters
----------
metadata_version : MetadataVersion, default MetadataVersion.V5
    The metadata version to write.  V5 is the current and latest,
    V4 is the pre-1.0 metadata version (with incompatible Union layout).
use_legacy_format : bool, default False
    Whether to use the pre-Arrow 0.15 IPC format.
compression: str or None
    If not None, compression codec to use for record batch buffers.
    May only be "lz4", "zstd" or None.
use_threads: bool
    Whether to use the global CPU thread pool to parallelize any
    computational tasks like compression.
emit_dictionary_deltas: bool
    Whether to emit dictionary deltas. Default is false for maximum
    stream compatibility.
)

Regards

Antoine.
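
P.S. For concreteness, a minimal sketch of that writer side (untested;
the table contents, the "data.arrow" file name, and the use of pa.OSFile
are illustrative placeholders, not from this thread):

    import pyarrow as pa

    # Placeholder data; substitute your own table.
    table = pa.table({"x": [1, 2, 3]})

    # Request LZ4 compression for the record batch buffers.
    options = pa.ipc.IpcWriteOptions(compression="lz4")

    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
            writer.write_table(table)

Since the compression is recorded in the stream metadata itself, the Java
ArrowStreamReader should then be able to decompress transparently, without
wrapping the input in an LZ4 stream.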