On Thu, 28 Jan 2021 18:19:00 +0000 Joris Peeters <joris.mg.peet...@gmail.com> wrote:
> To be fair, I'm happy to apply it at IPC level. Just didn't realise that
> was a thing. IIUC what Antoine suggests, though, then just (leaving Python
> as-is and) changing my Java to
>
>     var is = new FileInputStream(path.toFile());
>     var reader = new ArrowStreamReader(is, allocator);
>     var schema = reader.getVectorSchemaRoot().getSchema();
>
> (i.e. just get rid of the lz4 input stream) should work, i.e. let the
> reader figure it out? I see no option to specify the compression in the
> reader, so it might detect it?

You would specify the compression in the *writer* (on the Python side),
using the *options* argument here:
https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream
or here:
https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_file.html#pyarrow.ipc.new_file

(Unfortunately, it seems we didn't document IpcWriteOptions, but you can
inspect it at the Python prompt:

>>> pa.ipc.IpcWriteOptions?
Init signature: pa.ipc.IpcWriteOptions(self, /, *args, **kwargs)
Docstring:
IpcWriteOptions(metadata_version=MetadataVersion.V5, *,
                use_legacy_format=False, compression=None,
                bool use_threads=True, bool emit_dictionary_deltas=False)

Serialization options for the IPC format.

Parameters
----------
metadata_version : MetadataVersion, default MetadataVersion.V5
    The metadata version to write.  V5 is the current and latest,
    V4 is the pre-1.0 metadata version (with incompatible Union layout).
use_legacy_format : bool, default False
    Whether to use the pre-Arrow 0.15 IPC format.
compression: str or None
    If not None, compression codec to use for record batch buffers.
    May only be "lz4", "zstd" or None.
use_threads: bool
    Whether to use the global CPU thread pool to parallelize any
    computational tasks like compression.
emit_dictionary_deltas: bool
    Whether to emit dictionary deltas. Default is false for maximum
    stream compatibility.
)

Regards

Antoine.
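
P.S. For concreteness, a minimal sketch of that writer side (untested;
the table contents, the "data.arrow" file name, and the use of pa.OSFile
are illustrative placeholders, not from this thread):

    import pyarrow as pa

    # Placeholder data; substitute your own table.
    table = pa.table({"x": [1, 2, 3]})

    # Request LZ4 compression for the record batch buffers.
    options = pa.ipc.IpcWriteOptions(compression="lz4")

    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
            writer.write_table(table)

Since the compression is recorded in the stream metadata itself, the Java
ArrowStreamReader should then be able to decompress transparently, without
wrapping the input in an LZ4 stream.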