[
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15608905#comment-15608905
]
Julien Le Dem commented on ARROW-300:
-------------------------------------
I don't think we really need to compress each buffer independently;
compression could simply be an encapsulation at the transport level. It
sounds like we don't want to exchange compressed buffers in memory (that is,
without sending them over the wire or writing them to disk).
In Parquet, columns can be decompressed independently because they can be
retrieved independently. In Arrow, the entire RecordBatch corresponds to a
request and will be compressed and decompressed in its entirety every time,
which means we can simply compress the whole batch as a single unit.
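A minimal sketch of compressing a whole batch together, assuming a batch is
just an ordered list of byte buffers. The helper names and the small length
header are hypothetical and are not part of the Arrow format; zlib stands in
for whichever codec is chosen.

```python
import struct
import zlib


def compress_batch(buffers):
    """Compress all buffers of one batch as a single unit (illustrative)."""
    # Hypothetical header: buffer count (uint32), then each buffer's
    # length (uint64), little-endian, so the buffers can be split again.
    header = struct.pack("<I", len(buffers))
    header += b"".join(struct.pack("<Q", len(b)) for b in buffers)
    return zlib.compress(header + b"".join(buffers))


def decompress_batch(blob):
    """Inverse of compress_batch: returns the list of original buffers."""
    raw = zlib.decompress(blob)
    (count,) = struct.unpack_from("<I", raw, 0)
    lengths = struct.unpack_from("<%dQ" % count, raw, 4)
    offset = 4 + 8 * count
    buffers = []
    for n in lengths:
        buffers.append(raw[offset:offset + n])
        offset += n
    return buffers
```

Because the batch is always read as a whole, nothing here needs per-buffer
compression metadata; the reader decompresses once and slices.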
For simplicity, I'd vote not to have compression in the Schema metadata:
https://github.com/apache/arrow/blob/2f84493371bd8fae30b8e042984c9d6ba5419c5f/format/Message.fbs#L186
That's one less thing for implementors to worry about.
We can have compression at the transport level (RPC, file format, ...).
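One way transport-level encapsulation could look, sketched under assumptions:
the already-serialized message is treated as opaque bytes and wrapped in a
small frame carrying a codec tag and the uncompressed size. The frame layout,
the codec ids, and the function names are all hypothetical, not anything
defined by Arrow.

```python
import struct
import zlib

# Hypothetical codec ids for this sketch only.
CODEC_NONE = 0
CODEC_ZLIB = 1

# Frame prefix: codec id (uint8) + uncompressed size (uint64), little-endian.
_PREFIX = struct.Struct("<BQ")


def wrap(payload, codec=CODEC_ZLIB):
    """Wrap opaque message bytes for the wire or disk."""
    if codec == CODEC_ZLIB:
        body = zlib.compress(payload)
    elif codec == CODEC_NONE:
        body = bytes(payload)
    else:
        raise ValueError("unknown codec id: %r" % codec)
    return _PREFIX.pack(codec, len(payload)) + body


def unwrap(frame):
    """Recover the original message bytes from a frame made by wrap()."""
    codec, size = _PREFIX.unpack_from(frame, 0)
    body = frame[_PREFIX.size:]
    if codec == CODEC_ZLIB:
        payload = zlib.decompress(body)
    elif codec == CODEC_NONE:
        payload = bytes(body)
    else:
        raise ValueError("unknown codec id: %r" % codec)
    if len(payload) != size:
        raise ValueError("size mismatch after decoding")
    return payload
```

With this shape, the in-memory format and the Schema stay unaware of
compression; only the sender and receiver of the frame need to agree on the
codec.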
As for the supported compressors, I would vote for SNAPPY and GZIP (zlib) to
start with, as they provide the two options you describe (higher compression
or higher throughput), and SNAPPY is easier to use from Java than lz4.
> [Format] Add buffer compression option to IPC file format
> ---------------------------------------------------------
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Format
> Reporter: Wes McKinney
>
> It may be useful, if data is to be sent over the wire, to compress the data
> buffers themselves as they are being written in the file layout.
> I would propose that we keep this extremely simple, with a global buffer
> compression setting in the file Footer. Probably the only two compressors
> worth supporting out of the box would be zlib (higher compression ratios)
> and lz4 (better performance).
> What does everyone think?