[
https://issues.apache.org/jira/browse/ARROW-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523786#comment-17523786
]
Kyle Barron commented on ARROW-8674:
------------------------------------
Thanks for the feedback!
{quote}We also should look into how much benefit we actually get from
compression since most servers already support transparent gzip compression and
so compressing an already compressed file will just incur overhead.
{quote}
I think there are several reasons why it's important to support compressed
files:
* Popular tools in the ecosystem write data with compression turned on by
default. I'm specifically looking at PyArrow/pandas, which [write
LZ4-compressed Feather files by
default|https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html#pyarrow.feather.write_feather].
If a web app wants to display Arrow data from unknown sources, having some way
to load all such files is ideal.
* It's true that servers usually offer transparent gzip compression, but there
are reasons a user might not want it. For one, gzip compression is much
slower than LZ4 or ZSTD compression. In the example below, using [this
file|https://ookla-open-data.s3.us-west-2.amazonaws.com/parquet/performance/type=mobile/year=2019/quarter=1/2019-01-01_performance_mobile_tiles.parquet],
writing a 753MB Arrow table to a memory buffer uncompressed and then
compressing it with the standard library's `gzip.compress` took *2m46s*. The
Python interface is slower than the gzip command line, but `time gzip -c
uncompressed_table.arrow > /dev/null` still took *36s*. Meanwhile, writing with
LZ4 compression took only *1.48s* and with ZSTD only *1.63s*. In this example
the LZ4 file was 75% larger than the gzip file, while the ZSTD file was 6%
smaller than the gzip one. Of course this is just one example, but it at least
lends credence to the idea that a developer might prefer LZ4 or ZSTD over gzip.
* I think supporting compression in `tableToIPC` would be quite valuable for
any use case where an app wants to push Arrow data to a server (see the sketch
after this list).
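To make that last point concrete, here's a rough sketch of what a compression
option on `tableToIPC` could look like. The options bag and its `codec` key are
purely hypothetical, not part of the current JS API:
```
import { tableFromArrays, tableToIPC } from 'apache-arrow';

// Small table to serialize.
const table = tableFromArrays({ id: Int32Array.from([1, 2, 3]) });

// Today's signature is tableToIPC(table, 'file' | 'stream').
// The commented-out options bag is the hypothetical extension:
// const bytes = tableToIPC(table, 'stream', { codec: 'lz4' });
const bytes = tableToIPC(table, 'stream');

// A compressed payload would then be cheaper to push to a server.
await fetch('/ingest', { method: 'POST', body: bytes });
```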
{quote}Looking at lz4js, it's so small that it's probably okay to pull in a
dependency by default.
{quote}
Wow, that is impressively small. It might make sense to pull that in by
default. The issue tracker is mostly empty, though there is one report of data
compressed by lz4js that other tools could not read.
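For reference, the whole lz4js surface appears to be just a pair of
`compress`/`decompress` functions over byte arrays; a minimal round trip,
taking that API shape from its README as an assumption:
```
// Assumption: lz4js exposes compress()/decompress() over byte arrays,
// per its README; it ships no TypeScript types, hence the require.
const lz4 = require('lz4js');

const raw = new Uint8Array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]);
const compressed = lz4.compress(raw);
const restored = lz4.decompress(compressed);

console.log(raw.length, compressed.length, restored.length);
```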
{quote}I definitely don't want to pull in wasm into the library as it will
break people's workflows.
{quote}
I agree. I'm fine with not pulling in a wasm library by default.
{quote}Could you look at the available js libraries and see what their sizes
are? Also, is lz4 or zstd much more common than the other?
{quote}
None of the ZSTD libraries I came across were pure JS, and the only pure-JS
LZ4 one was lz4js. Aside from something like transpiling wasm to JS, which I
think would be too complex for arrow JS, the only feasible default I see is
using lz4js while also supporting a registry. I don't know whether LZ4 or ZSTD
is more common; LZ4 is the default for PyArrow when writing a table.
{quote}If the libraries are too heavy, we can think about a plugin system. We
could make our registry be synchronous.
{quote}
I think it would be possible to force the `compress` and `decompress` functions
in the plugin system to be synchronous. That would just require the user to
finish any async initialization before trying to read or write a file, since,
as far as I know, wasm bundles can't be instantiated synchronously. A sketch:
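Every name below is hypothetical (none of this exists in arrow JS today): the
registry itself stays synchronous, and any async (e.g. wasm) setup happens
before registration.
```
// Hypothetical plugin registry; all names are illustrative,
// not part of apache-arrow today.
type Codec = {
  compress(data: Uint8Array): Uint8Array;
  decompress(data: Uint8Array, uncompressedLength?: number): Uint8Array;
};

const registry = new Map<'lz4' | 'zstd', Codec>();

export function registerCodec(name: 'lz4' | 'zstd', codec: Codec): void {
  registry.set(name, codec);
}

// The IPC reader would look codecs up synchronously while decoding a
// compressed RecordBatch body, throwing if none was registered.
export function getCodec(name: 'lz4' | 'zstd'): Codec {
  const codec = registry.get(name);
  if (!codec) throw new Error(`no codec registered for ${name}`);
  return codec;
}

// Usage: finish async (e.g. wasm) initialization *before* reading, then
// hand the registry plain synchronous functions:
//   const zstd = await import('some-wasm-zstd'); // hypothetical module
//   await zstd.init();
//   registerCodec('zstd', { compress: zstd.compress, decompress: zstd.decompress });
```
With this shape, opening a compressed file before the matching codec is
registered would fail fast with a clear error rather than hanging on async
initialization.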
------
Example of writing the table to a buffer uncompressed, then compressing with
`gzip.compress` from the Python standard library (this assumes `import gzip`
and `import pyarrow as pa`, with `table` already loaded):
```
In [37]: %%time
    ...: options = pa.ipc.IpcWriteOptions(compression=None)
    ...: with pa.BufferOutputStream() as buf:
    ...:     with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
    ...:         writer.write_table(table)
    ...:
    ...:     reader = pa.BufferReader(buf.getvalue())
    ...:     reader.seek(0)
    ...:     out = gzip.compress(reader.read())
    ...:     print(len(out))
    ...:
175807183
CPU times: user 2min 41s, sys: 1.74 s, total: 2min 43s
Wall time: 2min 46s
```
Example of writing the table to a buffer with LZ4 compression:
```
In [40]: %%time
    ...: options = pa.ipc.IpcWriteOptions(compression='lz4')
    ...: with pa.BufferOutputStream() as buf:
    ...:     with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
    ...:         writer.write_table(table)
    ...:
    ...:     print(buf.tell())
313078576
CPU times: user 1.48 s, sys: 322 ms, total: 1.81 s
Wall time: 1.48 s
```
Example of writing the table to a buffer with ZSTD compression:
```
In [41]: %%time
    ...: options = pa.ipc.IpcWriteOptions(compression='zstd')
    ...: with pa.BufferOutputStream() as buf:
    ...:     with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
    ...:         writer.write_table(table)
    ...:
    ...:     print(buf.tell())
166563176
CPU times: user 2.28 s, sys: 178 ms, total: 2.45 s
Wall time: 1.63 s
```
> [JS] Implement IPC RecordBatch body buffer compression from ARROW-300
> ---------------------------------------------------------------------
>
> Key: ARROW-8674
> URL: https://issues.apache.org/jira/browse/ARROW-8674
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: JavaScript
> Reporter: Wes McKinney
> Priority: Major
>
> This may not be a hard requirement for JS because this would require pulling
> in implementations of LZ4 and ZSTD which not all users may want