[
https://issues.apache.org/jira/browse/ARROW-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523786#comment-17523786
]
Kyle Barron commented on ARROW-8674:
------------------------------------
Thanks for the feedback!
{quote}We also should look into how much benefit we actually get from
compression since most servers already support transparent gzip compression and
so compressing an already compressed file will just incur overhead.
{quote}
I think there are several reasons why it's important to support compressed
files:
* Popular tools in the ecosystem write data with compression turned on by
default. I'm specifically looking at PyArrow/pandas, which [write
LZ4-compressed Feather files by
default|https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html#pyarrow.feather.write_feather].
If a web app wants to display Arrow data from unknown sources, having some way
to load all such files is ideal.
* It's true that servers usually offer transparent gzip compression, but there
are reasons a user might not want it. For one, gzip compression is much
slower than LZ4 or ZSTD compression. In the example below, using [this
file|https://ookla-open-data.s3.us-west-2.amazonaws.com/parquet/performance/type=mobile/year=2019/quarter=1/2019-01-01_performance_mobile_tiles.parquet],
writing a 753MB Arrow table to a memory buffer uncompressed and then
compressing it with the standard library's `gzip.compress` took *2m46s*. The
Python interface is slower than the gzip command line, but `time gzip -c
uncompressed_table.arrow > /dev/null` still took *36s*. Meanwhile, writing with
LZ4 compression took only *1.48s* and with ZSTD only *1.63s*. In this example
the LZ4 file was 75% larger than the gzip file, while the ZSTD file was 6%
smaller than the gzip one. Of course this is just one example, but it at least
lends credence to the idea that a developer might prefer LZ4 or ZSTD over gzip.
* I think supporting compression in `tableToIPC` would be quite valuable for
any use case where an app wants to push Arrow data to a server (see the sketch
after this list).
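To make that last point concrete, here's a rough sketch of what a compression
option on `tableToIPC` could look like. The options bag and its `codec` key are
purely hypothetical, not part of the current JS API:
```
import { tableFromArrays, tableToIPC } from 'apache-arrow';

// Small table to serialize.
const table = tableFromArrays({ id: Int32Array.from([1, 2, 3]) });

// Today's signature is tableToIPC(table, 'file' | 'stream').
// The commented-out options bag is the hypothetical extension:
// const bytes = tableToIPC(table, 'stream', { codec: 'lz4' });
const bytes = tableToIPC(table, 'stream');

// A compressed payload would then be cheaper to push to a server.
await fetch('/ingest', { method: 'POST', body: bytes });
```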
{quote}Looking at lz4js, it's so small that it's probably okay to pull in a
dependency by default.
{quote}
Wow, that is impressively small. It might make sense to pull that in by
default. The issue tracker is mostly empty, though there is one report of data
compressed by lz4js that other tools could not read.
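For reference, the whole lz4js surface appears to be just a pair of
`compress`/`decompress` functions over byte arrays; a minimal round trip,
taking that API shape from its README as an assumption:
```
// Assumption: lz4js exposes compress()/decompress() over byte arrays,
// per its README; it ships no TypeScript types, hence the require.
const lz4 = require('lz4js');

const raw = new Uint8Array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]);
const compressed = lz4.compress(raw);
const restored = lz4.decompress(compressed);

console.log(raw.length, compressed.length, restored.length);
```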
{quote}I definitely don't want to pull in wasm into the library as it will
break people's workflows.
{quote}
I agree. I'm fine with not pulling in a wasm library by default.
{quote}Could you look at the available js libraries and see what their sizes
are? Also, is lz4 or zstd much more common than the other?
{quote}
None of the ZSTD libraries I came across were pure JS, and the only pure-JS
LZ4 one was lz4js. Aside from something like transpiling wasm to JS, which I
think would be too complex for arrow JS, the only feasible default I see is
using lz4js while also supporting a registry. I don't know whether LZ4 or ZSTD
is more common; LZ4 is the default for PyArrow when writing a table.
{quote}If the libraries are too heavy, we can think about a plugin system. We
could make our registry be synchronous.
{quote}
I think it would be possible to force the `compress` and `decompress` functions
in the plugin system to be synchronous. That would just require the user to
finish any async initialization before trying to read or write a file, since,
as far as I know, wasm bundles can't be instantiated synchronously. A sketch:
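Every name below is hypothetical (none of this exists in arrow JS today): the
registry itself stays synchronous, and any async (e.g. wasm) setup happens
before registration.
```
// Hypothetical plugin registry; all names are illustrative,
// not part of apache-arrow today.
type Codec = {
  compress(data: Uint8Array): Uint8Array;
  decompress(data: Uint8Array, uncompressedLength?: number): Uint8Array;
};

const registry = new Map<'lz4' | 'zstd', Codec>();

export function registerCodec(name: 'lz4' | 'zstd', codec: Codec): void {
  registry.set(name, codec);
}

// The IPC reader would look codecs up synchronously while decoding a
// compressed RecordBatch body, throwing if none was registered.
export function getCodec(name: 'lz4' | 'zstd'): Codec {
  const codec = registry.get(name);
  if (!codec) throw new Error(`no codec registered for ${name}`);
  return codec;
}

// Usage: finish async (e.g. wasm) initialization *before* reading, then
// hand the registry plain synchronous functions:
//   const zstd = await import('some-wasm-zstd'); // hypothetical module
//   await zstd.init();
//   registerCodec('zstd', { compress: zstd.compress, decompress: zstd.decompress });
```
With this shape, opening a compressed file before the matching codec is
registered would fail fast with a clear error rather than hanging on async
initialization.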
------
Example of writing the table to a buffer uncompressed, then compressing with
`gzip.compress` from the Python standard library (this assumes `import gzip`
and `import pyarrow as pa`, with `table` already loaded):
```
In [37]: %%time
    ...: options = pa.ipc.IpcWriteOptions(compression=None)
    ...: with pa.BufferOutputStream() as buf:
    ...:     with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
    ...:         writer.write_table(table)
    ...:
    ...:     reader = pa.BufferReader(buf.getvalue())
    ...:     reader.seek(0)
    ...:     out = gzip.compress(reader.read())
    ...:     print(len(out))
    ...:
175807183
CPU times: user 2min 41s, sys: 1.74 s, total: 2min 43s
Wall time: 2min 46s
```
Example of writing the table to a buffer with LZ4 compression:
```
In [40]: %%time
    ...: options = pa.ipc.IpcWriteOptions(compression='lz4')
    ...: with pa.BufferOutputStream() as buf:
    ...:     with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
    ...:         writer.write_table(table)
    ...:
    ...:     print(buf.tell())
313078576
CPU times: user 1.48 s, sys: 322 ms, total: 1.81 s
Wall time: 1.48 s
```
Example of writing the table to a buffer with ZSTD compression:
```
In [41]: %%time
    ...: options = pa.ipc.IpcWriteOptions(compression='zstd')
    ...: with pa.BufferOutputStream() as buf:
    ...:     with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
    ...:         writer.write_table(table)
    ...:
    ...:     print(buf.tell())
166563176
CPU times: user 2.28 s, sys: 178 ms, total: 2.45 s
Wall time: 1.63 s
```
> [JS] Implement IPC RecordBatch body buffer compression from ARROW-300
> ---------------------------------------------------------------------
>
> Key: ARROW-8674
> URL: https://issues.apache.org/jira/browse/ARROW-8674
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: JavaScript
> Reporter: Wes McKinney
> Priority: Major
>
> This may not be a hard requirement for JS because this would require pulling
> in implementations of LZ4 and ZSTD which not all users may want