[ 
https://issues.apache.org/jira/browse/ARROW-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522941#comment-17522941
 ] 

Kyle Barron edited comment on ARROW-8674 at 4/18/22 3:26 PM:
-------------------------------------------------------------

Hello! I'd like to revisit this issue and potentially submit a PR for this.

I think there are various reasons why we might not want to pull in LZ4 and ZSTD 
implementations by default:
 * Bundle-size-conscious users may not want any codecs, or may not use the 
Arrow IPC features at all. The WASM codecs in 
[numcodecs.js|https://github.com/manzt/numcodecs.js] appear to be 17.1KB for 
LZ4 and 206KB for ZSTD (uncompressed).
 * Some users may prefer dynamically importing codecs as required, but this 
requires a slightly more complex setup (at least it requires choosing a CDN 
from which to import the bundle, right?).
 * I came across at least 4 LZ4 implementations and at least 6 ZSTD 
implementations. It could be better to leave the choice of implementation to 
the user. If the user already uses one implementation in their app, letting 
them choose the same implementation in Arrow JS would reduce their bundle 
size.
 * At least one LZ4 implementation is in [pure 
JS|https://github.com/Benzinga/lz4js], with no WASM components. Some users may 
prefer a pure JS library for simplicity.

How would others feel about a codec registry system? Something like what 
[Zarr.js allows|http://guido.io/zarr.js/#/installation?id=zarrjs-core-export], 
where you can [dynamically register 
codecs|https://github.com/gzuidhof/zarr.js/blob/29280463ff2f275c31c1fa0f002daa947b8f09b2/src/compression/registry.ts]
 on demand.
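
To make the registry idea concrete, here is a minimal sketch of what I have in mind. Every name in it (`Codec`, `CodecRegistry`, `compressionRegistry`, `identityCodec`) is hypothetical and not existing Arrow JS API. The two keys mirror the `LZ4_FRAME` and `ZSTD` compression types in the IPC format, and the identity codec is only a stand-in for a real implementation.

```typescript
// Hypothetical sketch: none of these names exist in Arrow JS today.
type CompressionType = "lz4_frame" | "zstd";

interface Codec {
  // Takes the compressed bytes and returns the uncompressed bytes.
  decode(data: Uint8Array): Uint8Array;
  // Optional, since a reader only needs decode.
  encode?(data: Uint8Array): Uint8Array;
}

class CodecRegistry {
  private codecs = new Map<CompressionType, Codec>();

  // Register (or replace) the codec used for a compression type.
  set(type: CompressionType, codec: Codec): void {
    this.codecs.set(type, codec);
  }

  has(type: CompressionType): boolean {
    return this.codecs.has(type);
  }

  // Look up a codec, failing loudly if the user never registered one.
  get(type: CompressionType): Codec {
    const codec = this.codecs.get(type);
    if (codec === undefined) {
      throw new Error(`No codec registered for "${type}"; register one before reading.`);
    }
    return codec;
  }
}

const compressionRegistry = new CodecRegistry();

// Stand-in codec for illustration only: it copies bytes and does no real decompression.
const identityCodec: Codec = { decode: (data) => data.slice() };
compressionRegistry.set("lz4_frame", identityCodec);
```

A user would call `set` once at startup with whichever LZ4 or ZSTD library they already bundle, and the reader would call `get` when it meets a compressed record batch.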

The `arrow.tableFromIPC` function is currently synchronous. Unless we change 
it to be async, we can't import a codec _after_ seeing that a data file uses 
a given compression, because a dynamic import is necessarily async.
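
A small sketch of the constraint, under stated assumptions: `loadCodec` is a hypothetical stand-in for something like `await import(...)` of a codec package, and `decodeSync` / `decodeAsync` are illustrative names, not Arrow JS functions. The synchronous path can only use a codec that is already registered, while the asynchronous path can fetch one on first use.

```typescript
type Decode = (data: Uint8Array) => Uint8Array;

// Codecs that are available right now, keyed by compression name.
const registered = new Map<string, Decode>();

// Names passed to the loader, kept only so the example is observable.
const loadedNames: string[] = [];

// Hypothetical stand-in for a dynamic import of a codec package.
// The returned "codec" just copies bytes; it is not a real decompressor.
async function loadCodec(name: string): Promise<Decode> {
  loadedNames.push(name);
  return (data) => data.slice();
}

// Synchronous path: the codec must be registered before reading starts.
function decodeSync(name: string, data: Uint8Array): Uint8Array {
  const decode = registered.get(name);
  if (decode === undefined) {
    throw new Error(`Codec "${name}" is not registered`);
  }
  return decode(data);
}

// Asynchronous path: the codec can be loaded after the compression is known.
async function decodeAsync(name: string, data: Uint8Array): Promise<Uint8Array> {
  let decode = registered.get(name);
  if (decode === undefined) {
    decode = await loadCodec(name);
    registered.set(name, decode);
  }
  return decode(data);
}
```

So keeping `tableFromIPC` synchronous implies up-front registration, while lazy loading would need an async variant of the reader.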

In terms of implementation, I'd expect this to be relatively straightforward. 
Presumably we'd update `decodeBuffers` here: 
[https://github.com/apache/arrow/blob/b67e3c8ef1e173e1840c4fa897b7c6c493932e10/js/src/ipc/metadata/message.ts#L303].
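
As a rough sketch of the per-buffer work, and not the actual Arrow JS code: as I understand the IPC format, each compressed body buffer starts with the uncompressed length as an 8-byte little-endian signed integer, and a length of -1 means the bytes that follow were left uncompressed. `decodeBodyBuffer` and the `Decompress` callback are hypothetical names, and passing empty buffers through unchanged is my assumption.

```typescript
// Hypothetical callback supplied by whichever codec the user registered.
type Decompress = (compressed: Uint8Array, uncompressedLength: number) => Uint8Array;

const PREFIX_BYTES = 8;

function decodeBodyBuffer(buffer: Uint8Array, decompress: Decompress): Uint8Array {
  // Assumption: zero-length buffers carry no prefix and pass through unchanged.
  if (buffer.byteLength === 0) return buffer;
  if (buffer.byteLength < PREFIX_BYTES) {
    throw new Error("Compressed buffer is shorter than its 8-byte length prefix");
  }

  // Read the int64 prefix as two 32-bit halves to avoid depending on BigInt.
  const view = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength);
  const low = view.getUint32(0, true);
  const high = view.getInt32(4, true);
  const payload = buffer.subarray(PREFIX_BYTES);

  // A prefix of -1 (all bits set) marks a buffer stored without compression.
  if (high === -1 && low === 0xffffffff) return payload;

  const length = high * 4294967296 + low;
  if (length < 0 || !Number.isSafeInteger(length)) {
    throw new Error(`Invalid uncompressed length: ${length}`);
  }

  const out = decompress(payload, length);
  if (out.byteLength !== length) {
    throw new Error(`Expected ${length} bytes after decompression, got ${out.byteLength}`);
  }
  return out;
}
```

The `decompress` argument is where a user-chosen LZ4 or ZSTD implementation would plug in.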

References:

LZ4 implementations:
 * [https://github.com/gorhill/lz4-wasm] Edit: Looks like this is LZ4 block 
format only, whereas we need the LZ4 frame format.
 * [https://github.com/manzt/numcodecs.js/tree/main/codecs/lz4] 
 * [https://www.npmjs.com/package/lz4-wasm]
 * [https://github.com/Benzinga/lz4js] 

ZSTD implementations:
 * [https://github.com/manzt/numcodecs.js/tree/main/codecs/zstd] 
 * [https://github.com/bokuweb/zstd-wasm]
 * [https://github.com/yoshihitoh/zstd-codec]
 * [https://github.com/donmccurdy/zstddec] 
 * [https://github.com/fabiospampinato/zstandard-wasm]
 * [https://github.com/OneIdentity/zstd-js] 
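
On the block-versus-frame note in the LZ4 list: one way to check what a library produces is to look at the leading magic number. LZ4 frames start with 0x184D2204 and Zstandard frames with 0xFD2FB528, both stored little-endian. `sniffFrameFormat` is a hypothetical helper for experimentation, not part of any of the libraries listed.

```typescript
type FrameFormat = "lz4_frame" | "zstd" | "unknown";

// Magic numbers as they appear on the wire (little-endian byte order).
const LZ4_FRAME_MAGIC = [0x04, 0x22, 0x4d, 0x18]; // 0x184D2204
const ZSTD_FRAME_MAGIC = [0x28, 0xb5, 0x2f, 0xfd]; // 0xFD2FB528

function startsWithBytes(data: Uint8Array, magic: number[]): boolean {
  if (data.length < magic.length) return false;
  return magic.every((byte, i) => data[i] === byte);
}

// Classify a compressed payload by its leading magic number.
// "unknown" covers raw LZ4 blocks, which carry no magic number at all.
function sniffFrameFormat(data: Uint8Array): FrameFormat {
  if (startsWithBytes(data, LZ4_FRAME_MAGIC)) return "lz4_frame";
  if (startsWithBytes(data, ZSTD_FRAME_MAGIC)) return "zstd";
  return "unknown";
}
```

This is only a heuristic: a raw LZ4 block could in principle begin with the same four bytes, so a match suggests frame format but does not prove it.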



> [JS] Implement IPC RecordBatch body buffer compression from ARROW-300
> ---------------------------------------------------------------------
>
>                 Key: ARROW-8674
>                 URL: https://issues.apache.org/jira/browse/ARROW-8674
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: JavaScript
>            Reporter: Wes McKinney
>            Priority: Major
>
> This may not be a hard requirement for JS because this would require pulling 
> in implementations of LZ4 and ZSTD which not all users may want



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
