[ https://issues.apache.org/jira/browse/ARROW-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522941#comment-17522941 ]
Kyle Barron commented on ARROW-8674:
------------------------------------
Hello! I'd like to revisit this issue and potentially submit a PR for this.
I think there are various reasons why we might not want to pull in LZ4 and ZSTD
implementations by default:
* Bundle-size-conscious users may not want any codecs, or may not use the
Arrow IPC features at all. The WASM codecs in
[numcodecs.js|https://github.com/manzt/numcodecs.js] appear to be 17.1KB for
LZ4 and 206KB for ZSTD (uncompressed).
* Some users may prefer to import codecs dynamically as required, but this
needs a slightly more complex setup (at minimum, choosing a CDN from which to
import the bundle, right?)
* I came across at least 4 LZ4 implementations and at least 6 ZSTD
implementations. It could be better to leave the choice of implementation to
the user. If the user already uses one implementation in their app, letting
them choose the same implementation in Arrow JS would avoid bundling a second
one.
* At least one LZ4 implementation is in [pure
JS|https://github.com/Benzinga/lz4js], with no WASM components. Some users may
prefer a pure JS library for simplicity.
How would others feel about a codec registry system? Something like what
[Zarr.js allows|http://guido.io/zarr.js/#/installation?id=zarrjs-core-export],
where you can [dynamically register
codecs|https://github.com/gzuidhof/zarr.js/blob/29280463ff2f275c31c1fa0f002daa947b8f09b2/src/compression/registry.ts]
on demand.
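
To make the registry idea concrete, here is a minimal sketch of what it could look like. Every name in it (`CodecRegistry`, `Codec`, `CompressionType`) is hypothetical and does not exist in Arrow JS today; the two keys simply mirror the LZ4_FRAME and ZSTD compression types of the IPC format.

```typescript
// Hypothetical sketch: none of these names exist in Arrow JS.
type CompressionType = "LZ4_FRAME" | "ZSTD";

// The minimal surface a user-supplied codec would have to provide.
interface Codec {
  decode(data: Uint8Array): Uint8Array;
  encode?(data: Uint8Array): Uint8Array;
}

class CodecRegistry {
  private codecs = new Map<CompressionType, Codec>();

  // Register (or replace) the implementation used for a compression type.
  set(type: CompressionType, codec: Codec): void {
    this.codecs.set(type, codec);
  }

  has(type: CompressionType): boolean {
    return this.codecs.has(type);
  }

  // Look up a codec, failing loudly when the user never registered one.
  get(type: CompressionType): Codec {
    const codec = this.codecs.get(type);
    if (codec === undefined) {
      throw new Error(`No codec registered for ${type}`);
    }
    return codec;
  }
}
```

With this shape, Arrow JS would ship no codec at all, and a user would wrap whichever LZ4 or ZSTD package they already bundle in a `Codec` object and pass it to `set`.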
The `arrow.tableFromIPC` function is currently synchronous, so unless we
changed that function to be async, we wouldn't be able to import the codec
_after_ seeing that a data file has a given compression, because a dynamic
import would have to be async.
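
One way around the sync/async mismatch, sketched below, is to split the work in two: an async preload step the application runs up front, and a sync lookup used while reading. The names `preloadCodec` and `decompressSync` are made up for illustration, and the `loader` argument stands in for whatever async work fetches the codec.

```typescript
// Hypothetical names throughout; nothing here is an existing Arrow JS API.
type Decompress = (data: Uint8Array) => Uint8Array;

const loaded = new Map<string, Decompress>();

// Async step: run once, before any IPC data is parsed.
async function preloadCodec(
  name: string,
  loader: () => Promise<Decompress>,
): Promise<void> {
  loaded.set(name, await loader());
}

// Sync step: what a synchronous reader could call while decoding buffers.
function decompressSync(name: string, data: Uint8Array): Uint8Array {
  const codec = loaded.get(name);
  if (codec === undefined) {
    throw new Error(`Codec "${name}" was not preloaded`);
  }
  return codec(data);
}
```

In an application, the loader would typically wrap a dynamic `import()` of whichever LZ4 or ZSTD package was chosen. The cost is that the caller has to know, or guess, the compression before opening the file.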
In terms of implementation, I'd expect it to be relatively straightforward.
Presumably we would update `decodeBuffers` here:
https://github.com/apache/arrow/blob/b67e3c8ef1e173e1840c4fa897b7c6c493932e10/js/src/ipc/metadata/message.ts#L303
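
As a rough idea of the per-buffer work involved: as I understand the IPC format, each compressed body buffer starts with its uncompressed length as a 64-bit little-endian signed integer, and a value of -1 means the remaining bytes are stored uncompressed. The helper below is a hypothetical sketch of that step only. `decodeBodyBuffer` is not an Arrow JS function, and how it would be wired into the reader is left open.

```typescript
// Hypothetical helper; not part of Arrow JS.
type Decompress = (data: Uint8Array) => Uint8Array;

function decodeBodyBuffer(buf: Uint8Array, decompress: Decompress): Uint8Array {
  // Assumption: zero-length buffers carry no length prefix.
  if (buf.byteLength === 0) return buf;
  if (buf.byteLength < 8) {
    throw new Error("Compressed buffer is missing its 8-byte length prefix");
  }
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  // Read the little-endian int64 as two 32-bit halves to avoid BigInt.
  const low = view.getUint32(0, true);
  const high = view.getInt32(4, true);
  const body = buf.subarray(8);
  // A prefix of -1 marks a buffer that was left uncompressed.
  if (high === -1 && low === 0xffffffff) return body;
  const expected = high * 4294967296 + low;
  const out = decompress(body);
  if (out.byteLength !== expected) {
    throw new Error(`Expected ${expected} bytes, got ${out.byteLength}`);
  }
  return out;
}
```

The `decompress` callback is where a user-registered LZ4 or ZSTD implementation would plug in.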
References:
LZ4 implementations:
* [https://github.com/gorhill/lz4-wasm]
* [https://github.com/manzt/numcodecs.js/tree/main/codecs/lz4]
* [https://www.npmjs.com/package/lz4-wasm]
* [https://github.com/Benzinga/lz4js]
ZSTD implementations:
* [https://github.com/manzt/numcodecs.js/tree/main/codecs/zstd]
* [https://github.com/bokuweb/zstd-wasm]
* [https://github.com/yoshihitoh/zstd-codec]
* [https://github.com/donmccurdy/zstddec]
* [https://github.com/fabiospampinato/zstandard-wasm]
* [https://github.com/OneIdentity/zstd-js]
> [JS] Implement IPC RecordBatch body buffer compression from ARROW-300
> ---------------------------------------------------------------------
>
> Key: ARROW-8674
> URL: https://issues.apache.org/jira/browse/ARROW-8674
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: JavaScript
> Reporter: Wes McKinney
> Priority: Major
>
> This may not be a hard requirement for JS because this would require pulling
> in implementations of LZ4 and ZSTD which not all users may want