Djjanks opened a new pull request, #14:
URL: https://github.com/apache/arrow-js/pull/14

   There is a replica of this PR in the old monorepo: [PR #46493](https://github.com/apache/arrow/pull/46493).
   
   ### Rationale for this change
   This change introduces support for reading compressed Arrow IPC streams in JavaScript. The primary motivation is the need to read Arrow IPC streams in the browser when they are transmitted over the network in compressed form to reduce network load.
   
   Several reasons support this enhancement:
   - A personal need in another project to read compressed Arrow IPC streams.
   - Community demand, as seen in [Issue #24833](https://github.com/apache/arrow/issues/24833).
   - A similar implementation was attempted in [PR #13076](https://github.com/apache/arrow/pull/13076) but was never merged; many thanks to @kylebarron for that work.
   - Other language implementations (e.g., C++, Python, Rust) already support 
IPC compression.
   
   ### What changes are included in this PR?
   - Support for decoding compressed RecordBatch buffers during reading.
   - Each buffer is decompressed individually, offsets are recalculated with 8-byte alignment, and a new metadata.RecordBatch is constructed before loading vectors.
   - Only decompression is implemented; compression (writing) is not supported yet.
   - Currently tested with the LZ4 codec via the lz4js library. lz4-wasm was evaluated but rejected due to incompatibility with the LZ4 Frame format.
   - The decompression logic is isolated in _loadRecordBatch() in the RecordBatchReaderImpl class.
   - A codec.decode function is retrieved from the compressionRegistry and applied per buffer, so users can choose a suitable library. A sketch of the scheme follows this list.
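   For illustration, here is a minimal sketch of that per-buffer scheme (the `decompressBuffers` helper and the region shape are illustrative, not the PR's actual internals). Per the IPC format, each buffer body starts with an int64 uncompressed length, and a value of -1 marks a buffer that was written uncompressed:
   ```ts
   import { Codec } from 'apache-arrow';

   const LENGTH_PREFIX = 8; // each body buffer starts with an int64 uncompressed length

   function decompressBuffers(
       body: Uint8Array,
       regions: { offset: number; length: number }[],
       codec: Codec,
   ): Uint8Array[] {
       return regions.map(({ offset, length }) => {
           const view = new DataView(body.buffer, body.byteOffset + offset, LENGTH_PREFIX);
           const uncompressedLength = Number(view.getBigInt64(0, true));
           const data = body.subarray(offset + LENGTH_PREFIX, offset + length);
           // A prefix of -1 signals that this buffer's body is not compressed.
           return uncompressedLength === -1 ? data : codec.decode(data);
       });
   }
   ```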
   
   #### Additional notes:
   1. Codec compatibility caveats
   Not all JavaScript LZ4 libraries are compatible with the Arrow IPC format. 
For example:
   - lz4js works correctly as it supports the LZ4 Frame Format.
   - lz4-wasm is not compatible, as it expects raw LZ4 blocks and fails to 
decompress LZ4 frame data.
   This can result in silent or cryptic errors. To improve the developer experience, we could:
   - Wrap codec.decode calls in try/catch and surface a clearer error message if decompression fails.
   - Add an optional check in compressionRegistry.set() to validate that the codec supports the LZ4 Frame format. One way would be to compress dummy data and inspect the first 4 bytes for the expected LZ4 Frame magic header (0x04 0x22 0x4D 0x18), as in the sketch below.
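   A possible shape for that registry-time check (a hypothetical helper, not part of this PR):
   ```ts
   // LZ4 Frame magic number 0x184D2204, in little-endian byte order.
   const LZ4_FRAME_MAGIC = [0x04, 0x22, 0x4d, 0x18];

   function validateLz4Frame(codec: { encode(data: Uint8Array): Uint8Array }): void {
       const sample = codec.encode(new Uint8Array([1, 2, 3, 4]));
       if (!LZ4_FRAME_MAGIC.every((byte, i) => sample[i] === byte)) {
           throw new Error('Registered codec does not produce LZ4 Frame format data');
       }
   }
   ```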
   2. Reconstruction of metadata.RecordBatch
   After decompressing the buffers, new BufferRegion entries are calculated to 
match the uncompressed data layout. A new metadata.RecordBatch is constructed 
with the updated buffer regions and passed into _loadVectors().
   This introduces a mutation-like pattern that may break assumptions in the current design. However, it is necessary because:
   - _loadVectors() depends strictly on the offsets in header.buffers, which no longer match the decompressed buffer layout.
   - Short of changing either _loadVectors() or metadata.RecordBatch themselves, this approach is the least intrusive.
   3. Setting compression = null in the new RecordBatch
   When reconstructing the metadata, the compression field is explicitly set to null, since the data is already decompressed in memory.
   This decision is somewhat debatable; feedback is welcome on whether it is better to retain the original compression metadata or to reflect the current (uncompressed) state of the buffers. The current implementation assumes the latter. A sketch of the reconstruction follows below.
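   A rough sketch of the relayout from notes 2 and 3 (the BufferRegion shape mirrors the (offset, length) pairs in the metadata; the constructor call in the trailing comment is illustrative, not the exact Arrow JS signature):
   ```ts
   interface BufferRegion { offset: number; length: number; }

   const align8 = (n: number) => (n + 7) & ~7; // round up to the next multiple of 8

   function relayout(buffers: Uint8Array[]): BufferRegion[] {
       let offset = 0;
       return buffers.map((buf) => {
           const region = { offset, length: buf.byteLength };
           offset = align8(offset + buf.byteLength); // keep every buffer 8-byte aligned
           return region;
       });
   }

   // The rebuilt header, with compression nulled out, then replaces the original
   // before _loadVectors() runs, roughly:
   // new metadata.RecordBatch(length, nodes, relayout(buffers), /* compression */ null)
   ```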
   
   ### Are these changes tested?
   - The changes were tested in my own project using LZ4-compressed Arrow streams.
   - Uncompressed, compressed, and pseudo-compressed (uncompressed data length = -1) data were all exercised.
   - No unit tests are included in this PR yet.
   - Decompression was verified with real-world data and the lz4js codec (lz4-wasm is not compatible).
   - No issues were observed with alignment, vector loading, or decompression 
integrity.
   - Exception handling is not yet added around codec.decode. This may be 
useful for catching codec incompatibility and providing better user feedback.
   
   ### Are there any user-facing changes?
   Yes. Arrow JS users can now read compressed IPC streams, provided they register an appropriate codec via compressionRegistry.set().
   
   Example:
   ```ts
   import { Codec, CompressionType, compressionRegistry } from 'apache-arrow';
   import * as lz4 from 'lz4js';

   const lz4Codec: Codec = {
       encode(data: Uint8Array): Uint8Array { return lz4.compress(data); },
       decode(data: Uint8Array): Uint8Array { return lz4.decompress(data); }
   };

   compressionRegistry.set(CompressionType.LZ4_FRAME, lz4Codec);
   ```
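   With the codec registered, reading goes through the existing entry points unchanged; for example (the URL is a placeholder):
   ```ts
   import { tableFromIPC } from 'apache-arrow';

   // An LZ4_FRAME-compressed IPC stream now decodes like a plain one.
   const response = await fetch('/data.arrow');
   const table = tableFromIPC(new Uint8Array(await response.arrayBuffer()));
   ```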
   This change does not affect writing or serialization.
   
   **This PR includes breaking changes to public APIs.**
   No. The change adds functionality but does not modify any existing API 
behavior.
   
   **This PR contains a "Critical Fix".**
   No. This is a new feature, not a critical fix.
   
   ### Checklist
   
   - [x] All tests pass (`yarn test`)
   - [x] Build completes (`yarn build`)
   - [ ] I have added a new test for compressed batches
   - GitHub Issue: #24833

