trxcllnt commented on code in PR #14:
URL: https://github.com/apache/arrow-js/pull/14#discussion_r2100670090
##########
src/ipc/reader.ts:
##########
@@ -369,9 +389,51 @@ abstract class RecordBatchReaderImpl<T extends TypeMap = any> implements RecordB
                 new Vector(data)) :
             new Vector(data)).memoize() as Vector;
     }
-    protected _loadVectors(header: metadata.RecordBatch, body: any, types: (Field | DataType)[]) {
+    protected _loadVectors(header: metadata.RecordBatch, body: Uint8Array, types: (Field | DataType)[]) {
         return new VectorLoader(body, header.nodes, header.buffers, this.dictionaries, this.schema.metadataVersion).visitMany(types);
     }
+
+    private _decompressBuffers(header: metadata.RecordBatch, body: Uint8Array, codec: Codec): { decommpressedBody: Uint8Array; buffers: metadata.BufferRegion[] } {
+        const decompressedBuffers: Uint8Array[] = [];
+        const newBufferRegions: metadata.BufferRegion[] = [];
+
+        let currentOffset = 0;
+        for (const { offset, length } of header.buffers) {
+            if (length === 0) {
+                decompressedBuffers.push(new Uint8Array(0));
+                newBufferRegions.push(new metadata.BufferRegion(currentOffset, 0));
+                continue;
+            }
+            const byteBuf = new flatbuffers.ByteBuffer(body.subarray(offset, offset + length));
+            const uncompressedLenth = bigIntToNumber(byteBuf.readInt64(0));
+
+
+            const bytes = byteBuf.bytes().subarray(LENGTH_OF_PREFIX_DATA);
+
+            const decompressed = (uncompressedLenth === LENGTH_NO_COMPRESSED_DATA)
+                ? bytes
+                : codec.decode!(bytes);
+
+            decompressedBuffers.push(decompressed);
+
+            const padding = (DEFAULT_ALIGNMENT - (currentOffset % DEFAULT_ALIGNMENT)) % DEFAULT_ALIGNMENT;
+            currentOffset += padding;
+            newBufferRegions.push(new metadata.BufferRegion(currentOffset, decompressed.length));
+            currentOffset += decompressed.length;
+        }
+
+        const totalSize = currentOffset;
+        const combined = new Uint8Array(totalSize);
+
+        for (const [i, decompressedBuffer] of decompressedBuffers.entries()) {
+            combined.set(decompressedBuffer, newBufferRegions[i].offset);

Review Comment:
I think that might be more complicated than necessary. IIUC, the new logic loops through all buffers, decompresses them, and collects them into a list. Then it packs all the decompressed buffers into a contiguous ArrayBuffer that matches the equivalent IPC format without compression.

In order to avoid the last step of re-packing into an ArrayBuffer, we'd need to return the list of uncompressed buffers and use a `VectorLoader` instance that accepts the list and selects the buffers by index (vs. the current behavior, which accepts the contiguous ArrayBuffer and slices from it). Luckily, that's exactly what the [`JSONVectorLoader`](https://github.com/apache/arrow-js/blob/ea5593ac8bbf66be114c25f9227082f6dec93e91/src/visitor/vectorloader.ts#L151-L194) does!

I don't think you can use the `JSONVectorLoader` directly, since it assumes the buffers in the list are JSON-encoded representations of the values, but you could implement a new `CompressedVectorLoader` class that closely follows its structure without calling methods like `packBools()` and `binaryDataFromJSON()`.

The logic in your function here would also need to return a list of `BufferRegion` instances whose `offset` field corresponds to the Array index of each decompressed buffer (rather than the `byteOffset` of each buffer in the contiguous ArrayBuffer).
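For illustration, here is a rough sketch of what such a `CompressedVectorLoader` could look like, modeled on `JSONVectorLoader`'s structure. The import paths and the exact protected method signatures (`nextBufferRange()`, `readData()`) are assumptions based on the current `vectorloader.ts` and may need adjusting:

```ts
// Sketch only: assumes VectorLoader's readNullBitmap/readOffsets/readTypeIds all
// delegate to readData(), as in the current vectorloader.ts; import paths are approximate.
import { Vector } from '../vector.js';
import { DataType } from '../type.js';
import { MetadataVersion } from '../enum.js';
import { VectorLoader } from '../visitor/vectorloader.js';
import { FieldNode, BufferRegion } from '../ipc/metadata/message.js';

export class CompressedVectorLoader extends VectorLoader {
    private sources: Uint8Array[];
    constructor(sources: Uint8Array[], nodes: FieldNode[], buffers: BufferRegion[],
        dictionaries: Map<number, Vector<any>>, metadataVersion: MetadataVersion) {
        // The base class slices from a contiguous body, which is never used here,
        // so pass an empty Uint8Array and keep the decompressed buffers as a list.
        super(new Uint8Array(0), nodes, buffers, dictionaries, metadataVersion);
        this.sources = sources;
    }
    // Select the decompressed buffer by its Array index, which the caller has
    // stored in BufferRegion.offset. Unlike JSONVectorLoader's JSON-encoded
    // sources, these bytes are already in IPC layout, so no conversion is needed.
    protected readData<T extends DataType>(_type: T, { offset } = this.nextBufferRange()) {
        return this.sources[offset];
    }
}
```

With a loader along those lines, `_decompressBuffers` could return `decompressedBuffers` together with regions built as `new metadata.BufferRegion(i, decompressed.length)`, and the padding and packing into `combined` would go away entirely.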