trxcllnt commented on code in PR #14:
URL: https://github.com/apache/arrow-js/pull/14#discussion_r2100670090
##########
src/ipc/reader.ts:
##########
@@ -369,9 +389,51 @@ abstract class RecordBatchReaderImpl<T extends TypeMap = any> implements RecordB
                 new Vector(data)) :
             new Vector(data)).memoize() as Vector;
     }
-    protected _loadVectors(header: metadata.RecordBatch, body: any, types: (Field | DataType)[]) {
+    protected _loadVectors(header: metadata.RecordBatch, body: Uint8Array, types: (Field | DataType)[]) {
         return new VectorLoader(body, header.nodes, header.buffers, this.dictionaries, this.schema.metadataVersion).visitMany(types);
     }
+
+    private _decompressBuffers(header: metadata.RecordBatch, body: Uint8Array, codec: Codec): { decommpressedBody: Uint8Array; buffers: metadata.BufferRegion[] } {
+        const decompressedBuffers: Uint8Array[] = [];
+        const newBufferRegions: metadata.BufferRegion[] = [];
+
+        let currentOffset = 0;
+        for (const { offset, length } of header.buffers) {
+            if (length === 0) {
+                decompressedBuffers.push(new Uint8Array(0));
+                newBufferRegions.push(new metadata.BufferRegion(currentOffset, 0));
+                continue;
+            }
+            const byteBuf = new flatbuffers.ByteBuffer(body.subarray(offset, offset + length));
+            const uncompressedLenth = bigIntToNumber(byteBuf.readInt64(0));
+
+
+            const bytes = byteBuf.bytes().subarray(LENGTH_OF_PREFIX_DATA);
+
+            const decompressed = (uncompressedLenth === LENGTH_NO_COMPRESSED_DATA)
+                ? bytes
+                : codec.decode!(bytes);
+
+            decompressedBuffers.push(decompressed);
+
+            const padding = (DEFAULT_ALIGNMENT - (currentOffset % DEFAULT_ALIGNMENT)) % DEFAULT_ALIGNMENT;
+            currentOffset += padding;
+            newBufferRegions.push(new metadata.BufferRegion(currentOffset, decompressed.length));
+            currentOffset += decompressed.length;
+        }
+
+        const totalSize = currentOffset;
+        const combined = new Uint8Array(totalSize);
+
+        for (const [i, decompressedBuffer] of decompressedBuffers.entries()) {
+            combined.set(decompressedBuffer, newBufferRegions[i].offset);

Review Comment:
I think that might be more complicated than necessary. IIUC, the new logic loops through all buffers, decompresses them, and collects them into a list. Then it packs all the decompressed buffers into a contiguous ArrayBuffer that matches the equivalent IPC format without compression.

In order to avoid the last step of re-packing into an ArrayBuffer, we'd need to return the list of uncompressed buffers and use a `VectorLoader` instance that accepts the list and selects the buffers by index (vs. the current behavior, which accepts the contiguous ArrayBuffer and slices from it). Luckily, that's exactly what the [`JSONVectorLoader`](https://github.com/apache/arrow-js/blob/ea5593ac8bbf66be114c25f9227082f6dec93e91/src/visitor/vectorloader.ts#L151-L194) does!

I don't think you can use the `JSONVectorLoader` directly, since it assumes the buffers in the list are JSON-encoded representations of the values, but you could implement a new `CompressedVectorLoader` class that closely follows its structure without calling methods like `packBools()` and `binaryDataFromJSON()`.

The logic in your function here would also need to return a list of `BufferRegion` instances whose `offset` field corresponds to the Array index of each decompressed buffer (rather than the `byteOffset` of each buffer in the contiguous ArrayBuffer).
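For illustration, here is a rough sketch of what such a `CompressedVectorLoader` could look like, modeled on `JSONVectorLoader`'s structure. The import paths and the exact protected method signatures (`nextBufferRange()`, `readData()`) are assumptions based on the current `vectorloader.ts` and may need adjusting:

```ts
// Sketch only: assumes VectorLoader's readNullBitmap/readOffsets/readTypeIds all
// delegate to readData(), as in the current vectorloader.ts; import paths are approximate.
import { Vector } from '../vector.js';
import { DataType } from '../type.js';
import { MetadataVersion } from '../enum.js';
import { VectorLoader } from '../visitor/vectorloader.js';
import { FieldNode, BufferRegion } from '../ipc/metadata/message.js';

export class CompressedVectorLoader extends VectorLoader {
    private sources: Uint8Array[];
    constructor(sources: Uint8Array[], nodes: FieldNode[], buffers: BufferRegion[],
        dictionaries: Map<number, Vector<any>>, metadataVersion: MetadataVersion) {
        // The base class slices from a contiguous body, which is never used here,
        // so pass an empty Uint8Array and keep the decompressed buffers as a list.
        super(new Uint8Array(0), nodes, buffers, dictionaries, metadataVersion);
        this.sources = sources;
    }
    // Select the decompressed buffer by its Array index, which the caller has
    // stored in BufferRegion.offset. Unlike JSONVectorLoader's JSON-encoded
    // sources, these bytes are already in IPC layout, so no conversion is needed.
    protected readData<T extends DataType>(_type: T, { offset } = this.nextBufferRange()) {
        return this.sources[offset];
    }
}
```

With a loader along those lines, `_decompressBuffers` could return `decompressedBuffers` together with regions built as `new metadata.BufferRegion(i, decompressed.length)`, and the padding and packing into `combined` would go away entirely.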