Karakatiza666 commented on PR #438: URL: https://github.com/apache/arrow-js/pull/438#issuecomment-4518125312
@trxcllnt I appreciate you raising this as I take the spec compliance point seriously, but I want to lay out why I think the full 64-bit indexing is a much larger conversation that goes way beyond the scope of this PR, and is not a prerequisite for landing it. TLDR: The `Number` precision is not the bottleneck for what can be represented using the Large* types, the current project architecture is. Based on my analysis it seems to me you underestimate the amount of combined design, discussion and implementation effort needed to implement full 64-bit indexing. The constraint isn't the offset arithmetic - `BigInt64Array` already stores offsets at full 64-bit precision in this PR, and the helpers around bigint are indeed straightforward. The constraint is on the child elements buffer - the contiguous buffer holding the actual list elements that the offsets point into. All rows of a list-like column in one Arrow batch share a single such buffer, and that buffer is a JavaScript `ArrayBuffer`, which every major engine caps at roughly 2^32 bytes. Since offset values are positions into this buffer, the maximum offset value is bounded by the buffer's element count (~2^32 / sizeof(element)) - many orders of magnitude below 2^53, let alone 2^63. So the int64 offset width is headroom the JS runtime physically can't fill in a single contiguous child buffer. To fully support `LargeList` rows beyond that cap, the child can't be a single `Data`'s elements buffer - it has to be chunked across multiple `Data` objects behind a `Vector`-style rope. That's a change to a load-bearing invariant (`Data.children[0]` is a single contiguous `Data`) that the entire visitor framework relies on, and it would also need to flow into IPC read/write semantics, since the wire writer today assumes one contiguous child per batch. The same reasoning applies to `LargeUtf8` and `LargeBinary`: their backing `values` buffer is a `Uint8Array`, which has the same per-buffer ceiling. So the 53-bit narrowing isn't a quirk introduced by this PR - it's the existing project-wide policy for every Large* type, and matching it is what I meant by "feature parity with List". At the same time the same buffer-size constraint also bounds plain `List`. The spec allows `List` up to 2^31 child elements, but JS's per-`ArrayBuffer` cap means a `List<Float64>` tops out around 2^29 elements - a quarter of the spec ceiling. So the property "JS implementation matches the spec's offset range" isn't currently true for `List` either, and bringing it up to that bar means the same chunked-children redesign, applied across List/LargeList/LargeUtf8/LargeBinary together. That's the design discussion I'd want to open as a separate issue rather than fold it in a wiring PR. Unfortunately, designing and shepherding through a chunked-children rework - across all four types, across the visitor framework, the Builder layer, and the IPC reader/writer - is beyond what I can take on, neither as a standalone PR, and especially not as a part of this PR. I'm comfortable leaving this PR open and rebasing it once the community settles on a design for compliant >2^32-byte child storage that applies uniformly to the Large* family. Otherwise, I maintain that the current PR stands on its own: - it's fully featured; - it introduces no new constraints or regressions and follows the existing implementation and API surface conventions; - is valid and very useful to the community to close the feature gap despite not being fully compliant on the maximum element count - today the library throws when it encounters a `LargeList`. P.S. conforming producers should rebase the offsets in a batch, so in practice as long as the batch spans no more elements that can fit in the child buffer the current 2^53 arithmetic is unused headroom. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
