Karakatiza666 commented on PR #438:
URL: https://github.com/apache/arrow-js/pull/438#issuecomment-4518125312

   @trxcllnt I appreciate you raising this as I take the spec compliance point 
seriously, but I want to lay out why I think the full 64-bit indexing is a much 
larger conversation that goes way beyond the scope of this PR, and is not a 
prerequisite for landing it.
   
   TLDR: The `Number` precision is not the bottleneck for what can be 
represented using the Large* types, the current project architecture is.
   
   Based on my analysis it seems to me you underestimate the amount of combined 
design, discussion and implementation effort needed to implement full 64-bit 
indexing.
   The constraint isn't the offset arithmetic - `BigInt64Array` already stores 
offsets at full 64-bit precision in this PR, and the helpers around bigint are 
indeed straightforward.
   
   The constraint is on the child elements buffer - the contiguous buffer 
holding the actual list elements that the offsets point into. All rows of a 
list-like column in one Arrow batch share a single such buffer, and that buffer 
is a JavaScript `ArrayBuffer`, which every major engine caps at roughly 2^32 
bytes. Since offset values are positions into this buffer, the maximum offset 
value is bounded by the buffer's element count (~2^32 / sizeof(element)) - many 
orders of magnitude below 2^53, let alone 2^63. So the int64 offset width is 
headroom the JS runtime physically can't fill in a single contiguous child 
buffer.
   
   To fully support `LargeList` rows beyond that cap, the child can't be a 
single `Data`'s elements buffer - it has to be chunked across multiple `Data` 
objects behind a `Vector`-style rope. That's a change to a load-bearing 
invariant (`Data.children[0]` is a single contiguous `Data`) that the entire 
visitor framework relies on, and it would also need to flow into IPC read/write 
semantics, since the wire writer today assumes one contiguous child per batch.
   
   The same reasoning applies to `LargeUtf8` and `LargeBinary`: their backing 
`values` buffer is a `Uint8Array`, which has the same per-buffer ceiling. So 
the 53-bit narrowing isn't a quirk introduced by this PR - it's the existing 
project-wide policy for every Large* type, and matching it is what I meant by 
"feature parity with List".
   
   At the same time the same buffer-size constraint also bounds plain `List`. 
The spec allows `List` up to 2^31 child elements, but JS's per-`ArrayBuffer` 
cap means a `List<Float64>` tops out around 2^29 elements - a quarter of the 
spec ceiling. So the property "JS implementation matches the spec's offset 
range" isn't currently true for `List` either, and bringing it up to that bar 
means the same chunked-children redesign, applied across 
List/LargeList/LargeUtf8/LargeBinary together. That's the design discussion I'd 
want to open as a separate issue rather than fold it in a wiring PR.
   
   Unfortunately, designing and shepherding through a chunked-children rework - 
across all four types, across the visitor framework, the Builder layer, and the 
IPC reader/writer - is beyond what I can take on, neither as a standalone PR, 
and especially not as a part of this PR. I'm comfortable leaving this PR open 
and rebasing it once the community settles on a design for compliant >2^32-byte 
child storage that applies uniformly to the Large* family.
   
   Otherwise, I maintain that the current PR stands on its own:
   - it's fully featured;
   - it introduces no new constraints or regressions and follows the existing 
implementation and API surface conventions;
   - is valid and very useful to the community to close the feature gap despite 
not being fully compliant on the maximum element count - today the library 
throws when it encounters a `LargeList`.
   
   P.S. conforming producers should rebase the offsets in a batch, so in 
practice as long as the batch spans no more elements that can fit in the child 
buffer the current 2^53 arithmetic is unused headroom.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to