Karakatiza666 opened a new pull request, #439:
URL: https://github.com/apache/arrow-js/pull/439

   This PR was co-authored with [Claude Code](https://claude.com/claude-code).
   
   ---
   
   ## Fix incorrect 64-bit JSON IPC output for tables loaded from binary Arrow
   
   ### The bug
   When the JSON IPC writer serializes a column whose values are stored as 
64-bit words (offsets or data), it reads the buffer's underlying ArrayBuffer 
directly without honoring the typed-array's byteOffset / byteLength. If the 
column happens to be a view into a larger shared buffer rather than a 
freshly-allocated array, the writer reads from the wrong window of memory and 
emits garbage for the affected DATA / OFFSET strings.
   
   ### When it actually triggers
   Only on the flow "read data in binary Arrow IPC, write data in JSON IPC." 
That's the only public path that produces vectors backed by offset views — the 
binary IPC reader carves every buffer out of a single shared byte array, so 
anything loaded from a file, stream, or another Arrow implementation arrives at 
the JSON writer in exactly the at-risk shape.
   
   Tables built directly in JS memory (e.g. via the existing JSON round-trip 
tests, or tableFromArrays) are not affected, which is why the bug must have 
gone unnoticed: the RecordBatchJSONWriter test suite synthesizes its inputs in 
memory and never sees the offset-view state. The flow is mostly used in 
cross-language integration testing, not by typical end users — which may be the 
second reason it lingered.
   
   ### What's affected
   Any column whose values are 64-bit words processed through the writer's 
bignum helper:
   
   - Int64, Uint64
   - DateMillisecond
   - All four Timestamp units
   - TimeMicrosecond, TimeNanosecond
   - All four Duration units
   - Decimal128
   - LargeUtf8 and LargeBinary (their OFFSET arrays)
   
   Fixed types unaffected: 32-bit ints/floats, DateDay, TimeSecond/Millisecond, 
regular Utf8/Binary, anything that doesn't go through the 64-bit serialization 
path.
   
   ### Test coverage
   
   A new `test/unit/ipc/writer/json-offset-views-tests.ts` adds 16 focused unit 
tests — one per type whose JSON serialization goes through the buggy 64-bit 
helper. Each test takes a freshly-generated vector from the shared 
generate-test-data helpers, rewraps every backing typed array at a nonzero 
byteOffset, and asserts the JSON writer produces the same output for the 
rewrapped vector as for the original. The list of covered types is a flat table 
at the top of the file; **adding a future affected type is a one-line 
addition**.
   
   The existing `RecordBatchJSONWriter` suite in 
`test/unit/ipc/writer/json-writer-tests.ts` was also extended: it now routes 
its generated source tables through binary IPC before the JSON round-trip, so 
the JSON writer sees the offset-view typed arrays that real-world inputs always 
carry. This widens end-to-end coverage to every type the random/dictionary 
table generators produce, on the user-facing API path.
   
   The two layers are complementary:
   
   The focused unit tests pinpoint exactly which type fails on regression and 
stay sound even if the binary IPC reader's memory layout ever changes — they 
construct offset views directly, without depending on `tableFromIPC`.
   The extended round-trip suite catches the same regression on a much broader 
type matrix through the actual public flow a user would use.
   With the fix reverted, all 16 focused tests fail and over 100 cases in the 
extended round-trip suite fail. With the fix applied, all 7000+ tests across 46 
suites pass.
   
   The new test file is not strictly necessary right now, but it makes catching 
a regression more robust.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to