jding-xyz opened a new pull request, #152:
URL: https://github.com/apache/arrow-swift/pull/152

   ## Problem
   
   `VariableBufferBuilder<T>` (used by both `StringArrayBuilder` and 
`BinaryArrayBuilder`) has several correctness issues that can cause memory 
corruption and produce Arrow-invalid output:
   
   1. **First-value buffer overflow.** The `append()` method only checks values 
buffer capacity inside `if index > 0`, so a first non-nil value larger than the 
initial 64-byte allocation overflows it. This corrupts memory silently.
   
   2. **Null slots write phantom bytes.** Null entries create a 4-byte zero 
placeholder and copy it into the values buffer. Per the Arrow columnar format, 
a null variable-width slot should repeat the previous offset and append no 
value bytes at all.
   
   3. **Offsets buffer is sized for N entries instead of N+1.** The Arrow spec 
requires N+1 offsets for N rows. The old `finish()` creates an offsets buffer 
with `length` entries, and the reader similarly expects `length` offsets, 
leading to an off-by-one in the offset buffer.
   
   4. **Logical buffer lengths are never updated.** `values.length` and 
`offsets.length` stay at their initial values (0) after construction. 
`finish()` then uses stale lengths — in particular, the values buffer is 
created with `length = 0` regardless of actual content.
   
   5. **Resize checks the wrong metric.** `value_resize()` compares against 
`values.length` (stale logical length) instead of `values.capacity` (actual 
allocation).
   
   These issues were discovered while using `StringArrayBuilder` in a 
production IPC-based data pipeline, where the first string value in a column 
was typically longer than 64 bytes, causing silent memory corruption during 
batch construction.
   
   ## Fix
   
   Rewrote `VariableBufferBuilder<T>.append(_:)` around explicit Arrow 
invariants rather than patching individual branches:
   
   - Read `currentOffset` from the offsets buffer uniformly for all indices 
(including row 0), instead of special-casing the first row.
   - For non-null values: compute `nextOffset = currentOffset + byteCount`, 
ensure values buffer capacity, copy bytes.
   - For null values: keep `nextOffset = currentOffset`, do not write any bytes.
   - Store the next offset, then update `self.length`, `nulls.length`, 
`offsets.length`, and `values.length` consistently after every append.
   - Initialize the offsets buffer with a single `Int32(0)` entry at 
construction (the required first offset).
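   The invariants above can be sketched with plain Swift arrays (a simplified 
model for illustration; the real builder operates on raw Arrow buffers, and 
the names here are not the library's API):

```swift
// Simplified model of the rewritten append: the offsets buffer starts
// with the required Int32(0), every append reads the current offset
// uniformly (row 0 included), and a nil value repeats that offset
// without writing any value bytes.
struct VarBufferModel {
    var offsets: [Int32] = [0]    // required first offset at construction
    var values: [UInt8] = []
    var validity: [Bool] = []

    mutating func append(_ value: [UInt8]?) {
        let current = offsets[offsets.count - 1]   // uniform for all rows
        if let bytes = value {
            values.append(contentsOf: bytes)       // copy after ensuring room
            offsets.append(current + Int32(bytes.count))
            validity.append(true)
        } else {
            offsets.append(current)                // null: offset repeats
            validity.append(false)
        }
    }
}

var b = VarBufferModel()
b.append(Array("a".utf8))
b.append(nil)
b.append(Array("bbb".utf8))
// b.offsets == [0, 1, 1, 4], b.values.count == 4
```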
   
   Updated the corresponding resize and finish methods:
   - `value_resize()` now checks `capacity` instead of `length`.
   - `resize()` decouples null bitmap and offset buffer growth, checks against 
`capacity`, and preserves logical lengths across reallocation.
   - `finish()` creates the offsets buffer with `length + 1` entries.
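   The corrected resize check amounts to comparing the needed byte count 
against the actual allocation rather than the stale logical length; a 
doubling-growth helper might look like this (an illustrative sketch, not the 
library's `value_resize()` signature):

```swift
// Grow a 64-byte-minimum allocation by doubling until `needed` fits.
// The comparison is against capacity (the real allocation size), not
// against the logical length, which can lag behind actual content.
func grownCapacity(capacity: Int, needed: Int) -> Int {
    var cap = max(capacity, 64)    // initial 64-byte allocation
    while cap < needed { cap *= 2 }
    return cap
}
// grownCapacity(capacity: 64, needed: 256) == 256
```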
   
   Propagated the N+1 offset convention to the reader and C data importer:
   - `ArrowReader`: offset buffer length is now `node.length + 1`; values 
buffer length is derived from the last offset entry.
   - `ArrowCImporter`: offset buffer length changed from `length` to `length + 
1`.
   - `ArrowData` convenience init: for variable-width types, derives array 
length as `offsets.length - 1`.
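   The N+1 convention makes both derived lengths mechanical; the arithmetic 
can be sketched as follows (the function name is illustrative, not the 
`ArrowReader` API):

```swift
// For N rows there are N+1 offsets: the row count is offsets.count - 1
// and the values buffer length is the final offset entry.
func derivedLengths(offsets: [Int32]) -> (rows: Int, valueBytes: Int32) {
    precondition(!offsets.isEmpty, "offsets must hold at least Int32(0)")
    return (offsets.count - 1, offsets[offsets.count - 1])
}
// derivedLengths(offsets: [0, 256, 256, 260]) == (3, 260)
```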
   
   ## Testing
   
   Added targeted regression tests with raw-buffer assertions (not just 
same-library round-trip checks):
   
   - **Long first value**: 256-byte first string, verifies offsets = `[0, 256]` 
and values buffer length = 256.
   - **Null does not advance offsets**: `["a", nil, "bbb"]` verifies offsets = 
`[0, 1, 1, 4]` and values buffer length = 4.
   - **Null first, then long value**: `[nil, 512-byte string]` verifies offsets 
= `[0, 0, 512]`.
   - **Binary parity**: same null-handling test for `BinaryArrayBuilder`.
   - **IPC round-trip**: `[256-byte string, nil, "tail"]` through 
`writeStreaming` / `readStreaming`, verifying schema, row count, decoded 
values, and null handling.
   
   All existing tests continue to pass. The 5 pre-existing failures on `main` 
(missing generated test data files) are unrelated.
   
   ## AI disclosure
   
   This fix was developed with AI assistance (Cursor with Claude). The bug was 
identified through real-world usage in a project that uses Arrow IPC for data 
serialization — string columns with longer first values caused memory 
corruption during batch construction. The root cause was then traced by manual 
code inspection of `VariableBufferBuilder.append()` against the Arrow columnar 
format spec, and documented in a detailed plan before any code was written. All 
generated code was reviewed line-by-line, traced through concrete examples, and 
validated against `swift test`. The approach — writing regression tests first, 
then fixing the implementation — was deliberate to ensure the tests could catch 
regressions independently of the fix.
   
   Made with [Cursor](https://cursor.com)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
