zanmato1984 opened a new pull request, #49602:
URL: https://github.com/apache/arrow/pull/49602

   ## Rationale for this change
   
   Issue #49392 reports user-visible data corruption when filtering a table that 
contains a `list<double>` column: slicing the last row returns the expected 
values, while filtering for the same row returns values from a different child 
span. The corruption only appears once the selected child values sit at a large 
enough offset in the child array, which points to an overflow in the fixed-width 
gather path used when list filtering materializes the selected child values.
   
   ## What changes are included in this PR?
   
   This patch routes fixed-width gather byte-offset scaling through an explicit 
`int64_t` helper before the `memcpy` and `memset` address calculations. That 
fixes the underlying 32-bit byte-offset overflow that occurs when a `uint32` 
gather index is multiplied by the fixed value width. In the source issue's 
last-row case, the selected child span starts at index `999998000`; for `double` 
values, scaling that index by 8 bytes gives a product above the 32-bit range, so 
the pre-fix arithmetic wrapped and redirected the gather to the wrong child span. 
Keeping the byte-offset arithmetic in 64 bits makes the fixed-width gather path, 
the child `Take()` call used under list filtering, and the final filtered `Table` 
all address the correct bytes.
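
   For reviewers who want to check the arithmetic, here is a minimal standalone 
sketch of the wrap described above. The helper names are hypothetical and are not 
the names used in the patch, and the exact operand types of the pre-fix expression 
are an assumption; only the index `999998000` and the 8-byte width come from the 
issue.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not the names used in the patch.

// Assumed pre-fix shape: a uint32_t * uint32_t product is evaluated in
// 32 bits and wraps modulo 2^32.
constexpr uint32_t WrappedByteOffset(uint32_t index, uint32_t byte_width) {
  return index * byte_width;
}

// Post-fix shape: widen the index to 64 bits before scaling.
constexpr int64_t ByteOffset64(uint32_t index, int64_t byte_width) {
  return static_cast<int64_t>(index) * byte_width;
}

// Start of the last row's child span in the issue: 499999 * 2000.
constexpr uint32_t kLastRowStart = 999998000u;

constexpr int64_t kCorrectBytes = ByteOffset64(kLastRowStart, 8);
constexpr uint32_t kWrappedBytes = WrappedByteOffset(kLastRowStart, 8);

static_assert(kCorrectBytes == 7999984000LL, "64-bit offset is exact");
static_assert(kWrappedBytes == 3705016704u, "32-bit offset wraps modulo 2^32");
// The wrapped byte offset corresponds to a much earlier child value.
static_assert(kWrappedBytes / 8 == 463127088u, "wrapped child value index");
```

   The wrapped offset still lands inside the child buffer, which is consistent 
with the issue showing wrong values rather than a crash.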
   
   To validate that this is the same bug reported in the issue, I also used a 
local near-e2e C++ reproduction that keeps the issue's logical shape 
(`N=500000`, `ARRAY_LEN=2000`, an `id` column, and a `numbers: list<double>` 
column), filters the last row, and seeds both the true target child span and 
the pre-fix wrapped span with distinct sentinels. In that setup, `Slice()` 
returns the expected row, a replay of the pre-fix gather arithmetic returns the 
wrapped sentinel span instead, and the fixed child `Take()` and table 
`Filter()` results both match the sliced row. That ties the user-visible issue 
and this root-cause fix back to the same overflow boundary.
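
   The local reproduction itself is not included here. As a rough sketch of how 
the two sentinel spans can be located for the issue's shape, the following 
computes where the target span and the wrapped span fall; the struct and function 
names are hypothetical, and the wrap is modeled as reduction modulo 2^32.

```cpp
#include <cstdint>

// Hypothetical names for illustration; the shape constants come from the issue.
struct ChildSpan {
  int64_t first_value;    // index of the first child value in the span
  int64_t row;            // list row that contains that value
  int64_t offset_in_row;  // position of that value inside the row
};

constexpr int64_t kRows = 500000;
constexpr int64_t kArrayLen = 2000;
constexpr int64_t kByteWidth = 8;  // sizeof(double)

constexpr ChildSpan Locate(int64_t first_value) {
  return ChildSpan{first_value, first_value / kArrayLen,
                   first_value % kArrayLen};
}

// True start of the last row's child values.
constexpr int64_t kTargetStart = (kRows - 1) * kArrayLen;

// Where 32-bit byte-offset arithmetic would land instead.
constexpr int64_t kWrappedStart =
    ((kTargetStart * kByteWidth) % (int64_t{1} << 32)) / kByteWidth;

constexpr ChildSpan kTarget = Locate(kTargetStart);
constexpr ChildSpan kWrappedSpan = Locate(kWrappedStart);

static_assert(kTarget.first_value == 999998000, "target span start");
static_assert(kTarget.row == 499999 && kTarget.offset_in_row == 0, "last row");
static_assert(kWrappedSpan.first_value == 463127088, "wrapped span start");
static_assert(kWrappedSpan.row == 231563 && kWrappedSpan.offset_in_row == 1088,
              "wrapped span starts mid-row");
```

   Under these assumptions the wrapped span starts partway through an unrelated 
row and runs into the next one, so sentinels placed there are easy to tell apart 
from the target row's values.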
   
   ## Are these changes tested?
   
   Yes. The patch adds a targeted unit test that checks that fixed-width gather 
byte offsets are computed with 64-bit arithmetic. This is intentionally narrower 
than an end-to-end filter regression test: the original user-visible failure only 
shows up at very large logical offsets, while the new unit test exercises the 
exact overflow boundary directly, without constructing a huge table or depending 
on a PyArrow-level reproduction. That keeps it small, stable, and suitable for 
regular C++ CI. The near-e2e local reproduction was used separately to confirm 
that this root-cause regression test and the reported filtering corruption 
exercise the same bug.
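
   To illustrate why no large allocation is needed, here is a sketch of a 
boundary check in that spirit. It is not the test added by the patch; the helper 
names are hypothetical, and it only exercises the offset arithmetic around the 
first index whose byte offset exceeds 32 bits.

```cpp
#include <cstdint>

// Hypothetical stand-in for the patch's 64-bit scaling step.
constexpr int64_t ScaleToBytes(uint32_t index, int64_t byte_width) {
  return static_cast<int64_t>(index) * byte_width;
}

// Smallest gather index whose byte offset no longer fits in 32 bits.
// Assumes byte_width >= 2; for byte_width == 1 no uint32_t index overflows.
constexpr uint32_t FirstOverflowingIndex(uint32_t byte_width) {
  return static_cast<uint32_t>(
      ((int64_t{1} << 32) + byte_width - 1) / byte_width);
}

constexpr uint32_t kBoundary = FirstOverflowingIndex(8);

static_assert(kBoundary == 536870912u, "2^32 / 8");
// One below the boundary still fits in 32 bits.
static_assert(ScaleToBytes(kBoundary - 1, 8) == 4294967288LL, "fits");
// At the boundary the 64-bit result is exact...
static_assert(ScaleToBytes(kBoundary, 8) == 4294967296LL, "exact in 64 bits");
// ...while a 32-bit product of the same operands wraps to zero.
static_assert(kBoundary * 8u == 0u, "wraps in 32 bits");
```

   A check like this touches no data buffers at all, which is what keeps it cheap 
enough for regular CI.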
   
   ## Are there any user-facing changes?
   
   Yes. Filtering tables with large list columns backed by fixed-width child 
values no longer risks returning data from a wrapped byte offset.
   
   * GitHub Issue: #49392

