zanmato1984 opened a new pull request, #49602: URL: https://github.com/apache/arrow/pull/49602
## Rationale for this change

Issue #49392 reports a user-visible corruption when filtering a table that contains a `list<double>` column: slicing the last row returns the expected values, while filtering the same row returns values from a different child span. The corruption only appears once the selected child-value range is large enough, which points to an overflow in the fixed-width gather path used when list filtering materializes the selected child values.

## What changes are included in this PR?

This patch moves fixed-width gather byte-offset scaling onto an explicit `int64_t` helper before the `memcpy` and `memset` address calculations. That fixes the underlying 32-bit byte-offset overflow when a `uint32` gather index is multiplied by the fixed value width.

In the source issue's last-row case, the selected child span starts at `999998000`; for `double` values, scaling that index by 8 bytes wrapped in 32-bit arithmetic and redirected the gather to the wrong child span. Keeping the byte-offset arithmetic in 64 bits makes the fixed-width gather path, the child `Take()` call used under list filtering, and the final filtered `Table` all address the correct bytes.

To validate that this is the same bug reported in the issue, I also used a local near-e2e C++ reproduction that keeps the issue's logical shape (`N=500000`, `ARRAY_LEN=2000`, an `id` column, and a `numbers: list<double>` column), filters the last row, and seeds both the true target child span and the pre-fix wrapped span with distinct sentinels. In that setup, `Slice()` returns the expected row, a replay of the pre-fix gather arithmetic returns the wrapped sentinel span instead, and the fixed child `Take()` and table `Filter()` results both match the sliced row. That ties the user-visible issue and this root-cause fix back to the same overflow boundary.

## Are these changes tested?

Yes. The patch adds a targeted unit test that checks fixed-width gather byte offsets are computed with 64-bit arithmetic.
This is intentionally smaller than an end-to-end filter regression: the original user-visible failure only shows up at very large logical offsets, while the new unit test isolates the exact overflow boundary directly without constructing a huge table or depending on a PyArrow-level reproduction. That makes it smaller, more stable, and more appropriate for regular C++ CI, while the near-e2e local reproduction was used separately to confirm that this root-cause regression test and the reported filtering corruption are exercising the same bug.

## Are there any user-facing changes?

Yes. Filtering tables with large list columns backed by fixed-width child values no longer risks returning data from a wrapped byte offset.

* GitHub Issue: #49392
