leaves12138 commented on PR #7673: URL: https://github.com/apache/paimon/pull/7673#issuecomment-4456787443
Thanks for the update. I reviewed the latest patch (`9e596a1f3337`). The overall direction makes sense: pushing `SlicedSplit` ranges into Lance/Vortex readers avoids reading the full file and then slicing in Python. However, I think there is still a correctness issue before merging. **Blocking issue: native partial reads can produce wrong `_ROW_ID` values.** For Vortex, `FormatVortexReader` now reads only the selected local offsets through `scan(..., indices=...)`. After that, the batch still goes through `DataFileBatchReader`, which assigns `_ROW_ID` as a continuous sequence starting from `file.first_row_id`: ```python arrays[idx] = pa.array(range(self.first_row_id, self.first_row_id + record_batch.num_rows), type=pa.int64()) self.first_row_id += record_batch.num_rows ``` This was correct when Vortex read the full file and `ShardBatchReader` sliced afterwards, because row-id assignment happened before slicing. With the new native pushdown, a shard like `[200, 700)` will read only 500 rows, but `_ROW_ID` will be generated as `[file.first_row_id, file.first_row_id + 500)`, instead of `[file.first_row_id + 200, file.first_row_id + 700)`. So scans that project `_ROW_ID` together with `with_shard` / `with_slice` can return incorrect metadata. The same class of issue can also affect Lance `read_range`, and it is even more subtle for `IndexedSplit` / `take_rows`, because the result rows may be sparse and cannot be represented by a single shifted `first_row_id`. For the sparse case, we probably need either a row-id vector aware path, or fall back to reading/filtering in a way that preserves row-position semantics when `_ROW_ID` is requested or deletion-vector position semantics depend on original file offsets. Could you please add a regression test for this? For example, use a row-tracking-enabled Vortex/Lance table, write one file with many rows, run `with_shard(1, 3)` or `with_slice(200, 700)` with projection including `_ROW_ID`, and assert that the returned `_ROW_ID` values are shifted by the slice start. This test should fail with the current implementation. A smaller follow-up concern: `vortex.array(range(shard_range[0], shard_range[1]))` materializes an index array for a contiguous shard. That may be expensive for large files. If Vortex does not expose a native range/slice scan API, this may be acceptable for now, but it would be good to document the trade-off or add a benchmark/guardrail. I only ran syntax validation locally because this environment does not have the required Python dependencies (`pyarrow`, `pandas`, `lance`, `vortex`) and the Vortex tests require Python 3.11+. `compileall` passed for the modified reader files and `split_read.py`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
