leaves12138 commented on PR #7673:
URL: https://github.com/apache/paimon/pull/7673#issuecomment-4456787443

   Thanks for the update. I reviewed the latest patch (`9e596a1f3337`). The 
overall direction makes sense: pushing `SlicedSplit` ranges into Lance/Vortex 
readers avoids reading the full file and then slicing in Python. However, I 
think there is still a correctness issue before merging.
   
   **Blocking issue: native partial reads can produce wrong `_ROW_ID` values.**
   
   For Vortex, `FormatVortexReader` now reads only the selected local offsets 
through `scan(..., indices=...)`. After that, the batch still goes through 
`DataFileBatchReader`, which assigns `_ROW_ID` as a continuous sequence 
starting from `file.first_row_id`:
   
   ```python
   arrays[idx] = pa.array(range(self.first_row_id, self.first_row_id + 
record_batch.num_rows), type=pa.int64())
   self.first_row_id += record_batch.num_rows
   ```
   
   This was correct when Vortex read the full file and `ShardBatchReader` 
sliced afterwards, because row-id assignment happened before slicing. With the 
new native pushdown, a shard like `[200, 700)` will read only 500 rows, but 
`_ROW_ID` will be generated as `[file.first_row_id, file.first_row_id + 500)`, 
instead of `[file.first_row_id + 200, file.first_row_id + 700)`. So scans that 
project `_ROW_ID` together with `with_shard` / `with_slice` can return 
incorrect metadata.
   
   The same class of issue can also affect Lance `read_range`, and it is even 
more subtle for `IndexedSplit` / `take_rows`, because the result rows may be 
sparse and cannot be represented by a single shifted `first_row_id`. For the 
sparse case, we probably need either a row-id vector aware path, or fall back 
to reading/filtering in a way that preserves row-position semantics when 
`_ROW_ID` is requested or deletion-vector position semantics depend on original 
file offsets.
   
   Could you please add a regression test for this? For example, use a 
row-tracking-enabled Vortex/Lance table, write one file with many rows, run 
`with_shard(1, 3)` or `with_slice(200, 700)` with projection including 
`_ROW_ID`, and assert that the returned `_ROW_ID` values are shifted by the 
slice start. This test should fail with the current implementation.
   
   A smaller follow-up concern: `vortex.array(range(shard_range[0], 
shard_range[1]))` materializes an index array for a contiguous shard. That may 
be expensive for large files. If Vortex does not expose a native range/slice 
scan API, this may be acceptable for now, but it would be good to document the 
trade-off or add a benchmark/guardrail.
   
   I only ran syntax validation locally because this environment does not have 
the required Python dependencies (`pyarrow`, `pandas`, `lance`, `vortex`) and 
the Vortex tests require Python 3.11+. `compileall` passed for the modified 
reader files and `split_read.py`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to