leaves12138 commented on PR #7673:
URL: https://github.com/apache/paimon/pull/7673#issuecomment-4508990995

   Thanks for the update. I re-checked the latest patch (06650b515329). The 
no-predicate cases look much better now: native Lance shard/range pushdown and 
IndexedSplit sparse _ROW_ID assignment are covered by the new tests, and I was 
able to run the Lance affected tests locally after installing pylance.
   
   However, I think there is still a blocking correctness/runtime issue when 
native row slicing is combined with a file-level predicate and _ROW_ID 
projection.
   
   The current flow is:
   
   - split_read.py passes row_indices / shard_range into the Lance/Vortex 
format readers.
   - FormatLanceReader applies read_range() / take_rows() first, and then 
applies the Arrow predicate on the sliced in-memory table.
   - FormatVortexReader passes both expr and indices into scan().
   - DataFileBatchReader then assigns _ROW_ID by assuming the returned rows are 
still the original contiguous slice, or the prefix of row_id_offsets.
   
   That assumption no longer holds after predicate filtering. For example, with 
a Lance append-only file containing ids 0..999 and row tracking enabled, 
reading with projection [id, _ROW_ID] and predicate id >= 500 currently returns 
rows like (id=500, _ROW_ID=0), (id=501, _ROW_ID=1), ... instead of preserving 
the original row ids. When this is combined with with_shard(), another failure 
mode appears: if a shard is fully filtered out by the predicate, 
DataFileBatchReader._assign_row_tracking() builds an empty table and then calls 
table.to_batches()[0], which raises IndexError.
   
   So this is not only a missing test case; it can return wrong _ROW_ID values 
or fail at runtime for valid scans.
   
   Could you please fix this before merge? Possible directions:
   
   1. If native row slicing and predicate filtering are both enabled and 
_ROW_ID is requested, avoid pushing the predicate into the format reader; 
assign _ROW_ID using the original row positions first, then apply the predicate 
after row tracking is available.
   2. Or make the format reader return the selected local row offsets after 
predicate filtering, and let DataFileBatchReader assign _ROW_ID from that 
filtered offset vector.
   3. Also make _assign_row_tracking() handle zero-row batches safely instead 
of indexing table.to_batches()[0].
   
   Please also add regression tests for Lance/Vortex covering projection [id, 
_ROW_ID] + predicate + with_shard()/with_slice(), including both a non-empty 
filtered shard and an empty filtered shard.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to