leaves12138 commented on PR #7673: URL: https://github.com/apache/paimon/pull/7673#issuecomment-4508990995
Thanks for the update. I re-checked the latest patch (06650b515329). The no-predicate cases look much better now: native Lance shard/range pushdown and IndexedSplit sparse _ROW_ID assignment are covered by the new tests, and I was able to run the Lance affected tests locally after installing pylance. However, I think there is still a blocking correctness/runtime issue when native row slicing is combined with a file-level predicate and _ROW_ID projection. The current flow is: - split_read.py passes row_indices / shard_range into the Lance/Vortex format readers. - FormatLanceReader applies read_range() / take_rows() first, and then applies the Arrow predicate on the sliced in-memory table. - FormatVortexReader passes both expr and indices into scan(). - DataFileBatchReader then assigns _ROW_ID by assuming the returned rows are still the original contiguous slice, or the prefix of row_id_offsets. That assumption no longer holds after predicate filtering. For example, with a Lance append-only file containing ids 0..999 and row tracking enabled, reading with projection [id, _ROW_ID] and predicate id >= 500 currently returns rows like (id=500, _ROW_ID=0), (id=501, _ROW_ID=1), ... instead of preserving the original row ids. When this is combined with with_shard(), another failure mode appears: if a shard is fully filtered out by the predicate, DataFileBatchReader._assign_row_tracking() builds an empty table and then calls table.to_batches()[0], which raises IndexError. So this is not only a missing test case; it can return wrong _ROW_ID values or fail at runtime for valid scans. Could you please fix this before merge? Possible directions: 1. If native row slicing and predicate filtering are both enabled and _ROW_ID is requested, avoid pushing the predicate into the format reader; assign _ROW_ID using the original row positions first, then apply the predicate after row tracking is available. 2. Or make the format reader return the selected local row offsets after predicate filtering, and let DataFileBatchReader assign _ROW_ID from that filtered offset vector. 3. Also make _assign_row_tracking() handle zero-row batches safely instead of indexing table.to_batches()[0]. Please also add regression tests for Lance/Vortex covering projection [id, _ROW_ID] + predicate + with_shard()/with_slice(), including both a non-empty filtered shard and an empty filtered shard. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
