JingsongLi commented on PR #8064:
URL: https://github.com/apache/paimon/pull/8064#issuecomment-4610148175

   I rechecked the current head (`b887962e`). The API order and the 
persistent-worker `set_epoch()` concern look addressed now, but the DE 
chunk-shuffle row-id issue from the earlier review is still reproducible.
   
   `DataEvolutionChunkShuffleSplitGenerator._split_by_row_id_with_range()` 
still filters out files whose `row_id_range()` is `None` and returns no chunks 
when all files are missing `first_row_id`. Since `data-evolution.enabled` and 
`row-tracking.enabled` are independent table options, a table can have 
`data-evolution.enabled=true` without row tracking. In that case the normal DE 
scan fails during planning, but the chunk-shuffle path silently returns an 
empty plan and reading it yields 0 rows.
   
   Minimal shape I verified locally:
   
   ```python
   schema = Schema.from_pyarrow_schema(
       pa.schema([('id', pa.int32()), ('v', pa.string())]),
       options={'data-evolution.enabled': 'true'},
   )
   # write 3 rows...
   rb = table.new_read_builder()
   rb.new_scan().plan()  # AttributeError: 'NoneType' object has no attribute 
'from_'
   rb.new_scan().with_chunk_shuffle(seed=1, chunk_size=2).plan().splits()  # []
   rb.new_read().to_arrow([]).num_rows  # 0
   ```
   
   Could we make the chunk-shuffle path fail fast with a clear error when 
`self.data_evolution` is true but row tracking / `first_row_id` is unavailable, 
or otherwise fall back to the regular DE generator behavior? Silently returning 
an empty plan is dangerous because it looks like a valid empty table.
   
   I also reran:
   
   ```bash
   PYTHONPATH=. python -m pytest -q 
pypaimon/tests/scanner/chunk_shuffle_split_generator_test.py
   # 36 passed
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to