jordepic opened a new issue, #2619:
URL: https://github.com/apache/iceberg-rust/issues/2619

   ### Apache Iceberg Rust version
   
   None
   
   ### Describe the bug
   
   filter_row_groups_by_byte_range selects a row group for every scan split 
whose byte range overlaps it. When a row group is larger than the split size, 
every split covering it selects that row group and reads its full contents. 
This happens, for example, with tables written with a large 
write.parquet.row-group-size-bytes (1 GB), where a file is effectively a single 
row group spanning all of its splits. The result is the same rows returned once 
per split: an N× over-read, where N is the number of splits covering the file. 
This is a correctness and data-duplication bug: a ~1.6M-row partition read back 
~83M rows, and writing the result back corrupts the table.
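
The over-read described above can be illustrated with a small standalone sketch. It is not the iceberg-rust implementation: `RowGroup`, `plan_splits`, `select_by_overlap`, and `reads_per_row_group` are hypothetical names introduced here, and the byte ranges are assumed to be half-open `[start, end)` intervals.

```rust
/// Hypothetical stand-in for a Parquet row group's byte extent in a file.
#[derive(Debug, Clone, Copy)]
pub struct RowGroup {
    pub start: u64,
    pub len: u64,
}

/// Cut a file of `file_len` bytes into half-open `[start, end)` splits.
pub fn plan_splits(file_len: u64, split_size: u64) -> Vec<(u64, u64)> {
    assert!(split_size > 0, "split_size must be positive");
    let mut splits = Vec::new();
    let mut start = 0;
    while start < file_len {
        let end = (start + split_size).min(file_len);
        splits.push((start, end));
        start = end;
    }
    splits
}

/// Overlap-based selection (assumed model of the buggy behaviour):
/// a row group is picked by every split whose range touches it.
pub fn select_by_overlap(row_groups: &[RowGroup], split: (u64, u64)) -> Vec<usize> {
    let (split_start, split_end) = split;
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| rg.start < split_end && split_start < rg.start + rg.len)
        .map(|(idx, _)| idx)
        .collect()
}

/// Count how many times each row group would be read across all splits.
pub fn reads_per_row_group(row_groups: &[RowGroup], splits: &[(u64, u64)]) -> Vec<usize> {
    let mut reads = vec![0; row_groups.len()];
    for &split in splits {
        for idx in select_by_overlap(row_groups, split) {
            reads[idx] += 1;
        }
    }
    reads
}
```

With a single row group covering bytes 4..1000 of a 1000-byte file and a split size of 250, `plan_splits` yields four splits and `reads_per_row_group` reports that the row group is read four times, once per split.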
   
   
   ### To Reproduce
   
     1. Write a Parquet data file with a single large row group (e.g. set the 
row-group size to effectively unbounded so all rows land in one row group).
     2. Plan a scan that splits the file into multiple tasks (split size < file 
size), so that several splits' byte ranges overlap the one row group. The 
splits have to be passed in through the public FileScanTask API.
     3. Read all splits and union the results.
     4. Observe that the row count is a multiple of the file's actual row count, 
because each split re-reads the whole row group.
   
   (I personally encountered the issue in DataFusion Comet)
   
   
   ### Expected behavior
   
   Each row group is assigned to exactly one split — the one whose byte range 
contains the row group's midpoint (start <= midpoint < end) — so total rows 
read equal the file's actual row count, regardless of split size vs. row-group 
size. This matches parquet-mr / Spark's split-assignment semantics.
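
A minimal sketch of the midpoint rule described above, under the same assumptions as the issue text (half-open split ranges, `start <= midpoint < end`). `RowGroup`, `midpoint`, and `select_by_midpoint` are hypothetical names for illustration only. How the row group's start offset and length are derived from Parquet metadata is left out here and would need to follow the actual reader.

```rust
/// Hypothetical stand-in for a Parquet row group's byte extent in a file.
#[derive(Debug, Clone, Copy)]
pub struct RowGroup {
    pub start: u64,
    pub len: u64,
}

/// Midpoint of the row group's byte range.
pub fn midpoint(rg: &RowGroup) -> u64 {
    rg.start + rg.len / 2
}

/// Midpoint-based selection: a split `[start, end)` owns a row group only
/// if it contains that row group's midpoint. Because contiguous,
/// non-overlapping splits partition the file, each midpoint falls into
/// exactly one of them.
pub fn select_by_midpoint(row_groups: &[RowGroup], split: (u64, u64)) -> Vec<usize> {
    let (split_start, split_end) = split;
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| {
            let mid = midpoint(rg);
            split_start <= mid && mid < split_end
        })
        .map(|(idx, _)| idx)
        .collect()
}
```

For a row group covering bytes 4..1000 and four 250-byte splits, the midpoint is 502, so only the split `[500, 750)` selects it and the other three splits select nothing.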
   
   
   ### Willingness to contribute
   
   I can contribute a fix for this bug independently


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

