mbutrovich opened a new issue, #2614:
URL: https://github.com/apache/iceberg-rust/issues/2614

   ### Apache Iceberg Rust version
   
   Latest (`main`), also reproduces on rev `83b4595`.
   
   ### Describe the bug
   
   `ArrowReader::filter_row_groups_by_byte_range` (added in #1779) selects a 
row group for every `FileScanTask` byte range that *overlaps* it:
   
   ```rust
   // crates/iceberg/src/arrow/reader/row_filter.rs
   if current_byte_offset < end && start < row_group_end {
       selected.push(idx);
   }
   ```
   
   When a data file is split into byte ranges smaller than a row group, 
multiple tasks overlap the same row group, so each one reads it and its rows 
are duplicated.
   
   iceberg-rust's own planner never produces such splits: it emits a single 
whole-file task per file (`start=0, length=file_size`, see 
[`ManifestEntryContext::into_file_scan_task`](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/scan/context.rs)),
 so this stays latent for internal use. But `ArrowReader::read` is a public 
entry point, and external engines that plan with Iceberg-Java/Spark and drive 
`ArrowReader` directly do hit it. Apache DataFusion Comet is the concrete case 
([apache/datafusion-comet#4590](https://github.com/apache/datafusion-comet/issues/4590)):
 Spark tiles a file into `split-size` chunks regardless of row-group layout, 
producing many sub-row-group byte ranges, and the native Iceberg scan returns 
duplicate rows.
   
   parquet-java assigns each row group to exactly one split by *midpoint 
ownership*: a split owns a row group only if its range contains the row group's 
midpoint 
([`ParquetMetadataConverter.filterFileMetaDataByMidpoint`](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1287-L1290)):
   
   ```java
   long midPoint = startIndex + totalSize / 2;
   if (filter.contains(midPoint)) {
     newRowGroups.add(rowGroup);
   }
   ```
   
   `filter_row_groups_by_byte_range` should do the same. Because splits tile 
the file contiguously and disjointly, exactly one split's range contains a 
given midpoint, so each row group is read once. This is a no-op for whole-file 
tasks (every midpoint lies in `[0, file_size)`), so iceberg-rust's own scan 
path is unchanged.
   
   This was partially addressed by #1779, which added byte-range filtering but 
used overlap rather than midpoint semantics. The overlap formula happens to be 
correct only when splits land exactly on row-group boundaries (which the 
existing `test_file_splits_respect_byte_ranges` covers), so the sub-row-group 
case slipped through.
   
   ### To Reproduce
   
   Write a Parquet file with 3 row groups of 100 rows each, tile the whole file 
into 64-byte splits, read each split, and count rows. A 64-byte split is far 
smaller than a row group, so the overlap formula reads each row group from 
several splits and returns roughly 2800 rows instead of 300. Whole-file reads 
(`start=0, length=0` or `length=file_size`) are unaffected.
   
   ### Expected behavior
   
   Each row group is read by exactly one split, so the total rows across all 
splits equals the file's row count, matching parquet-java / Iceberg-Java 
semantics.
   
   ### Willingness to contribute
   
   I can contribute the fix independently (patch and regression test ready).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to