[PR] fix: [DO NOT MERGE] Iceberg scan duplicates rows when splitting a single-row-group Parquet file into multiple byte-range tasks [datafusion-comet]

via GitHub Wed, 10 Jun 2026 11:17:03 -0700


mbutrovich opened a new pull request, #4621:
URL: https://github.com/apache/datafusion-comet/pull/4621


   ## Which issue does this PR close?
   
   Closes #4590.
   
   ## Rationale for this change
   
   The native Iceberg scan returns duplicate rows when an Iceberg 
`FileScanTask` splits a data file into byte ranges smaller than a row group. 
Spark/Iceberg planning tiles a file by `split-size` regardless of row-group 
layout, so a single row group can be covered by several `FileScanTask`s. The 
root cause is in iceberg-rust's `ArrowReader`: it selected a row group for 
every task byte range that *overlapped* it, so each overlapping split read the 
row group and its rows were duplicated.
   
   Comet only forwards the `start`/`length` Spark planned; serialization and 
the native round-trip are correct. The fix belongs in iceberg-rust, which now 
assigns each row group to the single split whose range contains the row group's 
midpoint (matching parquet-java semantics). See apache/iceberg-rust#2614 and 
the fix in apache/iceberg-rust#2615.
   
   The Parquet path (`native_datafusion`) is unaffected: arrow-rs/DataFusion 
already use midpoint ownership, confirmed by the regression test added here.
   
   ## What changes are included in this PR?
   
   - Bump the `iceberg` / `iceberg-storage-opendal` dependency to a rev that 
includes the row-group midpoint fix.
   - Adapt `IcebergScanExec` to the updated iceberg-rust API: 
`ArrowReaderBuilder::new` now takes an `iceberg::Runtime`. `execute()` builds 
the reader on the Spark executor thread (outside any tokio context), so the 
runtime is constructed from Comet's shared global runtime via 
`IcebergRuntime::new(get_runtime())` rather than `Runtime::current()`.
   - Add regression tests for both scan paths.
   
   > [!NOTE]
   > Draft: `native/Cargo.toml` currently points at `mbutrovich/iceberg-rust` 
branch `filescantask_midpoint` so CI can exercise apache/iceberg-rust#2615 
before it merges. This will be repointed to `apache/iceberg-rust` at the merged 
SHA before this PR leaves draft.
   
   ## How are these changes tested?
   
   Two new tests, both verified to fail before the fix and pass after:
   
   - `CometIcebergNativeSuite` "native Iceberg scan does not duplicate a row 
group split by byte range" — writes a single-row-group file, registers it as an 
Iceberg table, reads it with `split-size`/`file-open-cost` of 64 to force 
sub-row-group splits, and asserts the matching row is returned exactly once. 
Returned 2 rows before the fix.
   - `CometNativeReaderSuite` "native scan does not duplicate a row group split 
by byte range" — the same scenario on the `native_datafusion` Parquet path, 
guarding against regression. Asserts the file is actually split into multiple 
partitions and still returns each row once.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] fix: [DO NOT MERGE] Iceberg scan duplicates rows when splitting a single-row-group Parquet file into multiple byte-range tasks [datafusion-comet]

Reply via email to