Re: [PR] fix(reader): `filter_row_groups_by_byte_range` duplicates rows for sub-row-group file splits [iceberg-rust]

via GitHub Wed, 10 Jun 2026 23:41:10 -0700


advancedxy commented on code in PR #2615:
URL: https://github.com/apache/iceberg-rust/pull/2615#discussion_r3393816511



##########
crates/iceberg/src/arrow/reader/row_filter.rs:
##########
@@ -160,8 +160,14 @@ impl ArrowReader {
 
     /// Filters row groups by byte range to support Iceberg's file splitting.
     ///
-    /// Iceberg splits large files at row group boundaries, so we only read row groups
-    /// whose byte ranges overlap with [start, start+length).
+    /// External engines (e.g. Spark via Comet) split a data file into multiple scan tasks,

Review Comment:
   I think the comment could be updated to reflect that in most (normal) 
cases Iceberg Parquet files are split at row group boundaries. Parquet files 
are split at the requested split size only when the splitOffsets metadata is 
missing during planning.
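
   To illustrate why the distinction matters, here is a minimal sketch. It is 
not the code in this PR: `RowGroupSpan`, `select_by_overlap`, and 
`select_by_midpoint` are hypothetical names, and the midpoint rule is just one 
common convention for giving each row group a single owner. When splits are 
smaller than a row group, an overlap test selects the same row group for 
several splits, while an ownership test selects it for exactly one.

```rust
/// Hypothetical descriptor: byte offset and compressed size of a row group.
#[derive(Debug, Clone, Copy)]
struct RowGroupSpan {
    start: u64,
    len: u64,
}

/// Overlap rule: pick every row group that intersects [start, start + length).
/// A row group spanning several splits is picked by each of them.
fn select_by_overlap(groups: &[RowGroupSpan], start: u64, length: u64) -> Vec<usize> {
    let end = start + length;
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| g.start < end && g.start + g.len > start)
        .map(|(i, _)| i)
        .collect()
}

/// Ownership rule (assumed convention): pick a row group only if its midpoint
/// falls in [start, start + length), so exactly one split owns it.
fn select_by_midpoint(groups: &[RowGroupSpan], start: u64, length: u64) -> Vec<usize> {
    let end = start + length;
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| {
            let mid = g.start + g.len / 2;
            mid >= start && mid < end
        })
        .map(|(i, _)| i)
        .collect()
}
```

   With row groups at bytes [4, 100) and [100, 200) and 50-byte splits, the 
overlap rule returns row group 0 for both [0, 50) and [50, 100), whereas the 
midpoint rule returns it only for [50, 100).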



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

