mbutrovich opened a new issue, #2306:
URL: https://github.com/apache/iceberg-rust/issues/2306

   ### Describe the bug
   
   `build_fallback_field_id_map` maps Iceberg field IDs to wrong Parquet leaf 
column indices when the schema contains nested types (struct, list, map). This 
causes predicate evaluation to crash on migrated Parquet files (files without 
embedded field IDs).
   
   **Error:**
   "Leave column id in predicates isn't a root column in Parquet schema"
   
   This affects migrated tables where Parquet files were written by Spark/Hive 
without Iceberg field IDs, then imported via `add_files` or 
`importSparkTable()`.
   
   ### Root Cause
   
   #### How fallback field IDs work
   
   When a Parquet file lacks embedded field IDs, iceberg-rust assigns 
position-based fallback IDs. Two functions must agree on the mapping:
   
   1. `add_fallback_field_ids_to_arrow_schema` — assigns field IDs 1, 2, 3... 
to **top-level** Arrow schema fields
   2. `build_fallback_field_id_map` — maps those field IDs to Parquet **leaf** 
column indices for predicate evaluation
   
   #### What goes wrong
   
   `build_fallback_field_id_map` iterates over `parquet_schema.columns()` (leaf 
columns) instead of top-level fields. Nested types expand into multiple leaves,
   causing the mapping to diverge from the Arrow schema's field IDs.
   
   **Example:** `name: string, address: struct(street: string, city: string), 
id: int`
   
   | | Arrow top-level fields | Parquet leaf columns |
   |---|---|---|
   | Fields | name, address, id | name, street, city, id |
   | Assigned field IDs | 1, 2, 3 | 1, 2, 3, 4 (bug) |
   
   When a predicate references `id` (field_id=3 from Arrow), the column map 
returns 0-based leaf index 2, which is `city` inside the `address` group, 
rather than leaf index 3 (`id`). `PredicateConverter::bound_reference` then 
calls `get_column_root(2).is_group()` → `true` → error.
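
The divergence can be reproduced with a toy model. This is a minimal sketch 
only: `Field`, `leaves`, `leaf_based_map`, and `example_schema` are 
hypothetical names made up for illustration, not types or functions from 
iceberg-rust or the `parquet` crate. It assumes leaf columns are numbered in 
depth-first order and that the buggy map assigns one fallback ID per leaf.

```rust
use std::collections::HashMap;

/// Toy stand-in for a Parquet schema node (hypothetical, for illustration).
enum Field {
    Primitive(&'static str),
    Group(&'static str, Vec<Field>),
}

/// Flatten the schema into leaf column names in depth-first order.
fn leaves(fields: &[Field]) -> Vec<&'static str> {
    let mut out = Vec::new();
    for f in fields {
        match f {
            Field::Primitive(name) => out.push(*name),
            Field::Group(_, children) => out.extend(leaves(children)),
        }
    }
    out
}

/// Mapping as described in this issue: one fallback ID per *leaf* column,
/// so ID n points at 0-based leaf index n - 1.
fn leaf_based_map(fields: &[Field]) -> HashMap<i32, usize> {
    leaves(fields)
        .iter()
        .enumerate()
        .map(|(idx, _)| (idx as i32 + 1, idx))
        .collect()
}

/// name: string, address: struct(street: string, city: string), id: int
fn example_schema() -> Vec<Field> {
    vec![
        Field::Primitive("name"),
        Field::Group(
            "address",
            vec![Field::Primitive("street"), Field::Primitive("city")],
        ),
        Field::Primitive("id"),
    ]
}
```

With this model, looking up field ID 3 in `leaf_based_map(&example_schema())` 
yields leaf index 2, and `leaves(&example_schema())[2]` is `"city"`, matching 
the table above.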
   
   ### How Iceberg Java handles this
   
   Java's 
[`ParquetSchemaUtil.addFallbackIds()`](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java#L174-L184)
 iterates **top-level fields**, not leaf columns:
   
   ```java
   public static MessageType addFallbackIds(MessageType fileSchema) {
       MessageTypeBuilder builder = org.apache.parquet.schema.Types.buildMessage();
       int ordinal = 1;
       for (Type type : fileSchema.getFields()) {
           builder.addField(type.withId(ordinal));
           ordinal += 1;
       }
       return builder.named(fileSchema.getName());
   }
   ```
   
   Additionally, Java's 
[`ParquetMetricsRowGroupFilter`](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java)
 handles nested types gracefully: predicates on nested columns return 
`ROWS_MIGHT_MATCH` instead of crashing.
   
   ### Proposed Fix
   
   Change `build_fallback_field_id_map` to iterate over 
`parquet_schema.root_schema().get_fields()` (top-level fields) instead of 
`parquet_schema.columns()` (leaf columns).
   
   For each top-level field:
   - If primitive: map `ordinal` → `leaf_column_index`
   - If group (struct/list/map): skip the mapping, advance the leaf counter 
past all leaves in that group
   
   This makes `build_fallback_field_id_map` consistent with 
`add_fallback_field_ids_to_arrow_schema`, which already correctly iterates 
top-level Arrow fields.
   
   `PredicateConverter::bound_reference` already validates that the resolved 
column is a root column and rejects groups, so no changes are needed there.
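
The proposed iteration can be sketched as follows. This is an illustrative 
model under stated assumptions, not the actual patch: `Field`, `leaf_count`, 
and `top_level_fallback_map` are hypothetical names, and the toy `Field` enum 
stands in for the real Parquet schema types.

```rust
use std::collections::HashMap;

/// Toy stand-in for a Parquet schema node (hypothetical, for illustration).
enum Field {
    Primitive(&'static str),
    Group(&'static str, Vec<Field>),
}

/// Number of leaf columns underneath a field (1 for a primitive).
fn leaf_count(field: &Field) -> usize {
    match field {
        Field::Primitive(_) => 1,
        Field::Group(_, children) => children.iter().map(leaf_count).sum(),
    }
}

/// Assign fallback IDs 1, 2, 3... to *top-level* fields and map only the
/// primitive ones to their 0-based leaf column index.
fn top_level_fallback_map(fields: &[Field]) -> HashMap<i32, usize> {
    let mut map = HashMap::new();
    let mut leaf_idx = 0usize;
    for (pos, field) in fields.iter().enumerate() {
        let field_id = pos as i32 + 1;
        match field {
            Field::Primitive(_) => {
                map.insert(field_id, leaf_idx);
                leaf_idx += 1;
            }
            // Groups get no entry; skip past every leaf they contain.
            Field::Group(..) => leaf_idx += leaf_count(field),
        }
    }
    map
}
```

For the example schema `name, address(street, city), id`, this sketch maps 
field ID 1 to leaf 0 and field ID 3 to leaf 3, and leaves field ID 2 
(`address`) unmapped.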
   
   ### Files to modify
   
   1. `crates/iceberg/src/arrow/reader.rs`: `build_fallback_field_id_map`
   
   ### Related
   
   - https://github.com/apache/datafusion-comet/issues/3860: Downstream issue 
in Comet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

