suxiaogang223 opened a new issue, #2403:
URL: https://github.com/apache/iceberg-rust/issues/2403

   ### Apache Iceberg Rust version
   
   main (`3048bd3401ba`)
   
   ### Describe the bug
   
   When reading Parquet files without embedded field IDs, `ArrowReader` applies 
`schema.name-mapping.default` to the Arrow schema if 
`FileScanTask.name_mapping` is present. However, the subsequent projection and 
predicate planning still use the original `missing_field_ids` boolean, so the 
reader takes the position-based fallback path instead of the field-id-based 
path after name mapping has been applied.
   
   This makes the name mapping branch inconsistent with the intended 
Java-compatible strategy described in the comments:
   
   - embedded field IDs -> use field-id projection
   - name mapping present -> apply name mapping, then use field-id projection
   - no name mapping -> use position fallback
   
   Current code effectively does:
   
   - embedded field IDs -> use field-id projection
   - name mapping present -> apply name mapping, then still use position 
fallback
   - no name mapping -> use position fallback
   
   Relevant code:
   
   - `crates/iceberg/src/arrow/reader/pipeline.rs`: `missing_field_ids` is 
computed before applying name mapping, then passed unchanged to 
`get_arrow_projection_mask`.
   - `crates/iceberg/src/arrow/reader/projection.rs`: 
`get_arrow_projection_mask(..., use_fallback = true)` calls 
`get_arrow_projection_mask_fallback`, which maps `field_id N` to top-level 
physical column position `N - 1`.
   - `crates/iceberg/src/arrow/reader/projection.rs`: 
`apply_name_mapping_to_arrow_schema` adds `PARQUET:field_id` metadata to the 
Arrow schema, but that mapped field-id metadata is not used by the fallback 
projection path.
   
   This can produce incorrect results for migrated Hive/Spark Parquet files 
where physical column order does not match Iceberg field IDs assigned by name 
mapping. It can also affect predicate pushdown / row filtering because 
`build_field_id_set_and_map` builds a fallback map from the original Parquet 
schema descriptor when embedded field IDs are missing.
   
   ### To Reproduce
   
   One minimal scenario:
   
   1. Use an Iceberg schema with field IDs that do not match the physical 
positions of a migrated Parquet file:
   
   ```text
   id       -> field_id 1
   name     -> field_id 2
   dept     -> field_id 3
   subdept  -> field_id 4
   ```
   
   2. Read a Parquet file without embedded field IDs whose physical columns are:
   
   ```text
   [name, subdept]
   ```
   
   3. Provide name mapping:
   
   ```text
   name    -> field_id 2
   subdept -> field_id 4
   ```
   
   4. Project only `name` and `subdept`, or filter on `name`.
   
   Expected projection mapping after name mapping:
   
   ```text
   field_id 2 -> physical column 0 (name)
   field_id 4 -> physical column 1 (subdept)
   ```
   
   Current fallback projection mapping:
   
   ```text
   field_id 2 -> physical column 1 (subdept)
   field_id 4 -> physical column 3 (out of range, ignored)
   ```
   
   The downstream `RecordBatchTransformer` then sees only `subdept(field_id=4)` 
in the source batch, treats `name(field_id=2)` as missing, and fills `name` 
with its initial default or NULL instead of reading the actual Parquet `name` 
column.
   
   For predicates such as `WHERE name = ...`, the fallback field-id map can 
evaluate the predicate on the wrong physical column (`subdept`) or use the 
wrong row-group/page statistics.
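
The fallback mapping in this scenario can be reproduced with a small standalone sketch. It is not the crate's code: `fallback_projection` and `columns_read` are hypothetical helpers that only model the `field_id N -> physical position N - 1` rule described above, assuming flat top-level columns.

```rust
/// Hypothetical model of the position fallback: field_id N maps to
/// top-level physical column N - 1, and out-of-range positions are dropped.
fn fallback_projection(projected_field_ids: &[i32], num_columns: usize) -> Vec<(i32, usize)> {
    projected_field_ids
        .iter()
        .filter_map(|&id| {
            // Field IDs below 1 have no valid position under this rule.
            let pos = usize::try_from(id.checked_sub(1)?).ok()?;
            (pos < num_columns).then_some((id, pos))
        })
        .collect()
}

/// Pairs each requested field ID with the physical column that is actually read.
fn columns_read<'a>(physical: &[&'a str], projection: &[(i32, usize)]) -> Vec<(i32, &'a str)> {
    projection.iter().map(|&(id, pos)| (id, physical[pos])).collect()
}
```

With physical columns `["name", "subdept"]` and projected field IDs `[2, 4]`, this model resolves field ID 2 to the `subdept` column and drops field ID 4, which matches the incorrect mapping listed above.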
   
   ### Expected behavior
   
   After name mapping is applied, projection and predicate planning should use 
the field IDs assigned by name mapping, not the position fallback path.
   
   A possible fix is to split the current boolean into an explicit strategy, 
for example:
   
   ```rust
   enum FieldIdStrategy {
       Embedded,
       NameMapping,
       FallbackPosition,
   }
   ```
   
   Then:
   
   ```text
   Embedded         -> field-id projection / field-id predicate map
   NameMapping      -> field-id projection / field-id predicate map based on mapped schema
   FallbackPosition -> position fallback projection / fallback predicate map
   ```
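
As a rough sketch of how such a strategy could be selected, assuming embedded field IDs take precedence over name mapping (the order of the three branches listed earlier), something like the following might work. The `select` and `use_fallback` methods are hypothetical names for illustration and do not exist in the crate.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FieldIdStrategy {
    Embedded,
    NameMapping,
    FallbackPosition,
}

impl FieldIdStrategy {
    /// Hypothetical selector. Assumes embedded IDs win over name mapping,
    /// and position fallback is used only when neither is available.
    fn select(has_embedded_field_ids: bool, has_name_mapping: bool) -> Self {
        if has_embedded_field_ids {
            Self::Embedded
        } else if has_name_mapping {
            Self::NameMapping
        } else {
            Self::FallbackPosition
        }
    }

    /// Only the no-name-mapping case should take the position fallback path.
    fn use_fallback(self) -> bool {
        matches!(self, Self::FallbackPosition)
    }
}
```

The value passed as `use_fallback` to the projection code would then come from the strategy rather than from the original `missing_field_ids` boolean.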
   
   In particular, `use_fallback` should only be true for the no-name-mapping 
fallback case. Predicate pushdown should also build `field_id -> parquet column 
index` from the mapped schema, or from another mapping that reflects 
`schema.name-mapping.default`, instead of always falling back to physical 
positions when the original Parquet schema has no embedded IDs.
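
For the predicate side, a minimal sketch of deriving `field_id -> parquet column index` from a name mapping could look like this. It assumes flat top-level columns and a plain name-to-ID map; real name mappings also cover nested fields and aliases, and `field_id_to_column_index` is a hypothetical helper, not an existing function.

```rust
use std::collections::HashMap;

/// Hypothetical helper: resolve each physical column through the name
/// mapping and record which column index carries each field ID.
/// Columns without a mapping entry are skipped.
fn field_id_to_column_index(
    physical_columns: &[&str],
    name_mapping: &HashMap<&str, i32>,
) -> HashMap<i32, usize> {
    physical_columns
        .iter()
        .enumerate()
        .filter_map(|(idx, name)| name_mapping.get(*name).map(|&id| (id, idx)))
        .collect()
}
```

For the scenario above, this yields field ID 2 at column 0 and field ID 4 at column 1, so a filter on `name` would be evaluated against the `name` column and its statistics.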
   
   ### Willingness to contribute
   
   I would be willing to contribute a fix for this bug with guidance from the 
Iceberg community.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

