suxiaogang223 opened a new issue, #2403:
URL: https://github.com/apache/iceberg-rust/issues/2403
### Apache Iceberg Rust version
main (`3048bd3401ba`)
### Describe the bug
When reading Parquet files without embedded field IDs, `ArrowReader` applies
`schema.name-mapping.default` to the Arrow schema if
`FileScanTask.name_mapping` is present. However, the subsequent projection and
predicate planning still use the original `missing_field_ids` boolean, so the
reader takes the position-based fallback path instead of the field-id-based
path after name mapping has been applied.
This makes the name mapping branch inconsistent with the intended
Java-compatible strategy described in the comments:
- embedded field IDs -> use field-id projection
- name mapping present -> apply name mapping, then use field-id projection
- no name mapping -> use position fallback
Current code effectively does:
- embedded field IDs -> use field-id projection
- name mapping present -> apply name mapping, then still use position
fallback
- no name mapping -> use position fallback
Relevant code:
- `crates/iceberg/src/arrow/reader/pipeline.rs`: `missing_field_ids` is
computed before applying name mapping, then passed unchanged to
`get_arrow_projection_mask`.
- `crates/iceberg/src/arrow/reader/projection.rs`:
`get_arrow_projection_mask(..., use_fallback = true)` calls
`get_arrow_projection_mask_fallback`, which maps `field_id N` to top-level
physical column position `N - 1`.
- `crates/iceberg/src/arrow/reader/projection.rs`:
`apply_name_mapping_to_arrow_schema` adds `PARQUET:field_id` metadata to the
Arrow schema, but that mapped field-id metadata is not used by the fallback
projection path.
This can produce incorrect results for migrated Hive/Spark Parquet files
where physical column order does not match Iceberg field IDs assigned by name
mapping. It can also affect predicate pushdown / row filtering because
`build_field_id_set_and_map` builds a fallback map from the original Parquet
schema descriptor when embedded field IDs are missing.
### To Reproduce
One minimal scenario:
1. Use an Iceberg schema with field IDs that do not match the physical
positions of a migrated Parquet file:
```text
id -> field_id 1
name -> field_id 2
dept -> field_id 3
subdept -> field_id 4
```
2. Read a Parquet file without embedded field IDs whose physical columns are:
```text
[name, subdept]
```
3. Provide name mapping:
```text
name -> field_id 2
subdept -> field_id 4
```
4. Project only `name` and `subdept`, or filter on `name`.
Expected projection mapping after name mapping:
```text
field_id 2 -> physical column 0 (name)
field_id 4 -> physical column 1 (subdept)
```
Current fallback projection mapping:
```text
field_id 2 -> physical column 1 (subdept)
field_id 4 -> physical column 3 (out of range, ignored)
```
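The two mappings above can be reproduced with a small standalone sketch. It does not call the iceberg-rust reader; `fallback_positions` and `name_mapped_positions` are hypothetical helpers that only model the `N - 1` rule and the name-mapping lookup described in this issue.

```rust
use std::collections::HashMap;

/// Models the position fallback described above: field_id N resolves to
/// top-level physical column N - 1, and out-of-range positions are dropped.
fn fallback_positions(projected_ids: &[i32], num_columns: usize) -> Vec<(i32, usize)> {
    projected_ids
        .iter()
        .filter_map(|&id| {
            if id >= 1 && ((id - 1) as usize) < num_columns {
                Some((id, (id - 1) as usize))
            } else {
                None
            }
        })
        .collect()
}

/// Models name-mapping resolution: each physical column gets the field ID
/// that the mapping assigns to its name, then projected IDs are looked up.
fn name_mapped_positions(
    projected_ids: &[i32],
    physical_columns: &[&str],
    name_mapping: &HashMap<&str, i32>,
) -> Vec<(i32, usize)> {
    let id_to_pos: HashMap<i32, usize> = physical_columns
        .iter()
        .enumerate()
        .filter_map(|(pos, name)| name_mapping.get(*name).map(|&id| (id, pos)))
        .collect();
    projected_ids
        .iter()
        .filter_map(|id| id_to_pos.get(id).map(|&pos| (*id, pos)))
        .collect()
}

fn main() {
    let physical = ["name", "subdept"];
    let mapping = HashMap::from([("name", 2), ("subdept", 4)]);
    let projected = [2, 4];

    // Pairs are (field_id, physical column index).
    println!("fallback:     {:?}", fallback_positions(&projected, physical.len()));
    println!("name mapping: {:?}", name_mapped_positions(&projected, &physical, &mapping));
}
```

With the inputs from this scenario, the fallback helper keeps only `(2, 1)`, while the name-mapping helper yields `(2, 0)` and `(4, 1)`.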
The downstream `RecordBatchTransformer` then sees only `subdept(field_id=4)`
in the source batch, treats `name(field_id=2)` as missing, and fills `name`
with its initial default or NULL instead of reading the actual Parquet `name`
column.
For predicates such as `WHERE name = ...`, the fallback field-id map can
evaluate the predicate on the wrong physical column (`subdept`) or use the
wrong row-group/page statistics.
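The predicate problem can be illustrated the same way. This is a simplified model, not the reader's actual row filter: `matching_rows` is a hypothetical helper, the sample values are made up, and the fallback map assumes that physical position `i` is treated as field ID `i + 1`, consistent with the `N - 1` rule above.

```rust
use std::collections::HashMap;

/// Hypothetical helper: returns the row indices where the column resolved for
/// `field_id` equals `literal`, or None if the field cannot be resolved.
fn matching_rows(
    columns: &[Vec<&str>],
    id_to_column: &HashMap<i32, usize>,
    field_id: i32,
    literal: &str,
) -> Option<Vec<usize>> {
    let column = columns.get(*id_to_column.get(&field_id)?)?;
    Some(
        column
            .iter()
            .enumerate()
            .filter(|&(_, value)| *value == literal)
            .map(|(row, _)| row)
            .collect(),
    )
}

fn main() {
    // Physical columns of the migrated file: [name, subdept] (made-up values).
    let columns = vec![vec!["alice", "bob"], vec!["platform", "alice"]];
    // Assumed fallback map: column position i is treated as field_id i + 1.
    let fallback = HashMap::from([(1, 0), (2, 1)]);
    // Map derived from the name mapping: name -> 2, subdept -> 4.
    let mapped = HashMap::from([(2, 0), (4, 1)]);

    // WHERE name = 'alice', where name is field_id 2.
    println!("fallback: {:?}", matching_rows(&columns, &fallback, 2, "alice"));
    println!("mapped:   {:?}", matching_rows(&columns, &mapped, 2, "alice"));
}
```

Under the fallback map the comparison runs against `subdept` and selects row 1; under the mapped IDs it runs against `name` and selects row 0.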
### Expected behavior
After name mapping is applied, projection and predicate planning should use
the field IDs assigned by name mapping, not the position fallback path.
A possible fix is to split the current boolean into an explicit strategy,
for example:
```rust
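enum FieldIdStrategy {
    /// The Parquet file carries embedded field IDs.
    Embedded,
    /// Field IDs are assigned by `schema.name-mapping.default`.
    NameMapping,
    /// No embedded IDs and no name mapping: fall back to column position.
    FallbackPosition,
}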
```
Then:
```text
Embedded -> field-id projection / field-id predicate map
NameMapping -> field-id projection / field-id predicate map based on
mapped schema
FallbackPosition -> position fallback projection / fallback predicate map
```
In particular, `use_fallback` should only be true for the no-name-mapping
fallback case. Predicate pushdown should also build `field_id -> parquet column
index` from the mapped schema, or from another mapping that reflects
`schema.name-mapping.default`, instead of always falling back to physical
positions when the original Parquet schema has no embedded IDs.
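A minimal sketch of how such a strategy could be selected and how `use_fallback` would follow from it. The `select` and `use_fallback` methods and their boolean inputs are hypothetical, not existing iceberg-rust APIs, and giving embedded IDs priority over name mapping is an assumption based on the strategy list in this issue.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FieldIdStrategy {
    Embedded,
    NameMapping,
    FallbackPosition,
}

impl FieldIdStrategy {
    /// Hypothetical selection rule: embedded IDs win, then name mapping,
    /// and position fallback is used only when neither is available.
    fn select(has_embedded_ids: bool, has_name_mapping: bool) -> Self {
        match (has_embedded_ids, has_name_mapping) {
            (true, _) => Self::Embedded,
            (false, true) => Self::NameMapping,
            (false, false) => Self::FallbackPosition,
        }
    }

    /// Only the no-name-mapping case should take the position fallback path.
    fn use_fallback(self) -> bool {
        matches!(self, Self::FallbackPosition)
    }
}

fn main() {
    for (embedded, mapping) in [(true, false), (false, true), (false, false)] {
        let strategy = FieldIdStrategy::select(embedded, mapping);
        // The behavior this issue reports: the flag only reflects missing IDs.
        let current_flag = !embedded;
        println!(
            "{:?}: use_fallback = {} (current flag = {})",
            strategy,
            strategy.use_fallback(),
            current_flag
        );
    }
}
```

The only case where the two flags differ is `NameMapping`, which is the case this issue is about.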
### Willingness to contribute
I would be willing to contribute a fix for this bug with guidance from the
Iceberg community.