sezruby opened a new issue, #12241:
URL: https://github.com/apache/gluten/issues/12241
### Backend
VL (Velox)
### Bug description
Delta supports two column-mapping modes: `name` and `id`. The two modes
write the same parquet files (column names = Delta's `physicalName`, plus
`parquet.field.id` metadata) but differ at *read* time:
- **NameMapping** — reader matches columns by physical name. Vanilla Spark +
Delta strips `parquet.field.id` from the read schema in
`DeltaParquetFileFormat.prepareSchemaForRead` (delta-io/delta
`DeltaParquetFileFormat.scala`), so the parquet reader uses names.
- **IdMapping** — reader matches columns by `parquet.field.id`. The metadata
is *kept* in the read schema and Spark's parquet reader (with
`spark.sql.parquet.fieldId.read.enabled = true`) does field-id-based matching.
This is the only mode that survives schema-evolution edge cases (e.g. `DROP` +
re-add of a column with the same name, restored older snapshots, or Delta
UNIFORM/Iceberg interop where the parquet writer is not Delta and may use
arbitrary column names).
Gluten/Velox does not implement field-id matching for Delta IdMapping. The
native scan reads by physical name, which is correct in the common case
(Delta-only writes never rename a physical name) but silently produces wrong
results when the parquet file's column name diverges from Delta's current
`physicalName`.
### What the code does today
`DeltaPostTransformRules.transformColumnMappingPlan` in `gluten-delta`
rewrites the scan output / `dataSchema` / `requiredSchema` to physical names.
The metadata that `DeltaColumnMapping.createPhysicalSchema` produces does
include `parquet.field.id` (line 270 in delta-io/delta
`DeltaColumnMapping.scala`), but Gluten passes the schema to native via
Substrait's `NamedStruct`, which only carries names — `field_id` is dropped on
the way down. The native side has no idea that field-id matching should be used.
The existing fallback in `FileSourceScanExecTransformer.doValidateInternal`:
```scala
if (hasFieldIds) {
return ValidationResult.failed("Unsupported matching schema column names
by field ids in native scan.")
}
```
does NOT fire for Delta IdMapping, because validation runs **before**
`DeltaPostTransformRules`, so at that point `requiredSchema` is the original
logical schema with no `parquet.field.id` metadata. The post-transform later
adds the field_id metadata, but the scan is already on the native path.
### Reproducer (the kind of case where the bug surfaces)
A reproducer that exercises a real divergence between Delta's `physicalName`
and the parquet column name needs schema evolution that exposes the
field_id-vs-name distinction. The simplest path:
1. Create a Delta table with `delta.columnMapping.mode = id`, two columns
`(a, b)`.
2. Drop column `a`, re-add a new column `a` with a different type.
3. The new column has the same logical name `a`, a *new* `physicalName`, and
a *new* `field_id`. The old parquet files still contain the old column with its
old physical name and old field_id.
4. SELECT * — vanilla Spark uses field_id and treats new files' `a` and old
files' `a` as different columns. Gluten/Velox reads by physical name, so it
sees the old physical name in old files (which is now treated as the new
column) → values from a *different* logical column appear under `a`.
I'll add a unit test to the linked PR (#12240) that exercises this once the
IdMapping fallback lands.
### Expected behavior
When `delta.columnMapping.mode = id`:
- Either fall back to vanilla Spark when `columnMappingMode == IdMapping`, OR
- Honor `parquet.field.id` in the native reader. Velox upstream now has
[`ColumnMappingMode::kParquetFieldId`](https://github.com/facebookincubator/velox/blob/main/velox/dwio/common/Options.h)
— Gluten can plumb the field-ids through Substrait + the Hive connector to use
it.
Recommended near-term: add an explicit fallback in
`DeltaScanTransformer.doValidateInternal` when
`DeltaParquetFileFormat.columnMappingMode == IdMapping`, since the current
`hasFieldIds` check misses Delta's case. The longer-term path is plumbing
`kParquetFieldId` into the native side.
### Related
Discovered while working on #10511 / #12240 (the partition-filter
wrong-result fix). #12240 fixes the partition-pruning correctness issue for
both modes but does not address this id-mapping read-path gap.
### Gluten version
main branch
### Spark version
Spark-3.5.x
### Spark configurations
```
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.parquet.fieldId.read.enabled=true
spark.sql.parquet.fieldId.write.enabled=true
```
### System information
N/A — design issue, not environment-dependent.
### Relevant logs
N/A.
This issue was written with the assistance of AI.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]