sezruby commented on issue #12241:
URL: https://github.com/apache/gluten/issues/12241#issuecomment-4624249910
Update with what the actual native fix looks like, after digging into Velox
upstream.
### Velox status
Velox upstream has `ColumnMappingMode::kParquetFieldId` defined in
`velox/dwio/common/Options.h`, but the parquet reader still throws `VELOX_NYI`
for it:
```cpp
if (options_.columnMappingMode() == ColumnMappingMode::kParquetFieldId) {
VELOX_NYI("Parquet field ID column mapping is not implemented yet.");
}
```
The actual implementation lives in [facebookincubator/velox#17613
(`feat(parquet): Read columns by field
ID`)](https://github.com/facebookincubator/velox/pull/17613) and is still
**open**. So Gluten cannot honor `parquet.field.id` until that PR merges and
Gluten's Velox bump picks it up.
### Full fix once Velox #17613 lands (4 layers)
**1. JVM — collect Delta field_ids per output column (`gluten-delta`).** The
Delta reference schema already carries `parquet.field.id` in each
`StructField.metadata`. After
`DeltaPostTransformRules.transformColumnMappingPlan` rewrites `output` to
physical names, the field_ids can be pulled out of the metadata:
```scala
def parquetFieldIds: Option[Seq[Long]] = relation.fileFormat match {
case d: DeltaParquetFileFormat if d.columnMappingMode == IdMapping =>
Some(requiredSchema.fields.map { f =>
f.metadata.getLong(DeltaColumnMapping.PARQUET_FIELD_ID_METADATA_KEY)
})
case _ => None
}
```
**2. Substrait — carry the IDs to native.** `substrait::NamedStruct` only
carries names, so we need a Gluten extension. Mirror what
`IcebergLocalFilesNode` already does (it extends
`ReadRel.LocalFiles.FileOrFiles` with Iceberg-specific options including
equality field IDs for delete files). Add a Delta variant:
```protobuf
message DeltaReadOptions {
ColumnMapping column_mapping = 1; // NONE | NAME | ID
repeated int64 parquet_field_ids = 2; // set when column_mapping == ID
}
```
Wire this in a new `DeltaLocalFilesNode.java` (parallel to
`IcebergLocalFilesNode.java`) that sets the options on the file builder.
**3. C++ — configure the Velox reader.** In
`cpp/velox/compute/WholeStageResultIterator.cc` where Iceberg builds
`HiveIcebergSplit`, do the equivalent for Delta. Cleanest path: a
`DeltaSplitInfo` + `DeltaSplitReader` modelled on
`IcebergSplitReader::configureBaseReaderOptions` from Velox #17613:
```cpp
void DeltaSplitReader::configureBaseReaderOptions() {
FileSplitReader::configureBaseReaderOptions();
if (fileSplit_->fileFormat != dwio::common::FileFormat::PARQUET) return;
if (deltaInfo_->columnMappingMode != ColumnMapping::kId) return;
std::vector<parquet::ParquetFieldId> fieldIds;
fieldIds.reserve(deltaInfo_->parquetFieldIds.size());
for (auto id : deltaInfo_->parquetFieldIds) {
fieldIds.push_back(parquet::ParquetFieldId{static_cast<int32_t>(id),
{}});
}
baseReaderOpts_.setColumnMappingMode(
dwio::common::ColumnMappingMode::kParquetFieldId);
baseReaderOpts_.setParquetFieldIds(std::move(fieldIds));
}
```
**4. Immediate near-term fallback.** Until Velox #17613 lands, the safe
change is a one-liner in `DeltaScanTransformer.doValidateInternal` that reads
`columnMappingMode` from `DeltaParquetFileFormat` directly (instead of relying
on the existing `hasFieldIds` check, which runs before
`DeltaPostTransformRules` attaches the field_id metadata):
```scala
relation.fileFormat match {
case d: DeltaParquetFileFormat if d.columnMappingMode == IdMapping =>
return ValidationResult.failed(
"Delta column mapping mode 'id' requires parquet field-id matching, " +
"which is not yet supported in native scan.")
case _ =>
}
```
This makes Delta IdMapping fall back to vanilla Spark, which is correct on
every schema-evolution edge case. NameMapping continues to work as today.
### Suggested order
1. Land the immediate fallback (4) so id-mapping tables stop returning
silently-wrong results in the meantime.
2. Wait for Velox #17613.
3. After Velox bump, do (1) + (2) + (3) as one PR, plumbing field_ids
end-to-end. Drop the fallback in (4) at that point.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]