Re: [I] [VL][Delta] Delta column mapping mode "id" is not honored: native scan reads by name instead of parquet field_id [gluten]

via GitHub Thu, 04 Jun 2026 09:43:41 -0700


sezruby commented on issue #12241:
URL: https://github.com/apache/gluten/issues/12241#issuecomment-4624249910


   Update with what the actual native fix looks like, after digging into Velox 
upstream.
   
   ### Velox status
   
   Velox upstream has `ColumnMappingMode::kParquetFieldId` defined in 
`velox/dwio/common/Options.h`, but the parquet reader still throws `VELOX_NYI` 
for it:
   
   ```cpp
   if (options_.columnMappingMode() == ColumnMappingMode::kParquetFieldId) {
     VELOX_NYI("Parquet field ID column mapping is not implemented yet.");
   }
   ```
   
   The actual implementation lives in [facebookincubator/velox#17613 
(`feat(parquet): Read columns by field 
ID`)](https://github.com/facebookincubator/velox/pull/17613) and is still 
**open**. So Gluten cannot honor `parquet.field.id` until that PR merges and 
Gluten's Velox bump picks it up.
   
   ### Full fix once Velox #17613 lands (4 layers)
   
   **1. JVM — collect Delta field_ids per output column (`gluten-delta`).** The 
Delta reference schema already carries `parquet.field.id` in each 
`StructField.metadata`. After 
`DeltaPostTransformRules.transformColumnMappingPlan` rewrites `output` to 
physical names, the field_ids can be pulled out of the metadata:
   
   ```scala
   def parquetFieldIds: Option[Seq[Long]] = relation.fileFormat match {
     case d: DeltaParquetFileFormat if d.columnMappingMode == IdMapping =>
       Some(requiredSchema.fields.map { f =>
         f.metadata.getLong(DeltaColumnMapping.PARQUET_FIELD_ID_METADATA_KEY)
       })
     case _ => None
   }
   ```
   
   **2. Substrait — carry the IDs to native.** `substrait::NamedStruct` only 
carries names, so we need a Gluten extension. Mirror what 
`IcebergLocalFilesNode` already does (it extends 
`ReadRel.LocalFiles.FileOrFiles` with Iceberg-specific options including 
equality field IDs for delete files). Add a Delta variant:
   
   ```protobuf
   message DeltaReadOptions {
     ColumnMapping column_mapping = 1;     // NONE | NAME | ID
     repeated int64 parquet_field_ids = 2; // set when column_mapping == ID
   }
   ```
   
   Wire this in a new `DeltaLocalFilesNode.java` (parallel to 
`IcebergLocalFilesNode.java`) that sets the options on the file builder.
   
   **3. C++ — configure the Velox reader.** In 
`cpp/velox/compute/WholeStageResultIterator.cc` where Iceberg builds 
`HiveIcebergSplit`, do the equivalent for Delta. Cleanest path: a 
`DeltaSplitInfo` + `DeltaSplitReader` modelled on 
`IcebergSplitReader::configureBaseReaderOptions` from Velox #17613:
   
   ```cpp
   void DeltaSplitReader::configureBaseReaderOptions() {
     FileSplitReader::configureBaseReaderOptions();
     if (fileSplit_->fileFormat != dwio::common::FileFormat::PARQUET) return;
     if (deltaInfo_->columnMappingMode != ColumnMapping::kId) return;
   
     std::vector<parquet::ParquetFieldId> fieldIds;
     fieldIds.reserve(deltaInfo_->parquetFieldIds.size());
     for (auto id : deltaInfo_->parquetFieldIds) {
       fieldIds.push_back(parquet::ParquetFieldId{static_cast<int32_t>(id), 
{}});
     }
     baseReaderOpts_.setColumnMappingMode(
         dwio::common::ColumnMappingMode::kParquetFieldId);
     baseReaderOpts_.setParquetFieldIds(std::move(fieldIds));
   }
   ```
   
   **4. Immediate near-term fallback.** Until Velox #17613 lands, the safe 
change is a one-liner in `DeltaScanTransformer.doValidateInternal` that reads 
`columnMappingMode` from `DeltaParquetFileFormat` directly (instead of relying 
on the existing `hasFieldIds` check, which runs before 
`DeltaPostTransformRules` attaches the field_id metadata):
   
   ```scala
   relation.fileFormat match {
     case d: DeltaParquetFileFormat if d.columnMappingMode == IdMapping =>
       return ValidationResult.failed(
         "Delta column mapping mode 'id' requires parquet field-id matching, " +
         "which is not yet supported in native scan.")
     case _ =>
   }
   ```
   
   This makes Delta IdMapping fall back to vanilla Spark, which is correct on 
every schema-evolution edge case. NameMapping continues to work as today.
   
   ### Suggested order
   
   1. Land the immediate fallback (4) so id-mapping tables stop returning 
silently-wrong results in the meantime.
   2. Wait for Velox #17613.
   3. After Velox bump, do (1) + (2) + (3) as one PR, plumbing field_ids 
end-to-end. Drop the fallback in (4) at that point.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] [VL][Delta] Delta column mapping mode "id" is not honored: native scan reads by name instead of parquet field_id [gluten]

Reply via email to