[PR] [CORE] Optimize Iceberg schema field matching [gluten]

via GitHub Thu, 04 Jun 2026 01:59:58 -0700


wankunde opened a new pull request, #12233:
URL: https://github.com/apache/gluten/pull/12233


   ## What changes are proposed in this pull request?
   
   Why is this PR needed?
   
   In `IcebergScanTransformer.typesMatch()`, the struct type matching logic 
creates temporary Iceberg `Schema` objects for every Spark field:
   
   ```scala
   new Schema(currentType.fields()).findField(...)
   new Schema(iceberg.fields()).findField(...)
   ```
   
   This repeatedly rebuilds Iceberg schema indexes while checking historical 
schemas, which can become expensive for wide schemas or tables with many schema 
versions. In production thread dumps, this shows up in `Schema` / `IndexByName` 
/ `HashMap` initialization during Iceberg scan planning.
   
   Changes in this PR:
   
   This change uses `Types.StructType.field(name)` and 
`Types.StructType.field(id)` directly when matching nested struct fields.
   
   `Types.StructType` already provides field lookup by name and id, so this 
avoids constructing temporary `Schema` objects inside the field loop while 
preserving the existing matching behavior:
   - find the current field by Spark field name
   - find the old schema field by Iceberg field id
   - keep allowing added columns
   - keep detecting renamed columns by comparing field names
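   
   The matching rules listed above can be sketched roughly as follows. This is a dependency-free illustration, not the actual Gluten code: `Field`, `Struct`, and `fieldsMatch` are hypothetical stand-ins for Iceberg's `Types.NestedField`, `Types.StructType`, and the struct branch of `typesMatch()`. The stand-in lookups return `Option`, whereas the real `Types.StructType.field(...)` returns a nullable `NestedField`. Treating a name that is missing from the current schema as a mismatch is also an assumption.
   
   ```scala
   object FieldMatchingSketch {
     // Hypothetical stand-in for Types.NestedField: only id and name matter here.
     final case class Field(id: Int, name: String)
   
     // Hypothetical stand-in for Types.StructType, reduced to the two lookups
     // the description relies on.
     final case class Struct(fields: Seq[Field]) {
       // Each index is built at most once per struct and reused by every lookup,
       // instead of being rebuilt for every Spark field.
       private lazy val byName: Map[String, Field] = fields.map(f => f.name -> f).toMap
       private lazy val byId: Map[Int, Field] = fields.map(f => f.id -> f).toMap
   
       def field(name: String): Option[Field] = byName.get(name)
       def field(id: Int): Option[Field] = byId.get(id)
     }
   
     // Returns true when every requested Spark field is compatible with the old struct.
     def fieldsMatch(sparkFieldNames: Seq[String], current: Struct, old: Struct): Boolean =
       sparkFieldNames.forall { name =>
         current.field(name) match {
           // Assumption: a name absent from the current schema counts as a mismatch.
           case None => false
           case Some(cur) =>
             old.field(cur.id) match {
               // The id is unknown to the old schema: the column was added later.
               case None => true
               // Same id under a different name: the column was renamed.
               case Some(prev) => prev.name == cur.name
             }
         }
       }
   }
   ```
   
   In this sketch, both lookups reuse the struct's own lazily built maps, so the per-field work inside the loop is two hash lookups rather than two index builds.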
   
   ## How was this patch tested?
   
   Tested with the existing unit tests.
   
   ## Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Codex GPT-5


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


