NataliaLaurova opened a new issue, #15526:
URL: https://github.com/apache/iceberg/issues/15526

   ### Apache Iceberg version
   
   1.7.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   When performing a MERGE INTO operation on an Apache Iceberg table with a 
large number of columns (~450), Spark 3.5 fails during the analysis phase with 
an UNRESOLVED_COLUMN error. The error is paradoxical because the "suggested 
columns" in the error message include the exact column name and alias that the 
analyzer claims it cannot resolve.
   
   The same SQL logic works successfully on a smaller "test" version of the 
table (e.g., 10 columns) and executes successfully in other engines (e.g., 
Athena/Trino), suggesting a specific regression or limitation in the Spark 
Catalyst analyzer's ability to bind references in extremely wide 
MergeIntoTable logical plans.
   
   Environment:
   
   Spark Version: 3.5.x
   
   Iceberg Version: 1.7.1 (per the field above)
   
   Catalog: Glue Catalog
   
   Table Schema: ~450 columns, partitioned by [Insert column, e.g., date].
   
   Steps to Reproduce:
   
   Create an Iceberg table with 400+ columns.
   
   Create a source staging table/view with a similar schema.
   
   Run a MERGE INTO statement with 10+ join keys and 10+ column updates.
   
   Observe the AnalysisException despite the columns being present and 
correctly typed.
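
   The steps above can be sketched with a small generator that builds a wide-schema MERGE statement of the shape described (10+ join keys, many updated columns). Table and column names here are hypothetical placeholders, not taken from the actual job:

```python
# Sketch: generate a MERGE INTO statement against a hypothetical ~450-column
# Iceberg table, mirroring the reproduction steps above.

def build_merge_sql(target, source, join_keys, update_cols):
    """Assemble a Spark SQL MERGE statement with the given join keys
    and update columns (aliases t = target, s = source)."""
    on = " AND ".join(f"t.{k} = s.{k}" for k in join_keys)
    sets = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    return (f"MERGE INTO {target} t USING {source} s ON {on} "
            f"WHEN MATCHED THEN UPDATE SET {sets} "
            f"WHEN NOT MATCHED THEN INSERT *")

# Hypothetical 450-column schema: first 10 columns as join keys,
# the remaining 440 as update targets.
cols = [f"c{i:03d}" for i in range(450)]
sql = build_merge_sql("glue_catalog.db.wide_table",
                      "glue_catalog.db.wide_staging",
                      join_keys=cols[:10],
                      update_cols=cols[10:])
# Running `spark.sql(sql)` against a table of this width is what triggers
# the AnalysisException described below.
```
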
   
   Actual Error:
   
   ```
   AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function 
parameter with name `target`.`dmsoptimestamp` cannot be resolved. Did you mean 
one of the following? [`target`.`dmsoptimestamp`, `source`.`dmsoptimestamp`, 
`target`.`uploadtimestamp`, ...]
   ```
   
   Expected Behavior:
   
   The analyzer should successfully bind the attributes to the target alias as 
it does with narrower tables.
   
   Additional Context:
   
   Increasing spark.sql.analyzer.maxIterations does not resolve the issue.
   
   Materializing the source view into a physical table does not resolve the 
issue.
   
   The issue appears unique to the SQL MERGE syntax; the DataFrame API (Join + 
Overwrite) works as a workaround, indicating the issue lies in the 
SQL-to-Logical-Plan resolution phase.
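
   One possible shape of that DataFrame-API workaround is sketched below. This is an illustrative sketch only, not the actual job: the helper builds "prefer source when matched, else keep target" expressions, and the PySpark usage (assuming `spark`, `target_df`, `source_df`, and the table name exist) is shown in comments:

```python
# Sketch of the "Join + Overwrite" workaround: left-join target to source,
# take the source value where a match exists, then overwrite the table.

def coalesce_exprs(columns, src="s", tgt="t"):
    """Build SQL expressions that prefer the source column when present
    and fall back to the target column otherwise."""
    return [f"coalesce({src}.{c}, {tgt}.{c}) AS {c}" for c in columns]

# With PySpark (untested sketch; names are assumptions):
# joined = target_df.alias("t").join(source_df.alias("s"),
#                                    on=join_keys, how="left")
# result = joined.selectExpr(*coalesce_exprs(all_columns))
# result.writeTo("glue_catalog.db.wide_table").overwritePartitions()
```

Because this path never builds a MergeIntoTable node, it sidesteps the resolution failure described below.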
   
   The failure occurs during the resolution phase of Catalyst (i.e., in the 
analyzer, before optimization). Specifically, while the MergeIntoTable node is 
being resolved:
   
   The AttributeMap for the target relation becomes excessively large (450+ 
entries).
   
   The Rule<LogicalPlan> executor for ResolveReferences appears to hit a 
recursion or iteration limit when trying to bind the target alias to the 
specific AttributeReference in the RelationV2 scan.
   
   Engine Discrepancy: The fact that Athena (Trino-based) resolves this plan 
successfully implies that Spark's rule-based resolution of MergeIntoTable does 
not scale with schema width.
   
   The same MERGE INTO statement also succeeded in the Athena/Trino engine 
when only a limited number of columns was updated.
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

