Re: [PR] [SPARK-54172][SQL] Merge Into Schema Evolution should only add referenced columns [spark]

via GitHub Sun, 09 Nov 2025 22:19:54 -0800


szehon-ho commented on PR #52866:
URL: https://github.com/apache/spark/pull/52866#issuecomment-3509623236


   Discussed offline with @cloud-fan.  The new condition unfortunately forces us 
to move the rule so that it is evaluated after an initial pass of ResolveReferences.  
Because we are checking the assignment values (from the source table), it is 
important that these are resolved, so we can be sure each one really is an 
assignment from the source.
   
   
   Overall the logic is:
   1. ResolveReferences resolves all columns it can.  It leaves an assignment 
unresolved when its key does not exist yet in the target schema, because that 
key may be added later by schema evolution.  
   2. ResolveMergeIntoSchemaEvolution now runs after ResolveReferences.  It 
must unresolve all expressions, because they were resolved against the old 
table.  This triggers another run of ResolveReferences.
   3. The final run of ResolveReferences resolves all references against the 
new target table.
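
   The three steps above can be sketched as a toy model.  This is not Spark 
code: the `Assignment` class and the functions `resolve_references`, 
`evolve_and_unresolve` and `analyze` are hypothetical names, columns are plain 
strings, and nested fields and case sensitivity are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    key: str                      # target column being assigned
    value: str                    # source column supplying the value
    key_resolved: bool = False
    value_resolved: bool = False


def resolve_references(assignments, target_cols, source_cols):
    # Resolve whatever can be resolved; leave the rest unresolved, no exception.
    return [
        Assignment(a.key, a.value, a.key in target_cols, a.value in source_cols)
        for a in assignments
    ]


def evolve_and_unresolve(assignments, target_cols):
    # Add keys that are missing from the target but fed by a resolved source
    # column of the same name, then drop all resolution state.
    new_target = list(target_cols)
    for a in assignments:
        if not a.key_resolved and a.value_resolved and a.key == a.value:
            if a.key not in new_target:
                new_target.append(a.key)
    return [Assignment(a.key, a.value) for a in assignments], new_target


def analyze(pairs, target_cols, source_cols):
    unresolved = [Assignment(k, v) for k, v in pairs]
    first = resolve_references(unresolved, target_cols, source_cols)   # step 1
    reset, new_target = evolve_and_unresolve(first, target_cols)       # step 2
    final = resolve_references(reset, new_target, source_cols)         # step 3
    failed = [a for a in final if not (a.key_resolved and a.value_resolved)]
    if failed:
        # Only after schema evolution is an unresolved assignment an error.
        raise ValueError(f"cannot resolve assignments: {failed}")
    return final, new_target


final, new_target = analyze(
    [("name", "name"), ("age", "age")],
    target_cols=["id", "name"],
    source_cols=["id", "name", "age", "extra"],
)
print(new_target)  # ['id', 'name', 'age']
```

   In this model the source column `extra` is never added to the target 
because no assignment references it, and an assignment that is still unresolved 
after the final pass is the only case that raises.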
   
   
   Many changes:
   
   1. Change ResolveReferences to no longer eagerly throw an exception on the 
first run, but to throw on the second run, after schema evolution is evaluated.
   2. Change ResolveReferences to expand UPDATE SET * and INSERT * to fill in 
the missing assignments for columns that are in the source but not the target 
(to trigger the ResolveMergeIntoSchemaEvolution condition).
   3. ResolveMergeIntoSchemaEvolution: add a guard, 
MergeIntoTable.canEvaluateSchemaEvolution, so the rule does not trigger until 
ResolveReferences has run the first time (it checks that every assignment is 
either resolved or, if not, can possibly be resolved by schema evolution).
   4. ResolveMergeIntoSchemaEvolution: calculate sourceSchemaForEvolution, 
which is the set of source schema columns that will be added to the target, 
pruned to those that are the direct subject of an assignment whose key does 
not exist in the target table but whose value is a source column/field of 
the same name.
   5. ResolveMergeIntoSchemaEvolution: unresolve everything.  Because the rule 
now runs after ResolveReferences, we need to re-resolve everything, since the 
target table changes.
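
   As an illustration of the star expansion and the pruning described in the 
list above, here is a minimal sketch on plain column names.  The helpers 
`expand_star` and `source_schema_for_evolution` are hypothetical stand-ins, not 
the actual Spark implementation, and nested fields are not modelled.

```python
def expand_star(source_cols):
    # UPDATE SET * / INSERT *: one (key, value) pair per source column,
    # including columns the target does not have yet.
    return [(col, col) for col in source_cols]


def source_schema_for_evolution(assignments, target_cols, source_cols):
    # Keep a source column only if some assignment has a key missing from the
    # target and a value that is the source column of the same name.
    referenced = {
        key
        for key, value in assignments
        if key not in target_cols and value in source_cols and key == value
    }
    # Preserve the source column order.
    return [col for col in source_cols if col in referenced]


target = ["id", "name"]
source = ["id", "name", "age", "city"]

explicit = [("name", "name"), ("age", "age")]
print(source_schema_for_evolution(explicit, target, source))
# ['age']
print(source_schema_for_evolution(expand_star(source), target, source))
# ['age', 'city']
```

   With explicit assignments only `age` is a candidate for evolution, while 
the star form references every source column, so `city` is picked up as well.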


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


