Re: [PR] [spark] support persist source data to avoid loading data repeatedly [paimon]

via GitHub Wed, 03 Jun 2026 06:21:11 -0700


JingsongLi commented on code in PR #8081:
URL: https://github.com/apache/paimon/pull/8081#discussion_r3348787975



##########
paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/commands/MergeIntoPaimonDataEvolutionTable.scala:
##########
@@ -426,7 +451,8 @@ case class MergeIntoPaimonDataEvolutionTable(
 
       val sourceTableProjExprs =
         allReadFieldsOnSource.toSeq :+ Alias(TrueLiteral, ROW_FROM_SOURCE)()
-      val sourceTableProj = Project(sourceTableProjExprs, sourceTable)
+      val sourceChild = persistSourceDss.map(_.queryExecution.logical).getOrElse(sourceTable)

Review Comment:
   This only wires the cached source into the matched/update path. For a MERGE 
that has both matched and not-matched clauses, `insertActionInvoke` still 
builds its left-anti join from `sourceTable`, so the source is scanned again 
after the update path. Could you pass the persisted source into the insert path 
too, so the new option avoids repeated source loading for the whole merge 
action?
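
A minimal sketch of the idea, using toy stand-ins rather than the real Spark or Paimon classes: resolve the source child once, then hand the same plan to both the update path and the insert path. Every name here (`Plan`, `Scan`, `CachedRelation`, `resolveSourceChild`, `updatePlan`, `insertPlan`) is hypothetical and only illustrates the wiring; the actual signature of `insertActionInvoke` is not shown in this thread and is not assumed.

```scala
// Toy stand-ins for logical plans. These are NOT Spark or Paimon classes.
sealed trait Plan
final case class Scan(name: String) extends Plan
final case class CachedRelation(name: String) extends Plan
final case class Project(child: Plan) extends Plan
final case class LeftAntiJoin(left: Plan, right: Plan) extends Plan

object MergeSourceSketch {
  // Hypothetical helper: use the persisted plan when present,
  // otherwise fall back to the original source table.
  def resolveSourceChild(persisted: Option[Plan], sourceTable: Plan): Plan =
    persisted.getOrElse(sourceTable)

  // Update (matched) path: projects the resolved source.
  def updatePlan(sourceChild: Plan): Plan = Project(sourceChild)

  // Insert (not-matched) path: left-anti join built from the same
  // resolved source instead of the raw source table.
  def insertPlan(sourceChild: Plan, target: Plan): Plan =
    LeftAntiJoin(sourceChild, target)

  // Names of all raw scans reachable from a plan; cached relations
  // do not count as a scan in this toy model.
  def scans(plan: Plan): Seq[String] = plan match {
    case Scan(n)            => Seq(n)
    case CachedRelation(_)  => Nil
    case Project(c)         => scans(c)
    case LeftAntiJoin(l, r) => scans(l) ++ scans(r)
  }

  def main(args: Array[String]): Unit = {
    val source    = Scan("source")
    val target    = Scan("target")
    val persisted = Some(CachedRelation("source"))

    // Resolve once, reuse in both paths.
    val child = resolveSourceChild(persisted, source)
    println(scans(updatePlan(child)))
    println(scans(insertPlan(child, target)))
  }
}
```

In this toy model, neither plan contains a raw scan of the source once both paths share the resolved child; when no persisted plan is supplied, both fall back to `sourceTable` as before.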



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
