jonvex commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1249618007
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##########
@@ -348,7 +387,7 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
 * expressions to the ExpressionPayload#getInsertValue.
 */
 private def executeUpsert(sourceDF: DataFrame, parameters: Map[String, String]): Unit = {
-  val operation = if (StringUtils.isNullOrEmpty(parameters.getOrElse(PRECOMBINE_FIELD.key, ""))) {
+  val operation = if (StringUtils.isNullOrEmpty(parameters.getOrElse(PRECOMBINE_FIELD.key, "")) && updatingActions.isEmpty) {
Review Comment:
Previously, if the precombine field was not set, we simply performed an insert operation. Now that we do the join and fetch the meta columns for the existing records, the resulting behavior was very strange. If you look at the last test in TestMergeIntoTable2.scala, "Test only insert for source table in dup key without preCombineField", we would end up with 5 records in the final result. The source record with id = 1 matched both existing id = 1 records in the table, so the DataFrame sent to the Spark SQL writer contained two copies of it plus one record with id = 3, and everything was inserted. The previous behavior was to insert only one additional copy. Neither behavior makes sense: first, there are two matching records, and second, the clause is "update set *", not "insert set *". I think the new behavior makes more sense because we now perform an update for matching records.
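
To make the scenario concrete, here is a hypothetical, Hudi-free Scala sketch (the `Rec` case class, `MergeSketch` object, and the in-memory collections are all invented for illustration, not Hudi code) contrasting the two outcomes, assuming the table already holds two copies of id = 1 and that an update rewrites every matched copy:

```scala
// Standalone sketch of the two MERGE INTO behaviors described above.
object MergeSketch {
  // Hypothetical record type: key column plus a payload value.
  case class Rec(id: Int, value: String)

  // Table with a duplicate key and no preCombineField, as in the test.
  val target = Seq(Rec(1, "old-a"), Rec(1, "old-b"))
  // Source: one record matching id = 1, one new record with id = 3.
  val source = Seq(Rec(1, "new-1"), Rec(3, "new-3"))

  // Broken behavior: join, then insert every joined row. The id = 1 source
  // row matches both target copies, producing two copies of it, so the
  // final table has 2 (existing) + 2 (joined id = 1) + 1 (id = 3) = 5 rows.
  def insertOnlyResult: Seq[Rec] = target ++ source.flatMap { s =>
    val matches = target.filter(_.id == s.id)
    if (matches.isEmpty) Seq(s) else matches.map(_ => s)
  }

  // Fixed behavior: matched target rows are updated in place, unmatched
  // source rows are inserted, leaving 2 (updated id = 1) + 1 (id = 3) = 3.
  def upsertResult: Seq[Rec] = {
    val updated  = target.map(t => source.find(_.id == t.id).getOrElse(t))
    val inserted = source.filterNot(s => target.exists(_.id == s.id))
    updated ++ inserted
  }

  def main(args: Array[String]): Unit = {
    println(s"insert-only: ${insertOnlyResult.size}, upsert: ${upsertResult.size}")
  }
}
```

This is only a model of the counts the comment describes; the real code paths go through ExpressionPayload and the Spark SQL writer.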
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]