hj2016 commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r549240143



##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkWriteHelper.java
##########
@@ -62,7 +62,7 @@ public static SparkWriteHelper newInstance() {
       // we cannot allow the user to change the key or partitionPath, since 
that will affect
       // everything
       // so pick it from one of the records.
-      return new HoodieRecord<T>(rec1.getKey(), reducedData);
+      return new HoodieRecord<T>(rec1.getData().equals(reducedData) ? 
rec1.getKey() : rec2.getKey(), reducedData);

Review comment:
       For example, suppose two records with the same primary key are upserted:

       id | partitionPath | updateTime
       1  | 2018          | 2019-01-01
       1  | 2019          | 2019-02-01

       After the data is deduplicated:
       Expected return: (1,2019) -> (1,2019,2019-02-01)
       Actual return:   (1,2018) -> (1,2019,2019-02-01)

       Without this change, the HoodieKey and the record data become 
inconsistent, so the record is written to the wrong partition.
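
       The fix can be sketched with a minimal, self-contained example. The 
classes below are hypothetical stand-ins for Hudi's HoodieKey/HoodieRecord 
(not the real org.apache.hudi classes); the point is only that the reduce 
step must return the key of whichever record's payload survived the merge:

```java
import java.util.Objects;

// Hypothetical simplified stand-ins for HoodieKey and HoodieRecord,
// used only to illustrate the dedup/reduce behavior under discussion.
class Key {
    final String recordKey;
    final String partitionPath;
    Key(String recordKey, String partitionPath) {
        this.recordKey = recordKey;
        this.partitionPath = partitionPath;
    }
}

class Record {
    final Key key;
    final String data; // stands in for the merged payload

    Record(Key key, String data) {
        this.key = key;
        this.data = data;
    }
}

public class DedupSketch {
    // Mirrors the patched reduce step: pick the key belonging to the
    // record whose payload won the merge, so key and data stay consistent.
    static Record reduce(Record rec1, Record rec2, String reducedData) {
        Key key = Objects.equals(rec1.data, reducedData) ? rec1.key : rec2.key;
        return new Record(key, reducedData);
    }

    public static void main(String[] args) {
        // Two upserts sharing record key "1" but in different partitions.
        Record r1 = new Record(new Key("1", "2018"), "2019-01-01");
        Record r2 = new Record(new Key("1", "2019"), "2019-02-01");
        // Suppose the payload merge keeps the later update (r2's data).
        Record merged = reduce(r1, r2, r2.data);
        // The surviving key's partition now matches the surviving data.
        System.out.println(merged.key.partitionPath); // prints 2019
    }
}
```

       With the old code (always `rec1.getKey()`), the merged record above 
would carry partition 2018 while holding 2019's data, which is exactly the 
mismatch described in this comment.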




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
