[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #3240: [HUDI-2144]Bug-Fix:Offline clustering(HoodieClusteringJob) will cause insert action losing data

GitBox Sat, 10 Jul 2021 22:32:26 -0700


zhangyue19921010 commented on a change in pull request #3240:
URL: https://github.com/apache/hudi/pull/3240#discussion_r667425069




##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##########
@@ -146,7 +146,7 @@ private int addUpdateBucket(String partitionPath, String 
fileIdHint) {
    * @return smallFiles not in clustering
    */
   private List<SmallFile> filterSmallFilesInClustering(final Set<String> 
pendingClusteringFileGroupsId, final List<SmallFile> smallFiles) {
-    if (this.config.isClusteringEnabled()) {
+    if (this.config.isClusteringEnabled() || 
!pendingClusteringFileGroupsId.isEmpty()) {

Review comment:
       Hi @codope Thanks for your view. 
   
   As I mentioned above.  If the clustering is not enabled during insert but 
there is an offline clustering plan job is running for this table, the 
`pendingClusteringFileGroupsId` set could be non-empty too. 
   
   I just add a UT named 
`testUpsertPartitionerWithSmallFileHandlingAndClusteringPlan` to prove it. I 
set 
`withClusteringConfig(HoodieClusteringConfig.newBuilder().withInlineClustering(false).withAsyncClustering(false).build())`
 during ingestion and create a clustering plan manually out of this ingestion 
scope. 
   
   As you you can see, If the condition is `this.config.isClusteringEnabled()` 
only, there will be two small slices can be inserted including file slice 1 in 
pending clustering plan. 
   
   If the condition is `!pendingClusteringFileGroupsId.isEmpty()`, there will 
be only one small slice can be used and file slice 1 is filtered.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #3240: [HUDI-2144]Bug-Fix:Offline clustering(HoodieClusteringJob) will cause insert action losing data

Reply via email to