[GitHub] [hudi] ksmou commented on a diff in pull request #8944: [HUDI-6359]Spark offline compaction/clustering will never rollback when both requested and inflight states exist

via GitHub Wed, 14 Jun 2023 18:42:55 -0700


ksmou commented on code in PR #8944:
URL: https://github.com/apache/hudi/pull/8944#discussion_r1230325265



##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java:
##########
@@ -209,8 +209,7 @@ private int doCluster(JavaSparkContext jsc) throws Exception {
         // Instant time is not specified
         // Find the earliest scheduled clustering instant for execution
         Option<HoodieInstant> firstClusteringInstant =
-            metaClient.getActiveTimeline().firstInstant(
-                HoodieTimeline.REPLACE_COMMIT_ACTION, HoodieInstant.State.REQUESTED);
+            metaClient.getActiveTimeline().filterPendingReplaceTimeline().firstInstant();

Review Comment:
   > This is intentional because we should not execute a clustering instant which is already inflight. If a replacecommit is inflight and the job failed, the right process is to roll back the inflight clustering to requested state first, see: https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDTableServiceClient.java#L198
   
   It is correct that we should not execute a clustering instant that is already inflight. But if a replacecommit is inflight and the job failed, the variable `firstClusteringInstant` is empty for that failed clustering instant in the next clustering job, which then throws `new HoodieClusteringException("There is no scheduled clustering in the table.")`. As a result, the rollback in `SparkRDDTableServiceClient#cluster` is never called.
   ```java
     Option<HoodieInstant> firstClusteringInstant =
         metaClient.getActiveTimeline().firstInstant(
             HoodieTimeline.REPLACE_COMMIT_ACTION, HoodieInstant.State.REQUESTED);
     ...
     // never reached for the failed (inflight) clustering instant, so its rollback never runs
     Option<HoodieCommitMetadata> commitMetadata = client.cluster(cfg.clusteringInstantTime).getCommitMetadata();
   ```
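   To make the failure path concrete, here is a simplified, self-contained Java sketch. It is not Hudi code: `State`, `Instant`, `firstRequested`, `firstPending`, `cluster` and `runJob` are hypothetical stand-ins for the active timeline, `firstInstant(REPLACE_COMMIT_ACTION, REQUESTED)`, `filterPendingReplaceTimeline().firstInstant()` and `SparkRDDTableServiceClient#cluster`. It assumes the timeline list is already ordered by instant time and models a single offline job only, with no concurrency.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Optional;

   public class PendingClusteringSketch {

       // Hypothetical, simplified instant states (not Hudi's HoodieInstant.State).
       enum State { REQUESTED, INFLIGHT, COMPLETED }

       // Hypothetical stand-in for a replacecommit instant on the timeline.
       static final class Instant {
           final String time;
           State state;

           Instant(String time, State state) {
               this.time = time;
               this.state = state;
           }
       }

       // Old selection: only REQUESTED instants qualify, so a failed INFLIGHT one is skipped.
       // Assumes the list is already ordered by instant time, earliest first.
       static Optional<Instant> firstRequested(List<Instant> timeline) {
           return timeline.stream().filter(i -> i.state == State.REQUESTED).findFirst();
       }

       // New selection: any pending (REQUESTED or INFLIGHT) instant qualifies.
       static Optional<Instant> firstPending(List<Instant> timeline) {
           return timeline.stream().filter(i -> i.state != State.COMPLETED).findFirst();
       }

       // Stand-in for the client's cluster step: an INFLIGHT instant is first rolled
       // back to REQUESTED, then executed and marked COMPLETED.
       static String cluster(Instant instant) {
           if (instant.state == State.INFLIGHT) {
               instant.state = State.REQUESTED; // rollback of the failed attempt
           }
           instant.state = State.INFLIGHT;
           // (the actual clustering work would happen here)
           instant.state = State.COMPLETED;
           return instant.time;
       }

       // One offline job run: pick an instant, fail if none is found, otherwise cluster it.
       static String runJob(List<Instant> timeline, boolean includeInflight) {
           Optional<Instant> first = includeInflight ? firstPending(timeline) : firstRequested(timeline);
           if (!first.isPresent()) {
               throw new IllegalStateException("There is no scheduled clustering in the table.");
           }
           return cluster(first.get());
       }

       public static void main(String[] args) {
           // A previous job failed midway, leaving instant "001" INFLIGHT.
           List<Instant> timeline = new ArrayList<>();
           timeline.add(new Instant("001", State.INFLIGHT));

           try {
               runJob(timeline, false);
           } catch (IllegalStateException e) {
               System.out.println("old selection: " + e.getMessage());
           }
           System.out.println("new selection executed: " + runJob(timeline, true));
       }
   }
   ```

   Running `main` prints `old selection: There is no scheduled clustering in the table.` and then `new selection executed: 001`. With the old selection, the job stops before the rollback step is ever reached; with the pending-aware selection, the inflight instant is rolled back and then executed.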
   
   As **danny0405** said, this change does not work when multiple offline compaction/clustering jobs run concurrently. I will try to figure out how to adjust it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] ksmou commented on a diff in pull request #8944: [HUDI-6359]Spark offline compaction/clustering will never rollback when both requested and inflight states exist

Reply via email to