danny0405 commented on code in PR #8568:
URL: https://github.com/apache/hudi/pull/8568#discussion_r1194859993
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringCommitSink.java:
##########
@@ -179,7 +179,7 @@ private void doCommit(String instant, HoodieClusteringPlan clusteringPlan, List<
TableServiceType.CLUSTER, writeMetadata.getCommitMetadata().get(),
table, instant);
// whether to clean up the input base parquet files used for clustering
- if (!conf.getBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED)) {
+ if (!conf.getBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED) && !isCleaning) {
LOG.info("Running inline clean");
Review Comment:
After some analysis, I found that there is an option
`hoodie.clean.allow.multiple` for multi-writer cleaning, and by default it
is true.
I also found that the clean action executor tries to clean all of the
pending cleaning instants on the timeline when `hoodie.clean.allow.multiple`
is enabled.
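For reference, the option can be turned off through the raw config key. A minimal sketch (the key name is taken from the discussion above; the surrounding `Properties`-based wiring is just an illustration, not the exact Hudi writer-config builder API):

```java
import java.util.Properties;

// Hedged sketch: setting the raw key disables the multi-writer clean
// behavior discussed above (its default is true).
public class CleanConfigSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hoodie.clean.allow.multiple", "false");
    // The writer would pick this property up from its config.
    System.out.println(props.getProperty("hoodie.clean.allow.multiple"));
  }
}
```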
In Flink streaming, however, cleaning is scheduled and executed in a
single-worker thread pool, which means at most one cleaning task runs for a
streaming pipeline.
But for a batch ingestion job, there is a possibility that the `#open`
method and the `#doCommit` method trigger the cleaning in separate threads. Is
that your case here?
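The single-worker guarantee mentioned above can be sketched like this (not Hudi code; the class and method names are made up for illustration, only the serialization property matters):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch: a single-thread executor serializes cleaning tasks,
// so at most one "clean" runs at a time no matter how many callers
// (e.g. #open and #doCommit) submit work.
public class SingleWorkerCleanSketch {
  private static final AtomicInteger running = new AtomicInteger();
  private static final AtomicInteger maxConcurrent = new AtomicInteger();

  static void clean() {
    int now = running.incrementAndGet();
    maxConcurrent.accumulateAndGet(now, Math::max);
    try {
      Thread.sleep(10); // simulate cleaning work
    } catch (InterruptedException ignored) {
    }
    running.decrementAndGet();
  }

  public static void main(String[] args) throws Exception {
    ExecutorService cleanExecutor = Executors.newSingleThreadExecutor();
    for (int i = 0; i < 5; i++) {
      cleanExecutor.submit(SingleWorkerCleanSketch::clean);
    }
    cleanExecutor.shutdown();
    cleanExecutor.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println("max concurrent cleans: " + maxConcurrent.get());
  }
}
```

With two independent submitters (as in the batch case), the same tasks submitted to separate executors would no longer be serialized, which is the race being asked about.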
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]