danny0405 commented on code in PR #8568:
URL: https://github.com/apache/hudi/pull/8568#discussion_r1194859993
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringCommitSink.java:
##########
@@ -179,7 +179,7 @@ private void doCommit(String instant, HoodieClusteringPlan clusteringPlan, List<
TableServiceType.CLUSTER, writeMetadata.getCommitMetadata().get(),
table, instant);
// whether to clean up the input base parquet files used for clustering
- if (!conf.getBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED)) {
+ if (!conf.getBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED) && !isCleaning) {
LOG.info("Running inline clean");
Review Comment:
After some analysis, I found that there is an option
`hoodie.clean.allow.multiple` for multi-writer cleaning, and by default it
is true.
I also found that the clean action executor tries to clean all of the
pending cleaning instants on the timeline when `hoodie.clean.allow.multiple`
is enabled.
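For reference, the option can be turned off through the raw config key. A minimal sketch (the key name is taken from the discussion above; the surrounding `Properties`-based wiring is just an illustration, not the exact Hudi writer-config builder API):

```java
import java.util.Properties;

// Hedged sketch: setting the raw key disables the multi-writer clean
// behavior discussed above (its default is true).
public class CleanConfigSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hoodie.clean.allow.multiple", "false");
    // The writer would pick this property up from its config.
    System.out.println(props.getProperty("hoodie.clean.allow.multiple"));
  }
}
```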
In Flink streaming, however, cleaning is scheduled and executed in a
single-worker thread pool, which means at most one cleaning task runs for a
streaming pipeline.
But for a batch ingestion job, there is a possibility that the `#open`
method and the `#doCommit` method trigger the cleaning in separate threads. Is
that your case here?
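The single-worker guarantee mentioned above can be sketched like this (not Hudi code; the class and method names are made up for illustration, only the serialization property matters):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch: a single-thread executor serializes cleaning tasks,
// so at most one "clean" runs at a time no matter how many callers
// (e.g. #open and #doCommit) submit work.
public class SingleWorkerCleanSketch {
  private static final AtomicInteger running = new AtomicInteger();
  private static final AtomicInteger maxConcurrent = new AtomicInteger();

  static void clean() {
    int now = running.incrementAndGet();
    maxConcurrent.accumulateAndGet(now, Math::max);
    try {
      Thread.sleep(10); // simulate cleaning work
    } catch (InterruptedException ignored) {
    }
    running.decrementAndGet();
  }

  public static void main(String[] args) throws Exception {
    ExecutorService cleanExecutor = Executors.newSingleThreadExecutor();
    for (int i = 0; i < 5; i++) {
      cleanExecutor.submit(SingleWorkerCleanSketch::clean);
    }
    cleanExecutor.shutdown();
    cleanExecutor.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println("max concurrent cleans: " + maxConcurrent.get());
  }
}
```

With two independent submitters (as in the batch case), the same tasks submitted to separate executors would no longer be serialized, which is the race being asked about.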
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]