pnowojski commented on code in PR #21503:
URL: https://github.com/apache/flink/pull/21503#discussion_r1102695388
##########
flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java:
##########
@@ -177,6 +182,14 @@ class SubtaskCheckpointCoordinatorImpl implements SubtaskCheckpointCoordinator {
this.checkpoints = new HashMap<>();
this.lock = new Object();
this.asyncOperationsThreadPool = checkNotNull(asyncOperationsThreadPool);
+ this.asyncDisposeThreadPool =
+ new ThreadPoolExecutor(
+ 0,
+ 4,
+ 60L,
+ TimeUnit.SECONDS,
+ new LinkedBlockingQueue<>(),
+ new ExecutorThreadFactory("AsyncDispose"));
Review Comment:
Fair point. In that case I would just limit the size of the
`asyncOperationsThreadPool` to something like `maxConcurrentCheckpoints + 1`.
It would be sensible to back-pressure newer checkpoints if the system is not
keeping up with deleting old checkpoints. With `maxConcurrentCheckpoints + 1`
we would more or less adhere to the `maxConcurrentCheckpoints` configuration,
while still allowing a small leeway: N ongoing concurrent checkpoints plus the
clean-up of one aborted checkpoint at the same time.
Actually, your current design creates a risk of a resource leak if new files
are created faster than old ones are deleted (in the case of continuously
failing checkpoints).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]