kfaraz commented on code in PR #18060:
URL: https://github.com/apache/druid/pull/18060#discussion_r2120088105


##########
indexing-service/src/test/java/org/apache/druid/indexing/overlord/TaskQueueConcurrencyTest.java:
##########
@@ -77,7 +80,20 @@ public void setUpIngestionTestBase() throws IOException
           @Override
           public ListenableFuture<TaskStatus> run(Task task)
           {
-            return Futures.immediateFuture(TaskStatus.success(task.getId()));
+            // Skip the initialization; we just need to simulate a delayed future
+            Preconditions.checkArgument(task instanceof NoopTask, "task must be an instance of NoopTask");
+            final SettableFuture<TaskStatus> future = SettableFuture.create();
+
+            taskExecutor.submit(() -> {
+              try {
+                TaskStatus status = ((NoopTask) task).runTask(null);
+                future.set(status);
+              }
+              catch (Exception e) {
+                future.setException(e);
+              }
+            });
+            return future;

Review Comment:
   > there's no point in measuring concurrency if you have no control over the 
blocking nature of the callbacks
   
   The point is to verify the blocking relationships between the methods that are
   already asserted in the existing unit tests. 🙂
   
   > Without a way to parameterize the task run time, you effectively negate 
any blocking effects of callbacks (which might reveal concurrency bugs 
themselves)
   
   That is true, but the solution to that would be to add more test cases.
   
   > This test file is currently only testing concurrency correctness between 
"visible" functions being called, but doesn't consider the background callbacks 
which acquire the same locks, etc. that the tested methods do (and hence can 
cause flakes, etc)
   
   Yes, we had missed adding test cases for the callbacks in the original PR.
   Please feel free to add them as needed.
   
   > since they execute immediately, which is not at all how the production 
code runs.
   
   I am not at all opposed to the idea of having the tasks run for a bit (and 
not return immediately)
   to more closely resemble a realistic scenario. But we should do it in a more 
predictable manner so that tests are not flaky.
   
   **Note on the current approach:**
   The `NoopTask` already accepts a run duration parameter. Did you intend to use
   that here? I don't see the run time of any `NoopTask` updated in this patch.
   Instead, the current patch launches a new thread and runs the task on it.
   This approach doesn't enforce whether the task run finishes __before__ or
   __after__ any other specific event being verified in a given test case.
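   As a minimal sketch of that hazard (plain JDK primitives with hypothetical
   names, not the actual test harness; `CompletableFuture` stands in for Guava's
   `SettableFuture`): once the task body runs on its own executor thread, the
   only deterministic synchronization point the caller has is a blocking `get()`,
   and any other observation races with the task thread.

   ```java
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;

   public class UnorderedCompletionSketch {
     // Mirrors the shape of the patch: the task body runs on its own
     // executor thread, so the caller has no guarantee about WHEN the
     // future completes relative to any other event a test asserts on.
     static String runOnce() throws Exception {
       ExecutorService taskExecutor = Executors.newSingleThreadExecutor();
       CompletableFuture<String> future = new CompletableFuture<>();
       taskExecutor.submit(() -> future.complete("SUCCESS"));

       // At this point future.isDone() is a race: it may be true or false
       // depending on thread scheduling. Any assertion made here is flaky.

       String status = future.get(); // blocking get() is the only safe sync point
       taskExecutor.shutdown();
       return status;
     }

     public static void main(String[] args) throws Exception {
       System.out.println(runOnce());
     }
   }
   ```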
   
   > This leads to the flakes, etc. that I'm trying to fix.
   
   Yes, this is my concern exactly. To avoid flakiness, unit tests should behave
   predictably rather than running a task for an unspecified time, as the current
   patch does.
   
   **Preferred approach:**
   My suggestion is to add new test cases that cover the races fixed in this patch.
   For those test cases, return an incomplete future for the task and have the
   unit test complete that future at a specific point before making assertions.
   That way, we can have test cases that say:
   "if a task finishes _before_ event A, B happens"
   or "if a task finishes _after_ event A, C happens and B does not happen"
   
   **Future steps:**
   The Overlord is mired in concurrency problems. We need dedicated test setups
   to catch bugs like these; a plain unit test cannot, and is not meant to,
   achieve that. We have already started thinking about Overlord simulations for
   this exact purpose.
   
   I hope that provides some clarification for the comment. Please let me know 
if you agree.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

