chihsuan opened a new pull request, #10402:
URL: https://github.com/apache/ozone/pull/10402

   ## What changes were proposed in this pull request?
   
   `TestReconTaskControllerImpl` took ~145s to run 17 tests. Profiling each 
method
   showed the time was not spread out: four tests each blocked for almost 
exactly
   **30 seconds** inside `ReconTaskController.stop()`.
   
   ### Root cause
   
   `ReconTaskControllerImpl` runs a single background thread
   (`processBufferedEventsAsync`) that loops on an interruptible
   `eventBuffer.poll(...)` and only exits when the thread is **interrupted**:
   
   ```java
   while (!Thread.currentThread().isInterrupted()) {
       ReconEvent event = eventBuffer.poll(1000);
       ...
   }
   ```
   
   When this loop was introduced (HDDS-13576), `stop()` used `shutdownNow()`, 
which
   interrupts the thread, so it exited immediately. A later reliability change
   (HDDS-13956) replaced that with a graceful shutdown:
   
   ```java
   executor.shutdown();                       // does NOT interrupt running 
tasks
   executor.awaitTermination(30, SECONDS);    // waits the full timeout
   executor.shutdownNow();                     // only now interrupts
   ```
   
   `shutdown()` stops accepting new work but never interrupts the running 
thread,
   and the loop has no non-interrupt exit path. So the graceful phase can never
   drain the loop: `awaitTermination` always burns its full 30s timeout before 
the
   fallback `shutdownNow()` finally stops it. Every `stop()` therefore costs 
~30s.
   This is not test-only: in production it delays Recon shutdown by ~30s and 
logs a
   misleading "did not terminate within 30 seconds" warning on every shutdown.
   
   ### Fix (production)
   
   Give the loop a cooperative exit path: a `volatile boolean running` flag set 
in
   `start()` and cleared in `stop()`. The loop checks it each iteration, so 
after
   `stop()` it exits on the next poll cycle and the existing graceful shutdown
   completes promptly.
   
   We chose the cooperative flag over simply reverting to `shutdownNow()` 
because it
   **preserves the intent of HDDS-13956**: an in-flight event is still allowed 
to
   finish instead of being interrupted mid-processing. The 30s 
`awaitTermination`
   stays as a genuine safety net rather than the normal path.
   
   ### Fix (test)
   
   Two tests (`testProcessReInitializationEventWith*`) call `stop()` on a 
Mockito
   `spy(controller)` to quiesce the background loop. A spy is a shallow copy, 
so it
   flips the `running` flag on the copy while the live thread (started on the
   original controller in `setUp`) keeps running, still hitting the 30s timeout.
   They now stop the original controller. This does not change what the tests
   assert; `stop()` there is only setup to avoid a race with the background 
loop.
   
   ### Scope
   
   Kept intentionally small to fix only the critical issue (the 30s production
   shutdown hang). Two unrelated, lower-impact test-only slowdowns remain and 
are
   left for a follow-up to avoid scope creep:
   - `testNewRetryLogicWithMaxRetriesExceeded` and `testFailedTaskRetryLogic` 
use
     real-clock `Thread.sleep` to wait out `RETRY_DELAY_MS`; speeding these up 
needs
     the retry delay to be injectable.
   
   ### Result
   
   `TestReconTaskControllerImpl`: **~145s → ~26s**, all tests pass, checkstyle 
clean.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-15269
   
   ## How was this patch tested?
   
   - Added `testStopCompletesPromptly`, which fails (~31s) before the fix and 
passes
     (~1s) after, guarding against regression.
   - Ran the full `TestReconTaskControllerImpl` class: 18 tests, 0 failures,
     ~26s (down from ~145s).
   - `mvn -pl hadoop-ozone/recon checkstyle:check`: 0 violations.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to