chihsuan opened a new pull request, #10402:
URL: https://github.com/apache/ozone/pull/10402
## What changes were proposed in this pull request?
`TestReconTaskControllerImpl` took ~145s to run 17 tests. Profiling each
method
showed the time was not spread out: four tests each blocked for almost
exactly
**30 seconds** inside `ReconTaskController.stop()`.
### Root cause
`ReconTaskControllerImpl` runs a single background thread
(`processBufferedEventsAsync`) that loops on an interruptible
`eventBuffer.poll(...)` and only exits when the thread is **interrupted**:
```java
while (!Thread.currentThread().isInterrupted()) {
ReconEvent event = eventBuffer.poll(1000);
...
}
```
When this loop was introduced (HDDS-13576), `stop()` used `shutdownNow()`,
which
interrupts the thread, so it exited immediately. A later reliability change
(HDDS-13956) replaced that with a graceful shutdown:
```java
executor.shutdown(); // does NOT interrupt running
tasks
executor.awaitTermination(30, SECONDS); // waits the full timeout
executor.shutdownNow(); // only now interrupts
```
`shutdown()` stops accepting new work but never interrupts the running
thread,
and the loop has no non-interrupt exit path. So the graceful phase can never
drain the loop: `awaitTermination` always burns its full 30s timeout before
the
fallback `shutdownNow()` finally stops it. Every `stop()` therefore costs
~30s.
This is not test-only: in production it delays Recon shutdown by ~30s and
logs a
misleading "did not terminate within 30 seconds" warning on every shutdown.
### Fix (production)
Give the loop a cooperative exit path: a `volatile boolean running` flag set
in
`start()` and cleared in `stop()`. The loop checks it each iteration, so
after
`stop()` it exits on the next poll cycle and the existing graceful shutdown
completes promptly.
We chose the cooperative flag over simply reverting to `shutdownNow()`
because it
**preserves the intent of HDDS-13956**: an in-flight event is still allowed
to
finish instead of being interrupted mid-processing. The 30s
`awaitTermination`
stays as a genuine safety net rather than the normal path.
### Fix (test)
Two tests (`testProcessReInitializationEventWith*`) call `stop()` on a
Mockito
`spy(controller)` to quiesce the background loop. A spy is a shallow copy,
so it
flips the `running` flag on the copy while the live thread (started on the
original controller in `setUp`) keeps running, still hitting the 30s timeout.
They now stop the original controller. This does not change what the tests
assert; `stop()` there is only setup to avoid a race with the background
loop.
### Scope
Kept intentionally small to fix only the critical issue (the 30s production
shutdown hang). Two unrelated, lower-impact test-only slowdowns remain and
are
left for a follow-up to avoid scope creep:
- `testNewRetryLogicWithMaxRetriesExceeded` and `testFailedTaskRetryLogic`
use
real-clock `Thread.sleep` to wait out `RETRY_DELAY_MS`; speeding these up
needs
the retry delay to be injectable.
### Result
`TestReconTaskControllerImpl`: **~145s → ~26s**, all tests pass, checkstyle
clean.
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15269
## How was this patch tested?
- Added `testStopCompletesPromptly`, which fails (~31s) before the fix and
passes
(~1s) after, guarding against regression.
- Ran the full `TestReconTaskControllerImpl` class: 18 tests, 0 failures,
~26s (down from ~145s).
- `mvn -pl hadoop-ozone/recon checkstyle:check`: 0 violations.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]