Ma77Ball opened a new pull request, #5737:
URL: https://github.com/apache/texera/pull/5737

   ### What changes were proposed in this PR?
   - `RegionExecutionCoordinator.terminateWorkersWithRetry` retried worker 
termination with no limit. A worker that never finished draining (its 
`EndWorker` kept failing) left the region's termination future unresolved, so 
the workflow never reached COMPLETED. This surfaced as an opaque ~1 minute 
timeout in `DataProcessingSpec` with no indication of the stuck region or 
workers.
   - Bounded the retry by a new `maxTerminationAttempts` budget (default 150, 
about 30s at the existing 200ms delay); on exhaustion the termination future 
fails with a descriptive error naming the region and the still-unterminated 
workers, instead of retrying indefinitely.
   - Made `maxTerminationAttempts` and `killRetryDelay` constructor parameters 
with production defaults so the behavior is unit-testable without long waits.
   - Scope note: this is a fail-fast/diagnosability change (it converts a 
silent hang into a fast, explicit failure, matching the pattern in #4683), not 
a guaranteed elimination of the underlying termination flake.
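
For readers skimming the description, the bounded-retry idea above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes Scala 2.13 and `scala.concurrent.Future`, and the object name, the `endWorker` function parameter, the `delayFor` helper, the string-based worker and region identifiers, and the exact error wording beyond "could not be terminated after N attempts" are all hypothetical. Only `maxTerminationAttempts`, `killRetryDelay`, and their defaults (150 attempts, 200ms) come from the description.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, ThreadFactory, TimeUnit}
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object BoundedTerminationSketch {
  // Daemon timer thread so the sketch never keeps the JVM alive.
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "termination-retry-timer")
        t.setDaemon(true)
        t
      }
    })

  // Hypothetical helper: a Future that completes after `delay` without blocking.
  private def delayFor(delay: FiniteDuration): Future[Unit] = {
    val p = Promise[Unit]()
    scheduler.schedule(
      new Runnable { def run(): Unit = p.success(()) },
      delay.toMillis,
      TimeUnit.MILLISECONDS
    )
    p.future
  }

  // `endWorker` stands in for sending EndWorker to one worker (assumption).
  def terminateWithRetry(
      regionId: String,
      workers: Set[String],
      endWorker: String => Future[Unit],
      maxTerminationAttempts: Int = 150, // 150 x 200ms is about 30s
      killRetryDelay: FiniteDuration = 200.millis,
      attempt: Int = 1
  )(implicit ec: ExecutionContext): Future[Unit] = {
    // Try every remaining worker; keep the ids whose EndWorker failed.
    val outcomes = Future.traverse(workers.toSeq) { w =>
      endWorker(w).map(_ => Option.empty[String]).recover { case NonFatal(_) => Some(w) }
    }
    outcomes.flatMap { results =>
      val remaining = results.flatten.toSet
      if (remaining.isEmpty) Future.unit
      else if (attempt >= maxTerminationAttempts)
        // Budget exhausted: fail with the region and stuck workers named.
        Future.failed(
          new IllegalStateException(
            s"Region $regionId: workers ${remaining.toSeq.sorted.mkString(", ")} " +
              s"could not be terminated after $attempt attempts"
          )
        )
      else
        delayFor(killRetryDelay).flatMap { _ =>
          terminateWithRetry(
            regionId, remaining, endWorker,
            maxTerminationAttempts, killRetryDelay, attempt + 1
          )
        }
    }
  }
}
```

Passing a small budget and a short delay (for example `maxTerminationAttempts = 3`, `killRetryDelay = 1.millis`) with an `endWorker` that always fails exercises the exhaustion path quickly, which is the reason the description gives for making both values constructor parameters.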
   ### Any related issues, documentation, discussions?
   Closes: #5614
   ### How was this PR tested?
   - Run `sbt "WorkflowExecutionService/testOnly *RegionExecutionCoordinatorSpec"` 
(under JDK 17); expect 3 tests to succeed and 0 to fail.
   - New test `give up with a descriptive error once the EndWorker retry budget 
is exhausted`: forces `EndWorker` to always fail, then asserts the completion 
future fails with `IllegalStateException` containing "could not be terminated 
after 3 attempts", the coordinator is not marked completed, exactly 3 
`EndWorker` calls were made, and the worker actor ref is retained.
   - Existing tests in the same spec (gracefulStop ordering, transient-failure 
retry-then-succeed) still pass, confirming the normal and transient paths are 
unchanged.
   ### Was this PR authored or co-authored using generative AI tooling?
   Co-authored with Claude Opus 4.8 in compliance with ASF

