j1wonpark opened a new pull request, #4261:
URL: https://github.com/apache/amoro/pull/4261
## Why are the changes needed?
Tables intermittently get stuck during self-optimizing, and the AMS log is
flooded with:
```
TaskRuntimeException: Task has been reset or not yet scheduled
```
The root cause is that the AMS can reset a task an optimizer is still
working on, and then
reject that optimizer's valid response for it.
A task is reset (token cleared, status -> PLANNED) in two places:
- `OptimizerKeeper`, when a `SCHEDULED` task's ack does not arrive within
`optimizer.task-ack-timeout` — **even though the optimizer is still
alive** (the
`SCHEDULED + ackTimeout` branch of `buildSuspendingPredication` does not
check whether the
owning token is still active).
- `resetStaleTasksForThread`, when the same optimizer thread polls again
while one of its tasks
is still `ACKED`.
After the reset, the optimizer's in-flight ack/complete arrives and
`TaskRuntime.validThread`
sees `token == null`, so it throws. This:
- surfaces as an ERROR (`PersistentBase` "failed to commit transaction" plus
the thrift layer), and
- for a completion that lands on a meanwhile-rescheduled task, breaks the
`SCHEDULED -> SUCCESS`
transition with `IllegalTaskStateException`.
In other words, a perfectly valid optimizer response is dropped with a noisy
error.
The issue also reports the table becoming permanently stuck and
uncancelable. That symptom could
not be reproduced from the excerpt logs and may need the full log to
confirm, so this PR uses
"Relates to" rather than "Close".
Note: #4239 lowers the client-side log level for the same exception. This PR
addresses the
server-side root cause instead, and additionally covers the completion path
that #4239 does not.
## Brief change log
`TaskRuntime` now recognizes a stale response by `(token, threadId, status)`
instead of throwing
unconditionally in `validThread`:
- **ack**: if the task was reset/rescheduled, reject it **outside the
transaction** so the
optimizer skips the obsolete round, without the misleading "failed to
commit transaction" error.
The exception message is preserved so existing clients still recognize it.
- **complete**: if stale, ignore it gracefully — the reported run belongs to
a torn-down round and
the task will be re-executed in its current round. This also removes the
`IllegalTaskStateException` variant and the equivalent race on canceled
tasks.
A WARN line is logged in both cases (with status and owner) so the situation
stays observable.
## How was this patch tested?
- [x] Unit reproductions in `TestOptimizingQueue` via the real `pollTask`
path: multi-task
(stale ack is rejected) and single-task (stale completion is ignored).
- [x] End-to-end reproduction in `TestDefaultOptimizingService` driving the
real `OptimizerKeeper`:
a live optimizer (kept alive by heartbeats) leaves a `SCHEDULED` task
unacked past the ack
timeout, the keeper resets it, and the late ack is rejected as expected.
With the fix reverted,
this test reproduces the issue's exact log sequence and stack trace --
same messages and line
numbers, including the misleading "optimizer is expired" log for an
optimizer that is in fact
still alive.
- [x] Full `TestOptimizingQueue` regression passes.
## Documentation
- Does this pull request introduce a new feature? No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]