j1wonpark opened a new pull request, #4261:
URL: https://github.com/apache/amoro/pull/4261

   ## Why are the changes needed?
   
   Tables intermittently get stuck during self-optimizing, and the AMS log is 
flooded with:
   
   ```
   TaskRuntimeException: Task has been reset or not yet scheduled
   ```
   
   The root cause is that the AMS can reset a task an optimizer is still 
working on, and then
   reject that optimizer's valid response for it.
   
   A task is reset (token cleared, status -> PLANNED) in two places:
   
   - `OptimizerKeeper`, when a `SCHEDULED` task's ack does not arrive within
     `optimizer.task-ack-timeout` — **even though the optimizer is still 
alive** (the
     `SCHEDULED + ackTimeout` branch of `buildSuspendingPredication` does not 
check whether the
     owning token is still active).
   - `resetStaleTasksForThread`, when the same optimizer thread polls again 
while one of its tasks
     is still `ACKED`.
   
   After the reset, the optimizer's in-flight ack/complete arrives and 
`TaskRuntime.validThread`
   sees `token == null`, so it throws. This:
   
   - surfaces as an ERROR (`PersistentBase` "failed to commit transaction" plus 
the thrift layer), and
   - for a completion that lands on a meanwhile-rescheduled task, breaks the 
`SCHEDULED -> SUCCESS`
     transition with `IllegalTaskStateException`.
   
   In other words, a perfectly valid optimizer response is dropped with a noisy 
error.
   
   The issue also reports the table becoming permanently stuck and 
uncancelable. That symptom could
   not be reproduced from the excerpt logs and may need the full log to 
confirm, so this PR uses
   "Relates to" rather than "Close".
   
   Note: #4239 lowers the client-side log level for the same exception. This PR 
addresses the
   server-side root cause instead, and additionally covers the completion path 
that #4239 does not.
   
   ## Brief change log
   
   `TaskRuntime` now recognizes a stale response by `(token, threadId, status)` 
instead of throwing
   unconditionally in `validThread`:
   
   - **ack**: if the task was reset/rescheduled, reject it **outside the 
transaction** so the
     optimizer skips the obsolete round, without the misleading "failed to 
commit transaction" error.
     The exception message is preserved so existing clients still recognize it.
   - **complete**: if stale, ignore it gracefully — the reported run belongs to 
a torn-down round and
     the task will be re-executed in its current round. This also removes the
     `IllegalTaskStateException` variant and the equivalent race on canceled 
tasks.
   
   A WARN line is logged in both cases (with status and owner) so the situation 
stays observable.
   
   ## How was this patch tested?
   
   - [x] Unit reproductions in `TestOptimizingQueue` via the real `pollTask` 
path: multi-task
     (stale ack is rejected) and single-task (stale completion is ignored).
   - [x] End-to-end reproduction in `TestDefaultOptimizingService` driving the 
real `OptimizerKeeper`:
     a live optimizer (kept alive by heartbeats) leaves a `SCHEDULED` task 
unacked past the ack
     timeout, the keeper resets it, and the late ack is rejected as expected. 
With the fix reverted,
     this test reproduces the issue's exact log sequence and stack trace -- 
same messages and line
     numbers, including the misleading "optimizer is expired" log for an 
optimizer that is in fact
     still alive.
   - [x] Full `TestOptimizingQueue` regression passes.
   
   ## Documentation
   
   - Does this pull request introduce a new feature? No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to