j1wonpark opened a new issue, #4089:
URL: https://github.com/apache/amoro/issues/4089

   ### What happened?
   
   When AMS is restarted (e.g., during a Kubernetes rolling update), in-flight 
Optimizing Processes for tables never complete. The table remains stuck in 
`MAJOR_OPTIMIZING`/`MINOR_OPTIMIZING`/`FULL_OPTIMIZING` status permanently, 
preventing any new optimization from being scheduled.
   
   **Expected:** After AMS restart, SCHEDULED/ACKED tasks should be 
automatically reset to PLANNED and re-queued, allowing the Optimizing Process 
to complete normally.
   
   **Actual:** SCHEDULED/ACKED tasks are loaded into `taskMap` but never placed 
into `taskQueue`, leaving them permanently orphaned.
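
The expected behavior can be sketched roughly as follows. This is a simplified model, not Amoro's actual code: `TaskRecoverySketch`, `Task`, `Status`, and `loadTasks` are hypothetical names, and only `taskMap`, `taskQueue`, and the status values come from this report. It also assumes that tasks already in PLANNED are queued on load.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Simplified model of task loading after an AMS restart (hypothetical class).
class TaskRecoverySketch {
  enum Status { PLANNED, SCHEDULED, ACKED, SUCCESS }

  static final class Task {
    final String id;
    Status status;

    Task(String id, Status status) {
      this.id = id;
      this.status = status;
    }
  }

  final Map<String, Task> taskMap = new HashMap<>();
  final Queue<Task> taskQueue = new ArrayDeque<>();

  // Loads persisted tasks. In-flight tasks (SCHEDULED/ACKED) are reset to
  // PLANNED, because the optimizer that held them can no longer report back.
  void loadTasks(List<Task> persisted) {
    for (Task task : persisted) {
      taskMap.put(task.id, task);
      if (task.status == Status.SCHEDULED || task.status == Status.ACKED) {
        task.status = Status.PLANNED;
      }
      // Assumption: every PLANNED task must be pollable again after restart.
      if (task.status == Status.PLANNED) {
        taskQueue.add(task);
      }
    }
  }
}
```

With this shape, a task loaded as ACKED ends up in both `taskMap` and `taskQueue`, while finished tasks stay in `taskMap` only.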
   
   
   ### Affects Versions
   
   master
   
   ### What table formats are you seeing the problem on?
   
   Iceberg, Mixed-Iceberg, Paimon, Mixed-Hive
   
   ### What engines are you seeing the problem on?
   
   AMS, Optimizer
   
   ### How to reproduce
   
   1. Register a table in AMS and start an Optimizer
   2. Insert data into the table to trigger self-optimizing
   3. While the Optimizer is executing tasks (tasks in SCHEDULED or ACKED 
state), restart the AMS process
   4. After AMS restarts, the table's Optimizing Process never completes and 
the table stays in `MAJOR_OPTIMIZING` status
   
   ### Relevant log output
   
   ```shell
   No error logs are generated. The tasks silently remain in ACKED/SCHEDULED 
state, with no retry or timeout warning.
   ```
   
   ### Anything else
   
   **Timeline during AMS restart:**
   
   ```
   T0: AMS Pod Terminating (rolling update starts)
       - Optimizer is executing Task T1 (ACKED state)
   
   T1: AMS Pod Down
       - Optimizer completes T1 → completeTask() fails (AMS unavailable)
   
   T2: New AMS Pod Starting
       - Loads old optimizer record from DB (token-A, stale touchTime)
       - registerOptimizer(token-A) → added to authOptimizers
       - loadTaskRuntimes: T1 loaded as ACKED into taskMap only (not re-queued)
   
   T3: OptimizerKeeper processes expired optimizer
       - collectTasks: token-A still in authOptimizers → T1 NOT detected
       - unregisterOptimizer(token-A) → token-A removed
   
   T4: Task T1 permanently stuck in ACKED state
       - No more keeper events (suspendingQueue is empty)
       - allTasksPrepared() → false → table stuck forever
   ```
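
The ordering problem at T3 can be reproduced in a small model. This is only a sketch under assumptions: `KeeperOrderingSketch`, `taskOwners`, and the two wrapper methods are hypothetical, and `collectTasks` is assumed to flag a task only when its owner token is absent from `authOptimizers`, as the timeline describes. The second ordering is shown purely for contrast, not as the intended fix.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of the keeper step at T3 (hypothetical class).
class KeeperOrderingSketch {
  final Set<String> authOptimizers = new HashSet<>();
  // Task id -> token of the optimizer that holds the task.
  final Map<String, String> taskOwners = new HashMap<>();

  // Returns tasks whose owning optimizer is no longer registered.
  List<String> collectTasks() {
    List<String> orphaned = new ArrayList<>();
    for (Map.Entry<String, String> entry : taskOwners.entrySet()) {
      if (!authOptimizers.contains(entry.getValue())) {
        orphaned.add(entry.getKey());
      }
    }
    return orphaned;
  }

  void unregisterOptimizer(String token) {
    authOptimizers.remove(token);
  }

  // Order from the timeline: the token is still registered while tasks are
  // collected, so its tasks are not detected.
  List<String> collectThenUnregister(String token) {
    List<String> found = collectTasks();
    unregisterOptimizer(token);
    return found;
  }

  // Contrasting order (an assumption, for illustration only).
  List<String> unregisterThenCollect(String token) {
    unregisterOptimizer(token);
    return collectTasks();
  }
}
```

With `token-A` registered and task `T1` owned by `token-A`, `collectThenUnregister("token-A")` returns an empty list, whereas `unregisterThenCollect("token-A")` returns a list containing `T1`. After the first ordering no registered optimizer owns `T1`, and nothing triggers another collection.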
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's Code of Conduct

