lintingbin opened a new pull request, #4224:
URL: https://github.com/apache/amoro/pull/4224

   ## Why are the changes needed?
   
   Close #4223.
   
   `IcebergProcessFactory.recover()` was a stub that unconditionally threw 
`RecoverProcessFailedException` for every action. Because 
`ProcessService.recoverProcesses()` had no per-record error handling, the 
exception (wrapped as `IllegalStateException` by `DefaultActionCoordinator`) 
propagated out of the synchronous AMS startup path:
   
   `AmoroServiceContainer.main` → `transitionToLeader` → 
`startOptimizingService` → `DefaultTableService.initialize` → 
`RuntimeHandlerChain.initialize` → `ProcessService.initialize` → 
`recoverProcesses()`.
   
   As a result, after an AMS restart, a single `SUBMITTED`/`RUNNING` Iceberg 
`expire-snapshots` or `clean-orphan-files` record in `table_process` made AMS 
fail to start and enter a crash loop with no automatic recovery. This is a 
regression introduced by the recent ProcessFactory refactor (#4107, #4209).
   
   Reported stack trace:
   
   ```
   java.lang.IllegalStateException: Failed to recover table process for format 
ICEBERG,
     action org.apache.amoro.Action@86552c00, table 
iceberg_catalog.ods.xxxxx(tableId=10)
     at 
org.apache.amoro.server.table.DefaultActionCoordinator.recoverTableProcess(DefaultActionCoordinator.java:118)
   Caused by: org.apache.amoro.process.RecoverProcessFailedException:
     Unsupported action for IcebergProcessFactory: 
org.apache.amoro.Action@86552c00
     at 
org.apache.amoro.server.process.iceberg.IcebergProcessFactory.recover(IcebergProcessFactory.java:102)
   ```
   
   ## Brief change log
   
   - **`IcebergProcessFactory.recover()`**: implemented. It now dispatches on 
the persisted action and rebuilds `SnapshotsExpiringProcess` / 
`OrphanFilesCleaningProcess` bound to the local execution engine. Both are 
stateless, idempotent one-shot local maintenance tasks (no checkpoint), so 
re-running them on recovery is safe. An unsupported action or a missing local 
engine still throws `RecoverProcessFailedException` with a clear message.
   - **`ProcessService.recoverProcesses()`**: hardened. Each record is now 
recovered in its own `try/catch`; on failure it is logged, the offending record 
is marked `FAILED` and skipped, so one un-recoverable record can no longer 
abort the whole AMS startup. The affected periodic maintenance action is 
re-scheduled normally by its scheduler.
   - **`Action.toString()`**: added, returning the action name, so diagnostic 
logs print a readable name (e.g. `EXPIRE-SNAPSHOTS`) instead of 
`org.apache.amoro.Action@86552c00`.
   
   ## How was this patch tested?
   
   - [x] Add some test cases that check the changes thoroughly including 
negative and positive cases if possible
     - `TestIcebergProcessFactory`: recover of `expire-snapshots` / 
`clean-orphan-files` returns the right process bound to the local engine; 
unsupported action and missing local engine throw 
`RecoverProcessFailedException`.
     - `TestAction`: `toString()` returns the action name.
     - `TestDefaultProcessService#testRecoverProcessFailSafe`: a coordinator 
whose `recover()` always fails no longer aborts `recoverProcesses()`; the bad 
record is marked `FAILED` and a subsequent restart neither throws nor re-picks 
it.
   - [x] Run test locally before making a pull request (`amoro-common` + 
`amoro-ams` affected tests pass; spotless & checkstyle clean)
   
   ## Documentation
   
   - Does this pull request introduce a new feature? (no)
   - If yes, how is the feature documented? (not applicable)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to