lintingbin opened a new pull request, #4224:
URL: https://github.com/apache/amoro/pull/4224
## Why are the changes needed?
Close #4223.
`IcebergProcessFactory.recover()` was a stub that unconditionally threw
`RecoverProcessFailedException` for every action. Because
`ProcessService.recoverProcesses()` had no per-record error handling, the
exception (wrapped as `IllegalStateException` by `DefaultActionCoordinator`)
propagated out of the synchronous AMS startup path:
`AmoroServiceContainer.main` → `transitionToLeader` →
`startOptimizingService` → `DefaultTableService.initialize` →
`RuntimeHandlerChain.initialize` → `ProcessService.initialize` →
`recoverProcesses()`.
As a result, after an AMS restart, a single `SUBMITTED`/`RUNNING` Iceberg
`expire-snapshots` or `clean-orphan-files` record in `table_process` made AMS
fail to start and enter a crash loop with no automatic recovery. This is a
regression introduced by the recent ProcessFactory refactor (#4107, #4209).
Reported stack trace:
```
java.lang.IllegalStateException: Failed to recover table process for format
ICEBERG,
action org.apache.amoro.Action@86552c00, table
iceberg_catalog.ods.xxxxx(tableId=10)
at
org.apache.amoro.server.table.DefaultActionCoordinator.recoverTableProcess(DefaultActionCoordinator.java:118)
Caused by: org.apache.amoro.process.RecoverProcessFailedException:
Unsupported action for IcebergProcessFactory:
org.apache.amoro.Action@86552c00
at
org.apache.amoro.server.process.iceberg.IcebergProcessFactory.recover(IcebergProcessFactory.java:102)
```
## Brief change log
- **`IcebergProcessFactory.recover()`**: implemented. It now dispatches on
the persisted action and rebuilds `SnapshotsExpiringProcess` /
`OrphanFilesCleaningProcess` bound to the local execution engine. Both are
stateless, idempotent one-shot local maintenance tasks (no checkpoint), so
re-running them on recovery is safe. An unsupported action or a missing local
engine still throws `RecoverProcessFailedException` with a clear message.
- **`ProcessService.recoverProcesses()`**: hardened. Each record is now
recovered in its own `try/catch`; on failure it is logged, the offending record
is marked `FAILED` and skipped, so one un-recoverable record can no longer
abort the whole AMS startup. The affected periodic maintenance action is
re-scheduled normally by its scheduler.
- **`Action.toString()`**: added, returning the action name, so diagnostic
logs print a readable name (e.g. `EXPIRE-SNAPSHOTS`) instead of
`org.apache.amoro.Action@86552c00`.
## How was this patch tested?
- [x] Add some test cases that check the changes thoroughly including
negative and positive cases if possible
- `TestIcebergProcessFactory`: recover of `expire-snapshots` /
`clean-orphan-files` returns the right process bound to the local engine;
unsupported action and missing local engine throw
`RecoverProcessFailedException`.
- `TestAction`: `toString()` returns the action name.
- `TestDefaultProcessService#testRecoverProcessFailSafe`: a coordinator
whose `recover()` always fails no longer aborts `recoverProcesses()`; the bad
record is marked `FAILED` and a subsequent restart neither throws nor re-picks
it.
- [x] Run test locally before making a pull request (`amoro-common` +
`amoro-ams` affected tests pass; spotless & checkstyle clean)
## Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]