j1wonpark opened a new pull request, #4254:
URL: https://github.com/apache/amoro/pull/4254
## Why are the changes needed?
Close #4253.
This is the first implementation phase of **AIP-5: Dynamic Resource
Allocation for Optimizer** (#4191). It lays the configuration foundation that
the later scaling phases build on. There is **no runtime scaling behavior in
this PR** — it is purely opt-in configuration plumbing, so it is safe to merge
independently and has no effect on existing groups
(`dynamic-allocation.enabled` defaults to `false`).
Splitting AIP-5 into self-contained PRs follows the design's five linear
phases; demand-driven scale-up, idle tracking, scale-down + graceful drain, and
the new metrics land in subsequent PRs. The full set of `dynamic-allocation.*`
properties is declared here (rather than drip-fed per phase) so the public
configuration surface and its validation are settled in one place: a group
created between phases cannot carry an unvalidated DRA property. A few getters
(the timeout durations, `max-parallelism`) therefore have no caller until the
phase that consumes them — this is intentional, not dead code.
## Brief change log
- **`OptimizerProperties`**: declared the `dynamic-allocation.*` constants
(each with a `@since` tag) — `enabled`, `min-parallelism`, `max-parallelism` (+
the `1024` hard-cap constant), `scheduler-backlog-timeout`,
`sustained-backlog-timeout`, `executor-idle-timeout` (+ its `30s` minimum),
`scale-down-cooldown`, `drain-timeout`. Deprecated the flat `min-parallelism`
in favor of the namespaced `dynamic-allocation.min-parallelism`; it is still
honored as a fallback.
- **`DynamicAllocationConfig`** (new): `parse(ResourceGroup)` builds a
typed, immutable config from the group's properties; `validate()` enforces the
AIP-5 constraints **only when DRA is enabled** — `max-parallelism` is
mandatory, `min-parallelism ≤ max-parallelism ≤ 1024` (an over-limit value is
rejected, not silently clamped), `executor-idle-timeout ≥ 30s`, all durations
positive, and DRA is rejected on an externally-registered optimizer
(conservative check this phase: the `external` container, which AMS cannot
scale). `resolveMinParallelism()` centralizes the namespaced → legacy → `0`
resolution; a one-off deprecation warning is emitted at config-entry points,
not on the keeper hot path.
- **`OptimizerGroupController`**: `createResourceGroup` /
`updateResourceGroup` now reject an invalid DRA config with **HTTP 400** before
persisting. (The update path previously had no property-value validation at
all.)
- **`DefaultOptimizingService`**:
- Startup load is **fail-safe, never silent**: a persisted group with an
invalid DRA config no longer crashes AMS — it logs a warning and falls back to
DRA-disabled behavior. (The `optimizer_group_config_invalid` gauge is wired in
the observability phase.)
- `getMinParallelism()` now delegates to `resolveMinParallelism()`,
preserving the existing lenient behavior while honoring the namespaced key.
- The keeper's existing `min-parallelism` auto-reset (after
`optimizer-group.max-keeping-attempts` failed creations) is adjusted for the
new property model, per the AIP's Compatibility section:
- It is **skipped when DRA is effectively enabled** (opted-in *and*
valid) — DRA owns the group's scale decisions, so the keeper no longer erodes
its `min-parallelism` floor and keeps retrying instead. An invalid config
counts as disabled, matching the startup fail-safe.
- When it does apply, it now writes **whichever key the group actually
reads** (namespaced if present, else the legacy flat key). Previously it always
wrote the flat key; for a group using the namespaced property that write was
shadowed by resolution order, turning auto-reset into an endless no-op loop
(repeated warning + DB update with no effect). This is fixed here.
## How was this patch tested?
- [x] Add some test cases that check the changes thoroughly including
negative and positive cases if possible
- `TestDynamicAllocationConfig`: rejection of
enabled-without-`max-parallelism`, `max < min`, over-`1024`, sub-`30s` idle
timeout, unparsable duration, zero/non-positive duration, and the `external`
container; correct parsing of a valid config; the namespaced → legacy → `0`
resolution order; `isEffectivelyEnabled` (valid+enabled → true, disabled →
false, enabled+invalid → false) and `effectiveMinParallelismKey`.
- `TestOptimizerGroupController`: invalid create/update return 400, valid
create succeeds.
- `TestOptimizerGroupKeeper`: auto-reset is skipped when DRA is enabled
(the floor is preserved); a group using the namespaced key has the reset
written to that key and no deprecated flat key is introduced; the existing
legacy-group auto-reset scenarios still pass (no regression).
- [ ] Add screenshots for manual tests if appropriate (not applicable — no
UI change)
- [x] Run test locally before making a pull request (affected `amoro-common`
+ `amoro-ams` tests pass — 35 tests green; spotless and checkstyle clean)
## Documentation
- Does this pull request introduce a new feature? (yes — AIP-5 foundation,
opt-in and disabled by default)
- If yes, how is the feature documented? (JavaDocs — `@since` tags and
`DynamicAllocationConfig` doc comments; the full design is in AIP-5 / #4191,
and user-facing configuration docs land with the phase that first exposes
runtime behavior)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]