j1wonpark opened a new pull request, #4254:
URL: https://github.com/apache/amoro/pull/4254

   ## Why are the changes needed?
   
   Close #4253.
   
   This is the first implementation phase of **AIP-5: Dynamic Resource 
Allocation for Optimizer** (#4191). It lays the configuration foundation that 
the later scaling phases build on. There is **no runtime scaling behavior in 
this PR** — it is purely opt-in configuration plumbing, so it is safe to merge 
independently and has no effect on existing groups 
(`dynamic-allocation.enabled` defaults to `false`).
   
   Splitting AIP-5 into self-contained PRs follows the design's five linear 
phases; demand-driven scale-up, idle tracking, scale-down + graceful drain, and 
the new metrics land in subsequent PRs. The full set of `dynamic-allocation.*` 
properties is declared here (rather than drip-fed per phase) so the public 
configuration surface and its validation are settled in one place: a group 
created between phases cannot carry an unvalidated DRA property. A few getters 
(the timeout durations, `max-parallelism`) therefore have no caller until the 
phase that consumes them — this is intentional, not dead code.
   
   ## Brief change log
   
   - **`OptimizerProperties`**: declared the `dynamic-allocation.*` constants 
(each with a `@since` tag) — `enabled`, `min-parallelism`, `max-parallelism` (+ 
the `1024` hard-cap constant), `scheduler-backlog-timeout`, 
`sustained-backlog-timeout`, `executor-idle-timeout` (+ its `30s` minimum), 
`scale-down-cooldown`, `drain-timeout`. Deprecated the flat `min-parallelism` 
in favor of the namespaced `dynamic-allocation.min-parallelism`; it is still 
honored as a fallback.
   
   - **`DynamicAllocationConfig`** (new): `parse(ResourceGroup)` builds a 
typed, immutable config from the group's properties; `validate()` enforces the 
AIP-5 constraints **only when DRA is enabled** — `max-parallelism` is 
mandatory, `min-parallelism ≤ max-parallelism ≤ 1024` (an over-limit value is 
rejected, not silently clamped), `executor-idle-timeout ≥ 30s`, all durations 
positive, and DRA is rejected on an externally-registered optimizer 
(conservative check this phase: the `external` container, which AMS cannot 
scale). `resolveMinParallelism()` centralizes the namespaced → legacy → `0` 
resolution; a one-off deprecation warning is emitted at config-entry points, 
not on the keeper hot path.
   
   - **`OptimizerGroupController`**: `createResourceGroup` / 
`updateResourceGroup` now reject an invalid DRA config with **HTTP 400** before 
persisting. (The update path previously had no property-value validation at 
all.)
   
   - **`DefaultOptimizingService`**:
     - Startup load is **fail-safe, never silent**: a persisted group with an 
invalid DRA config no longer crashes AMS — it logs a warning and falls back to 
DRA-disabled behavior. (The `optimizer_group_config_invalid` gauge is wired in 
the observability phase.)
     - `getMinParallelism()` now delegates to `resolveMinParallelism()`, 
preserving the existing lenient behavior while honoring the namespaced key.
     - The keeper's existing `min-parallelism` auto-reset (after 
`optimizer-group.max-keeping-attempts` failed creations) is adjusted for the 
new property model, per the AIP's Compatibility section:
       - It is **skipped when DRA is effectively enabled** (opted-in *and* 
valid) — DRA owns the group's scale decisions, so the keeper no longer erodes 
its `min-parallelism` floor and keeps retrying instead. An invalid config 
counts as disabled, matching the startup fail-safe.
       - When it does apply, it now writes **whichever key the group actually 
reads** (namespaced if present, else the legacy flat key). Previously it always 
wrote the flat key; for a group using the namespaced property that write was 
shadowed by resolution order, turning auto-reset into an endless no-op loop 
(repeated warning + DB update with no effect). This is fixed here.
   
   ## How was this patch tested?
   
   - [x] Add some test cases that check the changes thoroughly including 
negative and positive cases if possible
     - `TestDynamicAllocationConfig`: rejection of 
enabled-without-`max-parallelism`, `max < min`, over-`1024`, sub-`30s` idle 
timeout, unparsable duration, zero/non-positive duration, and the `external` 
container; correct parsing of a valid config; the namespaced → legacy → `0` 
resolution order; `isEffectivelyEnabled` (valid+enabled → true, disabled → 
false, enabled+invalid → false) and `effectiveMinParallelismKey`.
     - `TestOptimizerGroupController`: invalid create/update return 400, valid 
create succeeds.
     - `TestOptimizerGroupKeeper`: auto-reset is skipped when DRA is enabled 
(the floor is preserved); a group using the namespaced key has the reset 
written to that key and no deprecated flat key is introduced; the existing 
legacy-group auto-reset scenarios still pass (no regression).
   
   - [ ] Add screenshots for manual tests if appropriate (not applicable — no 
UI change)
   
   - [x] Run test locally before making a pull request (affected `amoro-common` 
+ `amoro-ams` tests pass — 35 tests green; spotless and checkstyle clean)
   
   ## Documentation
   
   - Does this pull request introduce a new feature? (yes — AIP-5 foundation, 
opt-in and disabled by default)
   - If yes, how is the feature documented? (JavaDocs — `@since` tags and 
`DynamicAllocationConfig` doc comments; the full design is in AIP-5 / #4191, 
and user-facing configuration docs land with the phase that first exposes 
runtime behavior)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to