j1wonpark opened a new issue, #4253:
URL: https://github.com/apache/amoro/issues/4253

   ## Search before asking
   
   - [x] I have searched in the 
[issues](https://github.com/apache/amoro/issues?q=is%3Aissue) and found no 
similar issues.
   
   ## Description
   
   This is the first implementation phase of **AIP-5: Dynamic Resource 
Allocation for Optimizer**. It lays the configuration foundation that the later 
scaling phases build on, with no runtime scaling behavior yet. Dynamic 
allocation is opt-in and disabled by default (`dynamic-allocation.enabled = 
false`), so existing groups are unaffected and this phase is safe to merge 
independently. The full `dynamic-allocation.*` property set is declared here so 
the configuration surface and its validation are settled in one place, rather 
than drip-fed per phase.
   
   Scope:
   
   1. **Configuration properties.** Declare the `dynamic-allocation.*` 
properties as constants in `OptimizerProperties`, each annotated with `@since`:
      - `dynamic-allocation.enabled` (default `false`)
      - `dynamic-allocation.min-parallelism` (default `0`)
      - `dynamic-allocation.max-parallelism` (required when enabled; 
hard-capped at `1024`)
      - `dynamic-allocation.scheduler-backlog-timeout` (default `1min`)
      - `dynamic-allocation.sustained-backlog-timeout` (default `30s`)
      - `dynamic-allocation.executor-idle-timeout` (default `5min`, minimum 
`30s`)
      - `dynamic-allocation.scale-down-cooldown` (default `1min`)
      - `dynamic-allocation.drain-timeout` (default `15min`)
   
   2. **`DynamicAllocationConfig` parsing + validation.** A new 
`DynamicAllocationConfig.parse(ResourceGroup)` / `validate()` that enforces:
      - `max-parallelism` is mandatory when enabled;
      - `min-parallelism ≤ max-parallelism ≤ 1024`;
      - `executor-idle-timeout ≥ 30s`;
      - all durations parse to positive values;
      - DRA is not enabled on an externally-registered optimizer (conservative 
check in this phase: reject the `external` container).
   
   3. **Failure handling.**
      - **REST create/update — fail fast.** 
`OptimizerGroupController.createResourceGroup` / `updateResourceGroup` reject 
an invalid DRA config with **HTTP 400**. (The update path previously had no 
property-value validation at all.)
      - **Startup load — fail safe, never silent.** A persisted group with an 
invalid DRA config no longer crashes AMS; it logs a warning and falls back to 
DRA-disabled behavior. (The `optimizer_group_config_invalid` gauge is wired in 
the observability phase.)
   
   4. **`min-parallelism` deprecation.** The flat `min-parallelism` property is 
deprecated in favor of `dynamic-allocation.min-parallelism` but still honored. 
Resolution order: namespaced → legacy → `0`. A one-off deprecation warning is 
logged at config-entry points (startup load, REST), not on the keeper hot path.
   
   5. **Auto-reset compatibility.** The keeper already auto-resets a group's 
`min-parallelism` after `optimizer-group.max-keeping-attempts` consecutive 
failed optimizer creations. Two adjustments keep this consistent with the new 
property model:
      - **Disabled when DRA is effectively enabled.** Once a group opts into 
dynamic allocation (and its config is valid), DRA owns the group's scale 
decisions, so the keeper no longer erodes its `min-parallelism` floor — it 
keeps retrying instead. "Effectively enabled" means opted-in *and* valid; an 
invalid config is treated as disabled, matching the startup fail-safe, so such 
groups still auto-reset as before.
      - **Writes the key the group actually uses.** When auto-reset does apply, 
it now writes whichever `min-parallelism` key the group reads from (namespaced 
if present, else the legacy flat key). Previously it always wrote the flat key; 
for a group using the namespaced property that write was shadowed by resolution 
order, turning auto-reset into an endless no-op loop (repeated warning + DB 
update with no effect).
   
   Out of scope for this phase (later phases): demand-driven scale-up, idle 
tracking, scale-down + graceful drain, and the new metrics. The 
`config_invalid` gauge and `scale_*_total` counters land in the observability 
phase.
   
   ## Parent issue
   
   AIP-5: https://github.com/apache/amoro/issues/4191
   
   ## Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ## Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to