wu-sheng opened a new pull request, #13909: URL: https://github.com/apache/skywalking/pull/13909
### Fix runtime-rule (MAL/LAL hot-update) schema changes in `no-init` mode, and the runtime-rule cluster node-identity collision on Kubernetes

- [x] Add a unit test to verify that the fix works.
- [x] Explain briefly why the bug exists and how to fix it.

Two bugs in the runtime-rule (DSL hot-update) cluster path, both confirmed end-to-end on a local kind cluster:

**1. Runtime-rule schema changes were inoperative in `no-init` mode**, the mode every production OAP cluster runs (a one-shot `-Dmode=init` Job creates the static schema; the OAP Deployment runs `-Dmode=no-init`). A runtime `addOrUpdate` introducing a new metric blocked forever in the storage installer's init-node poll loop (`ModelInstaller.whenCreating`), because the loop was gated on `RunningMode` rather than on the operation's intent. The `/delete?mode=revertToBundled` recreate and BanyanDB in-place shape updates were dead for the same reason.

**Fix:** a new `StorageManipulationOpt.Flags.deferDDLToInitNode` bit, set only on the static boot-time `schemaCreateIfAbsent()` opt (DRYed into `ModelInstaller.deferDDLToInitNode(opt)`, reused by the BanyanDB shape-check / group-DDL gates). The runtime-rule opts (`withSchemaChange` / `verifySchemaOnly` / `withoutSchemaChange`) are now driven by their flags and by cluster main-ness: `no-init` and `default` no longer differ for DSL DDL, and `init` stays the dedicated initializer. `DSLManager.tickStorageOpt` is collapsed accordingly.

**2. Runtime-rule cross-node writes failed with `HTTP 400 forward_self_loop` on a multi-replica Kubernetes cluster.** Every OAP replica shared the cluster `selfNodeId` `0.0.0.0_11800` (derived from the `0.0.0.0` agent gRPC bind host via `TelemetryRelatedContext`), so the main's self-loop guard rejected a legitimate peer-to-peer Forward as if it had looped back.
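The identity collision in bug 2 can be reproduced in a few lines. The following is a minimal sketch, not the code in this PR: the class `NodeIdentitySketch` and the helpers `telemetryId`, `resolveNodeId`, and `isSelfLoop` are hypothetical names introduced for illustration. Only the `0.0.0.0_11800` id shape and the `SKYWALKING_COLLECTOR_UID` variable come from the description. It assumes the self-loop guard is a plain equality check on node ids.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: none of these names are taken from the SkyWalking code base.
final class NodeIdentitySketch {

    // Assumed shape of the telemetry-derived id: "<bind host>_<port>".
    static String telemetryId(String bindHost, int port) {
        return bindHost + "_" + port;
    }

    // Prefer the per-pod UID when it is set; otherwise fall back to the telemetry id.
    static String resolveNodeId(Map<String, String> env, String bindHost, int port) {
        String uid = env.get("SKYWALKING_COLLECTOR_UID");
        if (uid != null && !uid.trim().isEmpty()) {
            return uid.trim();
        }
        return telemetryId(bindHost, port);
    }

    // Assumed guard: a forwarded request is a loop if the sender id equals our own id.
    static boolean isSelfLoop(String selfNodeId, String senderNodeId) {
        return selfNodeId.equals(senderNodeId);
    }

    public static void main(String[] args) {
        // Two replicas, both bound to 0.0.0.0:11800, no pod UID available.
        Map<String, String> noUid = new HashMap<>();
        String mainId = resolveNodeId(noUid, "0.0.0.0", 11800);
        String peerId = resolveNodeId(noUid, "0.0.0.0", 11800);
        System.out.println("no UID:   " + mainId + " vs " + peerId
                + " selfLoop=" + isSelfLoop(mainId, peerId));

        // Same replicas, each with its own (made-up) pod UID.
        Map<String, String> envA = new HashMap<>();
        envA.put("SKYWALKING_COLLECTOR_UID", "pod-uid-a");
        Map<String, String> envB = new HashMap<>();
        envB.put("SKYWALKING_COLLECTOR_UID", "pod-uid-b");
        String mainUid = resolveNodeId(envA, "0.0.0.0", 11800);
        String peerUid = resolveNodeId(envB, "0.0.0.0", 11800);
        System.out.println("with UID: " + mainUid + " vs " + peerUid
                + " selfLoop=" + isSelfLoop(mainUid, peerUid));
    }
}
```

With a shared bind host the two replicas are indistinguishable to the guard, so a genuine peer Forward is classified as a loop. Any identifier that is unique per replica removes the false positive without touching routing.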
**Fix:** resolve the runtime-rule node identity from the unique per-pod `SKYWALKING_COLLECTOR_UID` (the pod UID injected by the helm chart / swck operator from `metadata.uid`) in `start()`, before any apply; off Kubernetes it falls back to the telemetry id. `MainRouter` already routes correctly off the cluster peer addresses (pod IPs); only the self-loop identity needed to be unique.

**Tests:** new `ModelInstallerNoInitTest` (UT) for the no-init create chokepoint. The runtime-rule cluster e2e is converted from docker-compose (default mode, which never exercised either bug) to a kind + skywalking-helm `no-init` cluster (`oap.replicas=2`) driving the apply / STRUCTURAL / inactivate / delete lifecycle, cross-node convergence, and the cross-node Forward path.

- [ ] If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #<issue number>.
- [x] Update the [`CHANGES` log](https://github.com/apache/skywalking/blob/master/docs/en/changes/changes.md).
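The gating change behind fix 1 can be sketched as a small decision function. This is a simplified model under stated assumptions, not the actual `ModelInstaller` or `StorageManipulationOpt` code: the class `DeferDdlSketch`, the nested `Opt` type, and the factory names `staticBoot` / `runtimeRule` are hypothetical. It assumes the init-node wait applies only when the node runs in `no-init` mode and the operation carries the `deferDDLToInitNode` flag. Cluster main-ness, which the description also mentions, is deliberately left out.

```java
// Illustrative only: a toy model of "gate on the operation's intent, not on the running mode alone".
final class DeferDdlSketch {

    enum RunningMode { DEFAULT, INIT, NO_INIT }

    // Hypothetical stand-in for the operation options; only one flag is modeled.
    static final class Opt {
        final boolean deferDDLToInitNode;

        private Opt(boolean deferDDLToInitNode) {
            this.deferDDLToInitNode = deferDDLToInitNode;
        }

        // Models the static boot-time opt, the only one that carries the flag.
        static Opt staticBoot() {
            return new Opt(true);
        }

        // Models a runtime-rule opt that intends to change schema itself.
        static Opt runtimeRule() {
            return new Opt(false);
        }
    }

    // Old behavior as described: the wait depended on the running mode only.
    static boolean waitsForInitNodeOld(RunningMode mode) {
        return mode == RunningMode.NO_INIT;
    }

    // Assumed new behavior: the wait also requires the operation to ask for it.
    static boolean waitsForInitNode(RunningMode mode, Opt opt) {
        return mode == RunningMode.NO_INIT && opt.deferDDLToInitNode;
    }

    public static void main(String[] args) {
        for (RunningMode mode : RunningMode.values()) {
            System.out.println(mode
                    + " old=" + waitsForInitNodeOld(mode)
                    + " boot=" + waitsForInitNode(mode, Opt.staticBoot())
                    + " runtime=" + waitsForInitNode(mode, Opt.runtimeRule()));
        }
    }
}
```

In this model a runtime-rule operation never enters the wait, so `no-init` and `default` behave the same for it, while the static boot path in `no-init` still defers to the init node.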
