wu-sheng opened a new pull request, #13911:
URL: https://github.com/apache/skywalking/pull/13911

   ### Fix BanyanDB runtime-rule schema-cache flood + v2 MAL CounterWindow 
collision and Elvis falsy semantics
   - [x] Add a unit test to verify that the fix works.
   - [x] Explain briefly why the bug exists and how to fix it.
   
   Bundles related correctness fixes surfaced while validating BanyanDB 
self-observability against the live demo.
   
   **1. BanyanDB schema-cache self-heal.** Peer nodes flooded `<metric> is not 
registered` when a node held a live persist worker but never populated its 
local `MetadataRegistry` schema cache for that model (a `withoutSchemaChange` 
peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but 
skipped the populate; the registry never evicts and the 30s reconcile only 
covers runtime-rule rows). The persist DAOs now self-heal a missing entry once 
with an RPC-free local re-derivation (`MetadataRegistry.repopulateLocally`) 
before failing, and the no-init defer poll loop retries a transient backend 
probe error (`isRetryableNoInitProbeFailure` — default `false`, BanyanDB opts 
in for transient gRPC codes) instead of crash-looping the pod.
   
   **2. v2 MAL `CounterWindow` key collision.** `rate()` / `increase()` / 
`irate()` keyed each counter's sliding window on the rule's output metric name 
(the same for every input metric of a rule) instead of the counter's own name. 
Two or more counters that reduce to the same label set after `.sum(...)` 
therefore shared one window slot and computed rates against each other's values 
— fabricating non-zero rates from unchanged counters (the BanyanDB liaison gRPC 
error rate read a steady non-zero off three frozen error counters). Fixed by 
keying on the counter's own metric name. `BanyanDBErrorRateReproTest` 
reproduces it with the real frozen values (966 → 0 after the fix).
   
   **3. v2 MAL Elvis `?:` falsy semantics.** Compiled to 
`Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only on 
`null`, so an empty-string primary kept `""` (a BanyanDB liaison 
`ServiceInstance` stored `node_type=""` rather than `n/a`, because 
`.sum([...,'node_type'])` fills an absent group-by label with `""`). Now 
single-evaluated through `MalRuntimeHelper.elvis` / `isTruthy`, matching Groovy 
falsy (null, false, numeric zero, empty string/collection/map/array). 
`MALElvisFalsyTest` covers empty/null/non-empty/side-effecting primaries.
   
   **4. banyandb otel-rules.** `PT15S` → `PT1M` rate window to match the 
collector scrape / OAP minute-bucket cadence (MAL `rate()` is a two-point 
`CounterWindow` delta, not PromQL).
   
   - [ ] If this pull request closes/resolves/fixes an existing issue, replace 
the issue number. Closes #<issue number>.
   - [x] Update the [`CHANGES` 
log](https://github.com/apache/skywalking/blob/master/docs/en/changes/changes.md).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to