This is an automated email from the ASF dual-hosted git repository. wu-sheng pushed a commit to branch dsl-debugging-terminal-value-and-cap in repository https://gitbox.apache.org/repos/asf/skywalking.git
commit f62e511f3a4bb98957ba21fdcaa1704c32580489 Author: Wu Sheng <[email protected]> AuthorDate: Sat May 9 15:49:09 2026 +0800 DSL debugger (MAL): surface terminal payload.value (with two-phase calculate); reduce session record cap to 100 - MAL terminal output sample now carries payload.value rendered per holder type: Long/Int/Double scalars as JSON numbers (NaN / +-Infinity as strings), LabeledValueHolder DataTables as JSON object {key: long}. Force Metrics.calculate() at probe time for two-phase functions (AvgHistogramPercentileFunction / SumHistogramPercentileFunction) so the captured value carries the percentile result keyed by p=<rank> rather than the empty pre-calculate map. - Per-session record cap dropped from 10000 (default 1000) to 100 hard cap = default. ~1 MiB worst-case heap per session and a UI-readable page; operators inspect a handful of executions, not a paginated firehose. SessionLimits constructor still rejects out-of-range with IllegalArgumentException -> 400 invalid_limits. - Both changes are debug-only and zero-cost when no session is installed: the captureMeterEmit probe gate is JIT-elided on hot paths, calculate() is idempotent via the function's isCalculated guard, and combine() resets isCalculated=false so cluster-merged state still recomputes on read. - Updated the MAL admin-API doc with a concrete payload example and the SWIP-13 reference cap line / heap math. - E2E MAL flow asserts payload.value is present and numeric on every terminal output sample. - All 16 existing dsl-debugging unit tests pass. --- .../setup/backend/admin-api/dsl-debugging-mal.md | 12 ++- docs/en/swip/SWIP-13.md | 23 ++--- .../dsl/debugging/mal/MALDebugRecorderImpl.java | 97 ++++++++++++++++++++++ .../debugging/session/AbstractDebugRecorder.java | 2 +- .../admin/dsl/debugging/session/SessionLimits.java | 13 +-- .../session/DebugSessionRegistryTest.java | 2 +- .../cases/dsl-debugging/mal/dsl-debug-flow.sh | 15 ++++ 7 files changed, 144 insertions(+), 20 deletions(-) diff --git a/docs/en/setup/backend/admin-api/dsl-debugging-mal.md b/docs/en/setup/backend/admin-api/dsl-debugging-mal.md index d0200485d2..ac3ed75167 100644 --- a/docs/en/setup/backend/admin-api/dsl-debugging-mal.md +++ b/docs/en/setup/backend/admin-api/dsl-debugging-mal.md @@ -181,7 +181,17 @@ A trimmed slice (one record = one scrape window): { "type": "output", "sourceText": "e2e_demo_filtered_requests", "continueOn": true, - "payload": { /* terminal meter sample — metric, entity, value, timeBucket */ } } + "payload": { + "metric": "e2e_demo_filtered_requests", + "entity": "MeterEntity(scopeType=SERVICE, serviceName=my-svc, …)", + "valueType": "sum", + "timeBucket": 202605091036, + "value": 42 /* shape depends on valueType: + number for Sum/Avg/Max/Min/CPM/Latest…, + object {bucket: count} for histograms / + *Labeled functions, omitted for non-scalar + holders. NaN/±Infinity render as strings. */ + } } ] }] }] diff --git a/docs/en/swip/SWIP-13.md b/docs/en/swip/SWIP-13.md index ec22fd5c5e..2ad5c796ce 100644 --- a/docs/en/swip/SWIP-13.md +++ b/docs/en/swip/SWIP-13.md @@ -191,7 +191,7 @@ object but hasn't yet entered the streaming/aggregation kernel. `kind`, `component`) and in a multi-tenant or multi-component flow a rule routinely rejects 99% of the traffic routed to it by metric name. Publishing every rejected execution would burn `recordCap` -(default 1000) on noise within seconds and never reach a row that +(default 100) on noise within seconds and never reach a row that demonstrates the rule's actual processing. So for MAL the contract is: **every record represents one `SampleFamily` that passed the rule's filter and walked through to `meterEmit`**. Implementation: @@ -2434,7 +2434,7 @@ disappearance. | Cap | Default | Mechanism | |--------------------------------------|--------------------------------------|------------------------------------------------------------------------| | **Max active sessions per node** | **200** | hard ceiling; `POST /dsl-debugging/session` returns 429 too_many_sessions when full | -| **Records per session** | **default 1000, hard cap 10000** | recorder stops appending once the count is hit and moves the session to CAPTURED. Out-of-range request returns 400 invalid_limits. | +| **Records per session** | **default 100, hard cap 100** | recorder stops appending once the count is hit and moves the session to CAPTURED. Out-of-range request returns 400 invalid_limits. The cap is deliberately small — operators inspect a handful of executions, not a paginated firehose; per-session heap stays at ~1 MiB and the rendered UI page stays readable. | | Capture window (retention) | default 5 min, hard cap 1 hour | per-session retention timeout; sweeper drops the payload at window end | | Capture-call cost when idle | one volatile-load + branch (gate) | JIT eliminates the call site when gate is false | @@ -2452,19 +2452,19 @@ captures. #### Why 200 active sessions is the right ceiling 200 sessions is a defensible upper bound for an OAP node. With the -hard `recordCap = 10000` and a typical per-record JSON payload of +hard `recordCap = 100` and a typical per-record JSON payload of ~10 KiB (MAL is the largest of the three), the worst-case footprint across all active sessions on one node is roughly: ``` worst-case heap ≈ MAX_ACTIVE_SESSIONS × MAX_RECORD_CAP × per-record-bytes - ≈ 200 × 10 000 × 10 KiB ≈ 20 GiB (theoretical) + ≈ 200 × 100 × 10 KiB ≈ 200 MiB (theoretical) ``` -That's the worst-case product; realistic usage is one to two orders of -magnitude below because most captures stop on retention (5 min default) -or operator stop long before hitting `recordCap`, and most records are -several KiB rather than the worst-case 10. The 200 default leaves +That's the worst-case product; realistic usage is well below it +because most captures stop on retention (5 min default) or operator +stop long before hitting `recordCap`, and most records are several +KiB rather than the worst-case 10. The 200 default leaves headroom for many concurrent operators (one debug context = one sessionId; the UI maintains a single session per debug widget and reuses it for polls) without letting a runaway script exhaust the @@ -2591,9 +2591,10 @@ knobs — they are hard-coded SWIP contract values: receiving node also runs the prior-session cleanup pass for the supplied `clientId` BEFORE counting toward this ceiling, so a single client repeatedly clicking Start sampling cannot itself trigger 429. -- `MAX_RECORD_CAP = 10000` per session — request bodies asking for more - return `400 invalid_limits`. Session retention is similarly capped - at 1 hour (`MAX_RETENTION_MILLIS`). +- `MAX_RECORD_CAP = 100` per session — request bodies asking for more + return `400 invalid_limits`. The default also resolves to 100 (capped + by the hard ceiling). Session retention is similarly capped at 1 hour + (`MAX_RETENTION_MILLIS`). There is intentionally **no per-session byte cap and no structural char / sample / label sub-caps**. Operators sizing for memory pressure diff --git a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/mal/MALDebugRecorderImpl.java b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/mal/MALDebugRecorderImpl.java index 800362c2c1..efd15c0a15 100644 --- a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/mal/MALDebugRecorderImpl.java +++ b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/mal/MALDebugRecorderImpl.java @@ -28,6 +28,12 @@ import org.apache.skywalking.oap.server.admin.dsl.debugging.session.SessionLimit import org.apache.skywalking.oap.server.core.analysis.meter.MeterEntity; import org.apache.skywalking.oap.server.core.analysis.meter.function.AcceptableValue; import org.apache.skywalking.oap.server.core.analysis.meter.function.MeterFunction; +import org.apache.skywalking.oap.server.core.analysis.metrics.DataTable; +import org.apache.skywalking.oap.server.core.analysis.metrics.DoubleValueHolder; +import org.apache.skywalking.oap.server.core.analysis.metrics.IntValueHolder; +import org.apache.skywalking.oap.server.core.analysis.metrics.LabeledValueHolder; +import org.apache.skywalking.oap.server.core.analysis.metrics.LongValueHolder; +import org.apache.skywalking.oap.server.core.analysis.metrics.Metrics; import org.apache.skywalking.oap.server.core.dsldebug.GateHolder; import org.apache.skywalking.oap.server.core.dsldebug.RuleKey; @@ -185,9 +191,100 @@ public final class MALDebugRecorderImpl extends AbstractDebugRecorder obj.addProperty("entity", entity == null ? null : entity.toString()); obj.addProperty("valueType", resolveFunctionName(value)); obj.addProperty("timeBucket", timeBucket); + appendValue(obj, value); return obj.toString(); } + /** + * Surface the metric's terminal reading on the captured {@code output} + * sample so the operator sees the actual MAL emission, not just the + * function name. The shape of {@code "value"} depends on the holder + * the generated function class implements: + * <ul> + * <li>{@link LongValueHolder} / {@link IntValueHolder} → JSON number + * (Sum, Avg, Max, Min, Latest, SumPerMin, …).</li> + * <li>{@link DoubleValueHolder} → JSON number; non-finite values + * (NaN, ±Infinity) render as the corresponding string so the + * wire stays valid JSON and the reading is still visible.</li> + * <li>{@link LabeledValueHolder} → JSON object {@code {key: long}} + * — labeled metrics ({@code *Labeled}) and histogram/percentile + * functions ({@code AvgHistogramPercentileFunction}, + * {@code SumHistogramPercentileFunction}) whose reading is a + * {@link DataTable}. Keys are label combos for {@code *Labeled}, + * {@code p=<rank>} entries for percentile functions.</li> + * </ul> + * If {@code value} is null or not one of the recognised holders the + * field is omitted; the operator still sees {@code valueType} and can + * tell from the function name that the shape is non-scalar. + * + * <p><b>Two-phase functions.</b> Some functions split work between + * {@code accept()} and {@code calculate()} — accept() populates raw + * aggregates (e.g. {@code summation} + {@code count} for histogram- + * percentile), and calculate() turns those into the user-visible + * field returned by {@code getValue()} (e.g. {@code percentileValues}). + * The MAL {@code captureMeterEmit} probe fires AFTER accept() but + * BEFORE the streaming pipeline calls calculate(), so without forcing + * calculate() here the labeled value column would be an empty map for + * histogram-percentile rules — exactly when operators most need to + * verify what the rule emits. We force calculate() at probe time so + * the captured value matches what the storage row will contain. This + * is safe and zero-cost when debug is off: + * <ul> + * <li>The probe site itself is gated; when no operator has installed + * a session the call into this method is dead code that the JIT + * elides on hot-path inlining.</li> + * <li>{@code calculate()} is idempotent via the function's + * {@code isCalculated} guard, so the streaming pipeline's later + * call is a no-op rather than double work.</li> + * <li>Cross-node {@code combine()} resets {@code isCalculated=false}, + * so the post-merge calculate on combined state still happens + * on read — pre-computing here on the local snapshot doesn't + * leak stale values into the cluster.</li> + * </ul> + */ + private static void appendValue(final JsonObject obj, final AcceptableValue<?> value) { + if (value == null) { + return; + } + // Force two-phase functions to compute their user-visible reading + // before we read getValue(). See the javadoc above for why this is + // safe and only paid when a debug session is active. + if (value instanceof Metrics) { + ((Metrics) value).calculate(); + } + // Order matters: LabeledValueHolder is checked before scalar holders + // because some labeled functions could in principle implement both; + // the labeled (DataTable) view is the operator-meaningful one. + if (value instanceof LabeledValueHolder) { + final DataTable table = ((LabeledValueHolder) value).getValue(); + if (table == null) { + return; + } + final JsonObject map = new JsonObject(); + for (final String key : table.keys()) { + final Long v = table.get(key); + if (v != null) { + map.addProperty(key, v); + } + } + obj.add("value", map); + } else if (value instanceof LongValueHolder) { + obj.addProperty("value", ((LongValueHolder) value).getValue()); + } else if (value instanceof IntValueHolder) { + obj.addProperty("value", ((IntValueHolder) value).getValue()); + } else if (value instanceof DoubleValueHolder) { + final double v = ((DoubleValueHolder) value).getValue(); + if (Double.isFinite(v)) { + obj.addProperty("value", v); + } else { + // Gson rejects NaN / ±Infinity as numbers — surface them as + // strings so an operator inspecting a divide-by-zero or + // empty-window emit can still see the actual reading. + obj.addProperty("value", Double.toString(v)); + } + } + } + /** * Resolve the human-readable MAL function name (e.g. {@code sum}, * {@code avg}, {@code maxHistogram}) by walking the class hierarchy diff --git a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/AbstractDebugRecorder.java b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/AbstractDebugRecorder.java index 41e741e836..3014171a47 100644 --- a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/AbstractDebugRecorder.java +++ b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/AbstractDebugRecorder.java @@ -167,7 +167,7 @@ public abstract class AbstractDebugRecorder implements DebugRecorder { * predicate for OAL) decides whether the rule actually processes it. * On tag-discriminating rules the rejection ratio can hit 99% — most * traffic isn't relevant to the rule. If we published rejected - * executions, recordCap (default 1000) would fill with garbage before + * executions, recordCap (default 100) would fill with garbage before * any meaningful "what did the rule actually do" record showed up. So we * publish only kept executions: every record in {@code records[]} * represents one source/family that survived the rule's filter and diff --git a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/SessionLimits.java b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/SessionLimits.java index 6920d7abc1..1cdaf6c7f6 100644 --- a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/SessionLimits.java +++ b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/SessionLimits.java @@ -40,12 +40,13 @@ public final class SessionLimits { /** * Hard upper bound on per-session retained records. The recorder appends - * one record per probe pass and stops once the count reaches the cap; - * payloads can be ~10 KiB for richly-tagged MAL/LAL flows, so 10k records - * caps the per-session heap footprint at ~100 MiB even before the - * captured-at-cap auto-detach kicks in. + * one record per probe pass and stops once the count reaches the cap. + * Payloads can be ~10 KiB for richly-tagged MAL/LAL flows, so 100 records + * caps the per-session heap footprint at ~1 MiB and keeps the rendered + * UI page readable — operators inspect a handful of executions, not a + * paginated firehose, so a small cap is the right product shape. */ - public static final int MAX_RECORD_CAP = 10_000; + public static final int MAX_RECORD_CAP = 100; /** * Hard upper bound on the per-session retention window (1 hour). Sessions @@ -55,7 +56,7 @@ public final class SessionLimits { public static final long MAX_RETENTION_MILLIS = 60L * 60 * 1000; public static final SessionLimits DEFAULT = - new SessionLimits(1_000, 5L * 60 * 1000, Granularity.DEFAULT); + new SessionLimits(MAX_RECORD_CAP, 5L * 60 * 1000, Granularity.DEFAULT); private final int recordCap; private final long retentionMillis; diff --git a/oap-server/server-admin/dsl-debugging/src/test/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/DebugSessionRegistryTest.java b/oap-server/server-admin/dsl-debugging/src/test/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/DebugSessionRegistryTest.java index b6941fcb1a..bb9ea1763d 100644 --- a/oap-server/server-admin/dsl-debugging/src/test/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/DebugSessionRegistryTest.java +++ b/oap-server/server-admin/dsl-debugging/src/test/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/DebugSessionRegistryTest.java @@ -186,7 +186,7 @@ public class DebugSessionRegistryTest { final RuleKey key = new RuleKey(Catalog.OTEL_RULES, "vm", "cpu"); final GateHolder holder = new GateHolder("hash-1"); final DebugSessionRegistry registry = registryWith(holder, key); - final SessionLimits shortRetention = new SessionLimits(1_000, 1L); + final SessionLimits shortRetention = new SessionLimits(SessionLimits.MAX_RECORD_CAP, 1L); registry.install(key, "client-a", shortRetention); Thread.sleep(5); diff --git a/test/e2e-v2/cases/dsl-debugging/mal/dsl-debug-flow.sh b/test/e2e-v2/cases/dsl-debugging/mal/dsl-debug-flow.sh index f4e8f23c27..267234d760 100755 --- a/test/e2e-v2/cases/dsl-debugging/mal/dsl-debug-flow.sh +++ b/test/e2e-v2/cases/dsl-debugging/mal/dsl-debug-flow.sh @@ -199,6 +199,21 @@ last_with_metric="$(echo "${collect_body}" | jq --arg n "${METRIC_NAME}" \ [ "${last_with_metric}" -gt 0 ] \ || fail "no execution closes with a meterEmit sample carrying payload.metric=${METRIC_NAME}" +# meterEmit must surface the actual reading on payload.value. The seed rule +# expression is `request_count + pool_size + decoy.sum(['service_name'])` — +# all three operands are LongValueHolder Sum/Avg outputs, so the reading +# resolves to a JSON number (not a labeled DataTable, not NaN). Asserting +# the field is present and numeric proves operators see the actual MAL +# emission, not just the function name. +missing_value="$(echo "${collect_body}" | jq --arg n "${METRIC_NAME}" \ + '[.nodes[].records[] | .samples[-1] + | select(.payload.metric == $n) + | select(has("payload")) + | select((.payload.value == null) or ((.payload.value | type) != "number"))] + | length')" +[ "${missing_value}" = "0" ] \ + || fail "${missing_value} terminal meterEmit sample(s) missing numeric payload.value" + log "✓ MAL shape valid (${records_count} records, ${total_samples} samples)" # --- Phase 6: stop session ------------------------------------------------------------
