This is an automated email from the ASF dual-hosted git repository.

wu-sheng pushed a commit to branch dsl-debugging-terminal-value-and-cap
in repository https://gitbox.apache.org/repos/asf/skywalking.git

commit f62e511f3a4bb98957ba21fdcaa1704c32580489
Author: Wu Sheng <[email protected]>
AuthorDate: Sat May 9 15:49:09 2026 +0800

    DSL debugger (MAL): surface terminal payload.value (with two-phase 
calculate); reduce session record cap to 100
    
    - MAL terminal output sample now carries payload.value rendered per holder
      type: Long/Int/Double scalars as JSON numbers (NaN / +-Infinity as 
strings),
      LabeledValueHolder DataTables as JSON object {key: long}. Force
      Metrics.calculate() at probe time for two-phase functions
      (AvgHistogramPercentileFunction / SumHistogramPercentileFunction) so the
      captured value carries the percentile result keyed by p=<rank> rather
      than the empty pre-calculate map.
    - Per-session record cap dropped from 10000 (default 1000) to 100 hard
      cap = default. ~1 MiB worst-case heap per session and a UI-readable
      page; operators inspect a handful of executions, not a paginated
      firehose. SessionLimits constructor still rejects out-of-range with
      IllegalArgumentException -> 400 invalid_limits.
    - Both changes are debug-only and zero-cost when no session is installed:
      the captureMeterEmit probe gate is JIT-elided on hot paths, calculate()
      is idempotent via the function's isCalculated guard, and combine()
      resets isCalculated=false so cluster-merged state still recomputes on
      read.
    - Updated the MAL admin-API doc with a concrete payload example and the
      SWIP-13 reference cap line / heap math.
    - E2E MAL flow asserts payload.value is present and numeric on every
      terminal output sample.
    - All 16 existing dsl-debugging unit tests pass.
---
 .../setup/backend/admin-api/dsl-debugging-mal.md   | 12 ++-
 docs/en/swip/SWIP-13.md                            | 23 ++---
 .../dsl/debugging/mal/MALDebugRecorderImpl.java    | 97 ++++++++++++++++++++++
 .../debugging/session/AbstractDebugRecorder.java   |  2 +-
 .../admin/dsl/debugging/session/SessionLimits.java | 13 +--
 .../session/DebugSessionRegistryTest.java          |  2 +-
 .../cases/dsl-debugging/mal/dsl-debug-flow.sh      | 15 ++++
 7 files changed, 144 insertions(+), 20 deletions(-)

diff --git a/docs/en/setup/backend/admin-api/dsl-debugging-mal.md 
b/docs/en/setup/backend/admin-api/dsl-debugging-mal.md
index d0200485d2..ac3ed75167 100644
--- a/docs/en/setup/backend/admin-api/dsl-debugging-mal.md
+++ b/docs/en/setup/backend/admin-api/dsl-debugging-mal.md
@@ -181,7 +181,17 @@ A trimmed slice (one record = one scrape window):
         { "type": "output",
           "sourceText": "e2e_demo_filtered_requests",
           "continueOn": true,
-          "payload": { /* terminal meter sample — metric, entity, value, 
timeBucket */ } }
+          "payload": {
+            "metric": "e2e_demo_filtered_requests",
+            "entity": "MeterEntity(scopeType=SERVICE, serviceName=my-svc, …)",
+            "valueType": "sum",
+            "timeBucket": 202605091036,
+            "value": 42                /* shape depends on valueType:
+                                          number for 
Sum/Avg/Max/Min/CPM/Latest…,
+                                          object {bucket: count} for 
histograms /
+                                          *Labeled functions, omitted for 
non-scalar
+                                          holders. NaN/±Infinity render as 
strings. */
+          } }
       ]
     }]
   }]
diff --git a/docs/en/swip/SWIP-13.md b/docs/en/swip/SWIP-13.md
index ec22fd5c5e..2ad5c796ce 100644
--- a/docs/en/swip/SWIP-13.md
+++ b/docs/en/swip/SWIP-13.md
@@ -191,7 +191,7 @@ object but hasn't yet entered the streaming/aggregation 
kernel.
 `kind`, `component`) and in a multi-tenant or multi-component flow a
 rule routinely rejects 99% of the traffic routed to it by metric
 name. Publishing every rejected execution would burn `recordCap`
-(default 1000) on noise within seconds and never reach a row that
+(default 100) on noise within seconds and never reach a row that
 demonstrates the rule's actual processing. So for MAL the contract
 is: **every record represents one `SampleFamily` that passed the
 rule's filter and walked through to `meterEmit`**. Implementation:
@@ -2434,7 +2434,7 @@ disappearance.
 | Cap                                  | Default                              
| Mechanism                                                              |
 
|--------------------------------------|--------------------------------------|------------------------------------------------------------------------|
 | **Max active sessions per node**     | **200**                              
| hard ceiling; `POST /dsl-debugging/session` returns 429 too_many_sessions 
when full |
-| **Records per session**              | **default 1000, hard cap 10000**     
| recorder stops appending once the count is hit and moves the session to 
CAPTURED. Out-of-range request returns 400 invalid_limits. |
+| **Records per session**              | **default 100, hard cap 100**        
| recorder stops appending once the count is hit and moves the session to 
CAPTURED. Out-of-range request returns 400 invalid_limits. The cap is 
deliberately small — operators inspect a handful of executions, not a paginated 
firehose; per-session heap stays at ~1 MiB and the rendered UI page stays 
readable. |
 | Capture window (retention)           | default 5 min, hard cap 1 hour       
| per-session retention timeout; sweeper drops the payload at window end |
 | Capture-call cost when idle          | one volatile-load + branch (gate)    
| JIT eliminates the call site when gate is false                        |
 
@@ -2452,19 +2452,19 @@ captures.
 #### Why 200 active sessions is the right ceiling
 
 200 sessions is a defensible upper bound for an OAP node. With the
-hard `recordCap = 10000` and a typical per-record JSON payload of
+hard `recordCap = 100` and a typical per-record JSON payload of
 ~10 KiB (MAL is the largest of the three), the worst-case footprint
 across all active sessions on one node is roughly:
 
 ```
 worst-case heap ≈ MAX_ACTIVE_SESSIONS × MAX_RECORD_CAP × per-record-bytes
-                ≈ 200 × 10 000 × 10 KiB ≈ 20 GiB (theoretical)
+                ≈ 200 × 100 × 10 KiB ≈ 200 MiB (theoretical)
 ```
 
-That's the worst-case product; realistic usage is one to two orders of
-magnitude below because most captures stop on retention (5 min default)
-or operator stop long before hitting `recordCap`, and most records are
-several KiB rather than the worst-case 10. The 200 default leaves
+That's the worst-case product; realistic usage is well below it
+because most captures stop on retention (5 min default) or operator
+stop long before hitting `recordCap`, and most records are several
+KiB rather than the worst-case 10. The 200 default leaves
 headroom for many concurrent operators (one debug context = one
 sessionId; the UI maintains a single session per debug widget and
 reuses it for polls) without letting a runaway script exhaust the
@@ -2591,9 +2591,10 @@ knobs — they are hard-coded SWIP contract values:
   receiving node also runs the prior-session cleanup pass for the
   supplied `clientId` BEFORE counting toward this ceiling, so a single
   client repeatedly clicking Start sampling cannot itself trigger 429.
-- `MAX_RECORD_CAP = 10000` per session — request bodies asking for more
-  return `400 invalid_limits`. Session retention is similarly capped
-  at 1 hour (`MAX_RETENTION_MILLIS`).
+- `MAX_RECORD_CAP = 100` per session — request bodies asking for more
+  return `400 invalid_limits`. The default also resolves to 100 (capped
+  by the hard ceiling). Session retention is similarly capped at 1 hour
+  (`MAX_RETENTION_MILLIS`).
 
 There is intentionally **no per-session byte cap and no structural
 char / sample / label sub-caps**. Operators sizing for memory pressure
diff --git 
a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/mal/MALDebugRecorderImpl.java
 
b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/mal/MALDebugRecorderImpl.java
index 800362c2c1..efd15c0a15 100644
--- 
a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/mal/MALDebugRecorderImpl.java
+++ 
b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/mal/MALDebugRecorderImpl.java
@@ -28,6 +28,12 @@ import 
org.apache.skywalking.oap.server.admin.dsl.debugging.session.SessionLimit
 import org.apache.skywalking.oap.server.core.analysis.meter.MeterEntity;
 import 
org.apache.skywalking.oap.server.core.analysis.meter.function.AcceptableValue;
 import 
org.apache.skywalking.oap.server.core.analysis.meter.function.MeterFunction;
+import org.apache.skywalking.oap.server.core.analysis.metrics.DataTable;
+import 
org.apache.skywalking.oap.server.core.analysis.metrics.DoubleValueHolder;
+import org.apache.skywalking.oap.server.core.analysis.metrics.IntValueHolder;
+import 
org.apache.skywalking.oap.server.core.analysis.metrics.LabeledValueHolder;
+import org.apache.skywalking.oap.server.core.analysis.metrics.LongValueHolder;
+import org.apache.skywalking.oap.server.core.analysis.metrics.Metrics;
 import org.apache.skywalking.oap.server.core.dsldebug.GateHolder;
 import org.apache.skywalking.oap.server.core.dsldebug.RuleKey;
 
@@ -185,9 +191,100 @@ public final class MALDebugRecorderImpl extends 
AbstractDebugRecorder
         obj.addProperty("entity", entity == null ? null : entity.toString());
         obj.addProperty("valueType", resolveFunctionName(value));
         obj.addProperty("timeBucket", timeBucket);
+        appendValue(obj, value);
         return obj.toString();
     }
 
+    /**
+     * Surface the metric's terminal reading on the captured {@code output}
+     * sample so the operator sees the actual MAL emission, not just the
+     * function name. The shape of {@code "value"} depends on the holder
+     * the generated function class implements:
+     * <ul>
+     *   <li>{@link LongValueHolder} / {@link IntValueHolder} → JSON number
+     *       (Sum, Avg, Max, Min, Latest, SumPerMin, …).</li>
+     *   <li>{@link DoubleValueHolder} → JSON number; non-finite values
+     *       (NaN, ±Infinity) render as the corresponding string so the
+     *       wire stays valid JSON and the reading is still visible.</li>
+     *   <li>{@link LabeledValueHolder} → JSON object {@code {key: long}}
+     *       — labeled metrics ({@code *Labeled}) and histogram/percentile
+     *       functions ({@code AvgHistogramPercentileFunction},
+     *       {@code SumHistogramPercentileFunction}) whose reading is a
+     *       {@link DataTable}. Keys are label combos for {@code *Labeled},
+     *       {@code p=<rank>} entries for percentile functions.</li>
+     * </ul>
+     * If {@code value} is null or not one of the recognised holders the
+     * field is omitted; the operator still sees {@code valueType} and can
+     * tell from the function name that the shape is non-scalar.
+     *
+     * <p><b>Two-phase functions.</b> Some functions split work between
+     * {@code accept()} and {@code calculate()} — accept() populates raw
+     * aggregates (e.g. {@code summation} + {@code count} for histogram-
+     * percentile), and calculate() turns those into the user-visible
+     * field returned by {@code getValue()} (e.g. {@code percentileValues}).
+     * The MAL {@code captureMeterEmit} probe fires AFTER accept() but
+     * BEFORE the streaming pipeline calls calculate(), so without forcing
+     * calculate() here the labeled value column would be an empty map for
+     * histogram-percentile rules — exactly when operators most need to
+     * verify what the rule emits. We force calculate() at probe time so
+     * the captured value matches what the storage row will contain. This
+     * is safe and zero-cost when debug is off:
+     * <ul>
+     *   <li>The probe site itself is gated; when no operator has installed
+     *       a session the call into this method is dead code that the JIT
+     *       elides on hot-path inlining.</li>
+     *   <li>{@code calculate()} is idempotent via the function's
+     *       {@code isCalculated} guard, so the streaming pipeline's later
+     *       call is a no-op rather than double work.</li>
+     *   <li>Cross-node {@code combine()} resets {@code isCalculated=false},
+     *       so the post-merge calculate on combined state still happens
+     *       on read — pre-computing here on the local snapshot doesn't
+     *       leak stale values into the cluster.</li>
+     * </ul>
+     */
+    private static void appendValue(final JsonObject obj, final 
AcceptableValue<?> value) {
+        if (value == null) {
+            return;
+        }
+        // Force two-phase functions to compute their user-visible reading
+        // before we read getValue(). See the javadoc above for why this is
+        // safe and only paid when a debug session is active.
+        if (value instanceof Metrics) {
+            ((Metrics) value).calculate();
+        }
+        // Order matters: LabeledValueHolder is checked before scalar holders
+        // because some labeled functions could in principle implement both;
+        // the labeled (DataTable) view is the operator-meaningful one.
+        if (value instanceof LabeledValueHolder) {
+            final DataTable table = ((LabeledValueHolder) value).getValue();
+            if (table == null) {
+                return;
+            }
+            final JsonObject map = new JsonObject();
+            for (final String key : table.keys()) {
+                final Long v = table.get(key);
+                if (v != null) {
+                    map.addProperty(key, v);
+                }
+            }
+            obj.add("value", map);
+        } else if (value instanceof LongValueHolder) {
+            obj.addProperty("value", ((LongValueHolder) value).getValue());
+        } else if (value instanceof IntValueHolder) {
+            obj.addProperty("value", ((IntValueHolder) value).getValue());
+        } else if (value instanceof DoubleValueHolder) {
+            final double v = ((DoubleValueHolder) value).getValue();
+            if (Double.isFinite(v)) {
+                obj.addProperty("value", v);
+            } else {
+                // Gson rejects NaN / ±Infinity as numbers — surface them as
+                // strings so an operator inspecting a divide-by-zero or
+                // empty-window emit can still see the actual reading.
+                obj.addProperty("value", Double.toString(v));
+            }
+        }
+    }
+
     /**
      * Resolve the human-readable MAL function name (e.g. {@code sum},
      * {@code avg}, {@code maxHistogram}) by walking the class hierarchy
diff --git 
a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/AbstractDebugRecorder.java
 
b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/AbstractDebugRecorder.java
index 41e741e836..3014171a47 100644
--- 
a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/AbstractDebugRecorder.java
+++ 
b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/AbstractDebugRecorder.java
@@ -167,7 +167,7 @@ public abstract class AbstractDebugRecorder implements 
DebugRecorder {
      * predicate for OAL) decides whether the rule actually processes it.
      * On tag-discriminating rules the rejection ratio can hit 99% — most
      * traffic isn't relevant to the rule. If we published rejected
-     * executions, recordCap (default 1000) would fill with garbage before
+     * executions, recordCap (default 100) would fill with garbage before
      * any meaningful "what did the rule actually do" record showed up. So we
      * publish only kept executions: every record in {@code records[]}
      * represents one source/family that survived the rule's filter and
diff --git 
a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/SessionLimits.java
 
b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/SessionLimits.java
index 6920d7abc1..1cdaf6c7f6 100644
--- 
a/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/SessionLimits.java
+++ 
b/oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/SessionLimits.java
@@ -40,12 +40,13 @@ public final class SessionLimits {
 
     /**
      * Hard upper bound on per-session retained records. The recorder appends
-     * one record per probe pass and stops once the count reaches the cap;
-     * payloads can be ~10 KiB for richly-tagged MAL/LAL flows, so 10k records
-     * caps the per-session heap footprint at ~100 MiB even before the
-     * captured-at-cap auto-detach kicks in.
+     * one record per probe pass and stops once the count reaches the cap.
+     * Payloads can be ~10 KiB for richly-tagged MAL/LAL flows, so 100 records
+     * caps the per-session heap footprint at ~1 MiB and keeps the rendered
+     * UI page readable — operators inspect a handful of executions, not a
+     * paginated firehose, so a small cap is the right product shape.
      */
-    public static final int MAX_RECORD_CAP = 10_000;
+    public static final int MAX_RECORD_CAP = 100;
 
     /**
      * Hard upper bound on the per-session retention window (1 hour). Sessions
@@ -55,7 +56,7 @@ public final class SessionLimits {
     public static final long MAX_RETENTION_MILLIS = 60L * 60 * 1000;
 
     public static final SessionLimits DEFAULT =
-        new SessionLimits(1_000, 5L * 60 * 1000, Granularity.DEFAULT);
+        new SessionLimits(MAX_RECORD_CAP, 5L * 60 * 1000, Granularity.DEFAULT);
 
     private final int recordCap;
     private final long retentionMillis;
diff --git 
a/oap-server/server-admin/dsl-debugging/src/test/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/DebugSessionRegistryTest.java
 
b/oap-server/server-admin/dsl-debugging/src/test/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/DebugSessionRegistryTest.java
index b6941fcb1a..bb9ea1763d 100644
--- 
a/oap-server/server-admin/dsl-debugging/src/test/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/DebugSessionRegistryTest.java
+++ 
b/oap-server/server-admin/dsl-debugging/src/test/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/DebugSessionRegistryTest.java
@@ -186,7 +186,7 @@ public class DebugSessionRegistryTest {
         final RuleKey key = new RuleKey(Catalog.OTEL_RULES, "vm", "cpu");
         final GateHolder holder = new GateHolder("hash-1");
         final DebugSessionRegistry registry = registryWith(holder, key);
-        final SessionLimits shortRetention = new SessionLimits(1_000, 1L);
+        final SessionLimits shortRetention = new 
SessionLimits(SessionLimits.MAX_RECORD_CAP, 1L);
         registry.install(key, "client-a", shortRetention);
         Thread.sleep(5);
 
diff --git a/test/e2e-v2/cases/dsl-debugging/mal/dsl-debug-flow.sh 
b/test/e2e-v2/cases/dsl-debugging/mal/dsl-debug-flow.sh
index f4e8f23c27..267234d760 100755
--- a/test/e2e-v2/cases/dsl-debugging/mal/dsl-debug-flow.sh
+++ b/test/e2e-v2/cases/dsl-debugging/mal/dsl-debug-flow.sh
@@ -199,6 +199,21 @@ last_with_metric="$(echo "${collect_body}" | jq --arg n 
"${METRIC_NAME}" \
 [ "${last_with_metric}" -gt 0 ] \
     || fail "no execution closes with a meterEmit sample carrying 
payload.metric=${METRIC_NAME}"
 
+# meterEmit must surface the actual reading on payload.value. The seed rule
+# expression is `request_count + pool_size + decoy.sum(['service_name'])` —
+# all three operands are LongValueHolder Sum/Avg outputs, so the reading
+# resolves to a JSON number (not a labeled DataTable, not NaN). Asserting
+# the field is present and numeric proves operators see the actual MAL
+# emission, not just the function name.
+missing_value="$(echo "${collect_body}" | jq --arg n "${METRIC_NAME}" \
+    '[.nodes[].records[] | .samples[-1]
+      | select(.payload.metric == $n)
+      | select(has("payload"))
+      | select((.payload.value == null) or ((.payload.value | type) != 
"number"))]
+      | length')"
+[ "${missing_value}" = "0" ] \
+    || fail "${missing_value} terminal meterEmit sample(s) missing numeric 
payload.value"
+
 log "✓ MAL shape valid (${records_count} records, ${total_samples} samples)"
 
 # --- Phase 6: stop session 
------------------------------------------------------------

Reply via email to