andrewmusselman opened a new issue, #29:
URL: https://github.com/apache/tooling-agents/issues/29

   ## Summary
   
   The Opus audit prompt now produces a healthy distribution of severities 
across all four levels (this fixed an earlier issue where every finding was 
marked HIGH). However, the model still systematically downgrades some Type 
B/C/D control gaps to High or Medium, despite the prompt's explicit gap-type 
rubric mandating Critical for these gap types.
   
   ## Evidence
   
   `task-sdk/95bbf6a/consolidated.md` (run with the updated prompt) — 
`FINDING-018`:
   
   > **Severity:** Medium
   > *Silent fallback to `_NullFernet` when key missing... the system silently 
degrades to a no-op cipher that stores secrets in plaintext. No warning is 
surfaced to operators, violating the principle of secure-by-default.*
   
   This is a textbook Type B gap: the encryption control exists (`_RealFernet` 
does encrypt-then-MAC correctly when keyed), but is silently bypassed in a 
foreseeable configuration state. The current rubric in `asvs_audit.py` says:
   
   > **Type B — Control EXISTS but is NOT CALLED at this entry point**
   > Severity: **CRITICAL** (false confidence — operators believe protected, 
aren't)
   
   Yet this finding is marked Medium. The model's description correctly 
identifies the secure-by-default violation; it's the severity assignment 
specifically that's drifting.
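
   For concreteness, the pattern the finding describes reduces to something 
like the following (a minimal reconstruction from the finding text; class and 
function names are illustrative, not the actual source under audit):
   
   ```python
   from cryptography.fernet import Fernet  # the real control when a key exists
   
   class _NullFernet:
       """No-op cipher: 'encryption' returns the plaintext unchanged."""
       def encrypt(self, data: bytes) -> bytes:
           return data
       def decrypt(self, data: bytes) -> bytes:
           return data
   
   def get_fernet(key: bytes | None):
       if not key:
           # The Type B gap: the control exists (Fernet, below) but this
           # branch silently bypasses it, and no warning reaches operators.
           return _NullFernet()
       return Fernet(key)  # AES-CBC + HMAC (encrypt-then-MAC) when keyed
   ```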
   
   ## Hypotheses
   
   In rough order of likelihood:
   
   1. **Real-world impact priors override the rubric.** Models reserve 
"Critical" for viscerally catastrophic outcomes (RCE, mass data exfiltration) 
regardless of what a written rubric says. Type B/C/D gaps that don't trigger 
that visceral reaction get downgraded by the model's own intuition. The rubric 
is read but not enforced.
   
   2. **The final-pass self-check is treated as optional.** The prompt's "walk 
back through every High and re-classify if Type B/C/D" step is the kind of 
self-review LLMs often perform perfunctorily, especially when it would mean 
reversing their initial confident output.
   
   3. **Within-batch relative calibration.** When evaluating multiple findings 
in one Opus call, the model spreads severities to look balanced rather than 
applying the rubric absolutely. If a batch has 10 findings, the model produces 
~1 Critical, 2-3 High, 4-5 Medium, 1-2 Low even when the rubric would say 6 of 
them are Critical.
   
   All three are manifestations of the same underlying failure: with current 
LLM behavior, rubrics are advisory, not enforceable.
   
   ## Proposed Fixes (ordered by effort)
   
   ### Fix A: Gap-type-first output format (cheap, try this first)
   
   Restructure the per-finding output format so the model writes `gap_type` 
*before* `severity`, with severity rendered as a deterministic lookup from gap 
type:
   
   ```
   For each finding, output the fields in this exact order:
   
     Gap type: [A | B | C | D]
     Severity: (look up from gap type, do NOT exercise discretion):
               - Type A → severity per impact (Critical/High/Medium/Low)
               - Type B → CRITICAL
               - Type C → CRITICAL
               - Type D → CRITICAL
     Finding ID: ASVS-XXX-{CRIT|HIGH|MED|LOW}-NNN
     ...rest of finding...
   ```
   
   Forcing function: once the model has committed to "Type B" on the gap_type 
line, writing anything other than CRITICAL on the very next line is harder than 
it would be without that commitment in context. This is the cheapest experiment 
to run: a pure prompt change, with no code structure changes.
   
   ### Fix B: Separate severity-assignment pass (medium effort)
   
   After the Opus audit produces findings (with gap_type tags), make a small 
Sonnet call that walks every finding and overrides severity per the rubric:
   
   ```
   Input: list of findings with (id, gap_type, description)
   Task: for each finding, output the rubric-mandated severity.
          Type A: severity per impact, you decide.
          Type B/C/D: ALWAYS Critical, no exceptions.
   Output: JSON list of { id, severity }.
   ```
   
   Then post-process: parse the override list, walk the markdown, swap 
severities (and Finding ID severity tokens) accordingly.
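
   A rough sketch of that post-processing step, assuming findings are headed 
by their ID and carry a `**Severity:** ...` line as shown in the Evidence 
section (the regexes and the `ASVS-XXX` ID shape are guesses at the report 
layout, not the real consolidate-phase code):
   
   ```python
   import json
   import re
   
   # Token embedded in Finding IDs (ASVS-XXX-{CRIT|HIGH|MED|LOW}-NNN).
   SEVERITY_TOKEN = {"Critical": "CRIT", "High": "HIGH", "Medium": "MED", "Low": "LOW"}
   
   def apply_severity_overrides(markdown: str, overrides_json: str) -> str:
       """Apply the Sonnet override list ({id, severity} pairs) to the report."""
       for item in json.loads(overrides_json):
           fid, sev = item["id"], item["severity"]  # e.g. "FINDING-018", "Critical"
           num = fid.rsplit("-", 1)[-1]             # "018"
           # Swap the "**Severity:** ..." line inside this finding's block.
           markdown = re.sub(
               rf"({re.escape(fid)}.*?\*\*Severity:\*\*\s*)\w+",
               rf"\g<1>{sev}",
               markdown,
               count=1,
               flags=re.DOTALL,
           )
           # Swap the severity token embedded in the structured Finding ID.
           markdown = re.sub(
               rf"(ASVS-\w+-)(?:CRIT|HIGH|MED|LOW)(-{num}\b)",
               rf"\g<1>{SEVERITY_TOKEN[sev]}\g<2>",
               markdown,
           )
       return markdown
   ```
   
   A stricter variant would skip the Sonnet call for Type B/C/D entirely and 
compute those severities in code, which is essentially Fix C's lookup grafted 
onto Fix B's pipeline.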
   
   Advantages over Fix A: severity-assignment runs in a fresh context with one 
job. The model isn't asked to override its own prior judgment — it's asked to 
apply a simple rule on data it didn't produce. This decouples analytical 
reasoning from severity calibration. Sonnet is cheap and fast; cost overhead is 
negligible against the Opus call.
   
   ### Fix C: Structured output with deterministic severity (high effort)
   
   Switch audit output from markdown to JSON with a schema where:
   - `gap_type` is required
   - `severity` is **not** in the schema; it is computed deterministically 
post-call from `gap_type` (sketched just below)
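
   A minimal sketch of that deterministic lookup (the function name and the 
`impact` field for the Type A case are assumptions about the eventual schema):
   
   ```python
   # The gap-type rubric as data: Type B/C/D always map to Critical.
   GAP_TYPE_SEVERITY = {"B": "Critical", "C": "Critical", "D": "Critical"}
   
   def compute_severity(finding: dict) -> str:
       """Deterministic severity: the model never emits this field."""
       if finding["gap_type"] in GAP_TYPE_SEVERITY:
           return GAP_TYPE_SEVERITY[finding["gap_type"]]
       # Type A: severity per impact, the one place model judgment applies.
       return finding["impact"]
   ```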
   
   This bypasses model judgment for severity entirely. It is the most robust 
option, but requires:
   - Audit agent rewrite to emit JSON
   - Consolidate Phase 1 rewrite to read JSON
   - Format-pass output (Sonnet markdown rendering) reworked
   - Parser updates throughout
   
   Bigger change. Defer unless A and B both fail.
   
   ## Recommended Order
   
   1. Try **Fix A** first. One prompt edit, cheap to validate. If 
FINDING-018-style cases get classified Critical on the next re-run, done.
   2. If Fix A doesn't move the needle on a re-audit (after the cache-key fix 
lands so prompt edits actually take effect), escalate to **Fix B**.
   3. **Fix C** stays in reserve unless calibration drift becomes a recurring 
problem.
   
   ## Validation
   
   1. Re-run airflow-core at L3 with the same source as the validation run from 
this session.
   2. Inspect findings via `inspect_audit_findings.py`: the Critical count 
should *increase* relative to the 23-Critical baseline, since Fix A targets 
cases that previously fell to Medium/High despite a Type B/C classification. (A 
quick independent tally is sketched after this list.)
   3. Spot-check specific known-Type-B cases:
      - Silent encryption fallback (`_NullFernet` style)
      - Authorization bypass via suppressed exception
      - Token revocation list never enforced during validation
      - Whatever the next analogous case is in the source under audit
   4. Each of these should land at Critical, not Medium/High.
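
   For the count in step 2, a quick tally that doesn't depend on 
`inspect_audit_findings.py` (it assumes the `**Severity:** X` line format 
shown in the Evidence section):
   
   ```python
   # Usage: python severity_histogram.py task-sdk/95bbf6a/consolidated.md
   import re
   import sys
   from collections import Counter
   
   text = open(sys.argv[1], encoding="utf-8").read()
   print(Counter(re.findall(r"\*\*Severity:\*\*\s*(\w+)", text)))
   ```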
   
   ## Dependencies
   
   - This work is gated on the [audit cache key 
fix](./issue_audit_cache_key.md) — without it, prompt edits don't take effect 
on `clearCache=false` runs and validation requires expensive `clearCache=true`.
   
   ## Related
   
   - Prior issue: the severity-threshold prompt block was the dominant cause of 
the HIGH lock-on. That's resolved.
   - Earlier session work: audit prompt now has explicit gap-type table with 
examples and a final-pass self-check. This issue addresses the residual drift 
remaining after those edits.

