andrewmusselman opened a new issue, #25:
URL: https://github.com/apache/tooling-agents/issues/25
# Audit produces zero Critical findings when `severityThreshold` is set; move filtering to consolidate phase
## Summary
When the orchestrator runs with `severityThreshold=HIGH`, the audit phase
(Opus) emits **zero Critical findings** across every section, even for findings
whose own descriptions explicitly classify themselves as Type B/C/D control
gaps — which the audit rubric defines as Critical. Findings that should be
Critical instead land at HIGH.
Two recent Apache Airflow audits illustrate:
| Run | Raw findings | HIGH | CRITICAL |
|---|---|---|---|
| `airflow-core` @ 7ca4c75 | 25 | 25 | 0 |
| `task-sdk` @ 7ca4c75 | 46 | 46 | 0 |
Across **45 sections** that produced findings, the audit emitted Finding IDs
of the form `ASVS-XXX-HIGH-NNN` exclusively — never `CRIT`, never `MED`, never
`LOW`. The model is locked onto HIGH.
## Evidence
CouchDB inspection of the per-section reports (raw audit output, before
consolidate touches anything) confirms this is an **audit-phase** issue, not a
consolidate-phase downgrade. Examples where the model's own description names a
Type B/C gap but the finding is marked HIGH:
- `airflow-core` finding on multi-team authorization: *"the control EXISTS
(`is_authorized_pool`) but is NOT CALLED when the teams set is empty"* —
textbook Type B, marked HIGH.
- `task-sdk` finding on deserialization allow-list: *"Type B gap where a
security control EXISTS (allow-list with regex matching) but is NOT correctly
applied"* — Type B, marked HIGH.
- `task-sdk` finding on auth backend fallback: *"Type C gap where the
control is called (server-side authorization checks token scope) but the result
is ignored"* — Type C, marked HIGH.
## Root Cause
The `if severity_threshold:` block in `asvs_audit.py` (~line 785) injects
this text into the Opus system prompt:
```
## Severity Threshold
Only report findings at these severity levels: CRITICAL, HIGH.
Do not include findings below HIGH severity.
```
The word "HIGH" appears prominently (twice in two consecutive lines, plus as
the threshold value itself). The model treats this as a strong anchor — HIGH
becomes the operating zone — and the gap-type rules elsewhere in the prompt
that say "Type B/C/D = CRITICAL" lose to that anchoring.
This was confirmed by ruling out alternatives:
- Not a consolidate-phase downgrade: raw per-section reports already have 0
Critical before consolidate runs.
- Not a parser bug: 22/22 and 23/23 reports with findings parsed cleanly via
the `ASVS-XXX-{SEV}-NNN` format strategy. The model is producing structured
Finding IDs — just only ever with the HIGH token.
- Not a redaction artifact: counts are 0 in both private
(`tooling-runbooks`) and public (`tooling-agents`) consolidated.md.
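
The per-report tally behind the "only ever HIGH" observation can be sketched as follows. This is a minimal illustration, not the project's parser: the regex for the section token, the function name `count_severity_tokens`, and the sample report text are all assumptions; only the `ASVS-XXX-{SEV}-NNN` shape and the `CRIT`/`HIGH`/`MED`/`LOW` tokens come from the issue.

```python
import re
from collections import Counter

# Assumed ID shape: ASVS-<section>-<SEV>-<NNN>. The section token format
# ([A-Za-z0-9.]+) is a guess; only the overall shape comes from the issue.
FINDING_ID = re.compile(r"\bASVS-[A-Za-z0-9.]+-(CRIT|HIGH|MED|LOW)-\d+\b")


def count_severity_tokens(report_text: str) -> Counter:
    """Count the severity token of every Finding ID found in a report."""
    return Counter(m.group(1) for m in FINDING_ID.finditer(report_text))


# Hypothetical report excerpt, invented for illustration only.
sample = """
### ASVS-512-HIGH-001: allow-list regex not anchored
### ASVS-512-HIGH-002: fallback ignores scope check
"""

counts = count_severity_tokens(sample)
print(dict(counts))
```

A report that parses cleanly but yields a zero count for `CRIT` is consistent with a severity-assignment problem rather than a parsing failure.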
## Proposed Fix: Move severity threshold from audit to consolidate
**Rationale.** Audit (Opus) calls are the expensive part of the pipeline.
Their output should be maximally reusable across different rendering choices.
Pre-filtering by severity at audit time:
- causes the prompt-anchoring bug above,
- locks the audit cache to one threshold value (re-running with a different
threshold means re-auditing all sections), and
- couples audit-time computation to consumer-side display choices that don't
need to be coupled.
Filtering at consolidate time:
- removes the anchoring problem entirely (audit no longer sees the threshold
word),
- makes audit results threshold-independent (re-render at any threshold
without re-auditing — large speedup for iteration), and
- centralizes display-policy decisions where they belong.
The CouchDB refactor was the change that made this design viable —
per-section reports are now cheap to store fully, so we no longer need
audit-time pre-filtering as a noise-reduction trick.
## Implementation
### 1. `asvs_audit.py` — remove threshold prompt block
Delete lines ~785–790:
```python
if severity_threshold:
    severity_levels = {"CRITICAL": 4, "HIGH": 3, "MEDIUM": 2, "LOW": 1}
    threshold_val = severity_levels.get(severity_threshold.upper(), 0)
    if threshold_val > 0:
        included = [k for k, v in severity_levels.items() if v >= threshold_val]
        analysis_system_prompt += f"\n## Severity Threshold\n..."
```
The `severityThreshold` input parameter can stay declared on the agent for
orchestrator interface compatibility but becomes unused inside audit.
### 2. `asvs_consolidate.py` — apply filter after Phase 4 merge
After the existing `all_findings.sort(...)` (around line 957), before global
ID assignment:
```python
if severity_threshold:
    severity_levels = {"CRITICAL": 4, "HIGH": 3, "MEDIUM": 2, "LOW": 1,
                       "INFORMATIONAL": 0}
    threshold_val = severity_levels.get(severity_threshold.upper(), 0)
    pre_filter = len(all_findings)
    all_findings = [
        f for f in all_findings
        if severity_levels.get(f.get("severity", "Informational").upper(), 0) >= threshold_val
    ]
    print(f"Severity filter ({severity_threshold} and above): {pre_filter} → {len(all_findings)}")
```
The metadata table in consolidated.md already displays `Severity Threshold`
correctly — no template change needed.
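
For testing the filter in isolation, the same logic can be wrapped in a standalone function. This is a sketch under assumptions: the function name `filter_by_severity` and the sample findings are hypothetical, and findings are assumed to be dicts with a `severity` key as in the snippet above.

```python
SEVERITY_LEVELS = {"CRITICAL": 4, "HIGH": 3, "MEDIUM": 2, "LOW": 1,
                   "INFORMATIONAL": 0}


def filter_by_severity(findings, severity_threshold=None):
    """Return findings at or above the threshold; no threshold keeps everything."""
    if not severity_threshold:
        return list(findings)
    threshold_val = SEVERITY_LEVELS.get(severity_threshold.upper(), 0)
    return [
        f for f in findings
        # Missing or unrecognized severities rank as 0 (Informational).
        if SEVERITY_LEVELS.get(str(f.get("severity", "Informational")).upper(), 0)
        >= threshold_val
    ]


# Hypothetical findings, invented for illustration only.
sample_findings = [
    {"id": "a", "severity": "Critical"},
    {"id": "b", "severity": "High"},
    {"id": "c", "severity": "Medium"},
    {"id": "d"},  # no severity recorded
]

kept = filter_by_severity(sample_findings, "HIGH")
print([f["id"] for f in kept])
```

Note that, as in the snippet above, an unrecognized threshold string maps to 0 and therefore filters nothing; whether that should instead raise an error is a design choice left open here.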
### 3. Gap-type rubric improvements
Independent of this issue, the audit prompt was also strengthened in this
branch with concrete Type B/C/D examples and a final self-check pass. Keep
those — they're defense-in-depth and improve audit quality regardless of where
the threshold lives. The "leave it as HIGH" guard in the self-check pass should
be reworded to "keep whatever severity you assigned during initial drafting" to
allow legitimately-Critical Type A findings (e.g., no auth on an admin
endpoint) to remain Critical.
## Validation Plan
1. Re-run one Airflow module with `clearCache=true` and
`severityThreshold=HIGH`.
2. Run `inspect_audit_findings.py` against the resulting CouchDB namespace —
Critical findings should now appear in sections involving Type B/C/D gaps
(deserialization bypass, encryption-disabled fallback, authorization bypass via
parse failure, etc.).
3. Verify `consolidated.md` contains only HIGH and CRITICAL findings when the threshold is set to HIGH.
4. Verify `consolidated.md` includes all severities when no threshold is set.
5. Compare consolidated counts before and after — expect Critical count to
be non-zero on at least one of the two modules.
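
A rough consistency check along these lines could support step 2. It is a hypothetical helper, not the actual `inspect_audit_findings.py`: the `description` field name, the function name `self_described_gap_mismatches`, and the sample data are assumptions. It flags findings whose own text names a Type B/C/D gap but whose severity is not Critical.

```python
import re

# Matches "Type B", "Type C", or "Type D" in a finding's own description,
# including forms such as "Type B/C".
GAP_TYPE = re.compile(r"\bType\s+([BCD])\b")


def self_described_gap_mismatches(findings):
    """Return findings that name a Type B/C/D gap but are not marked Critical."""
    return [
        f for f in findings
        if GAP_TYPE.search(f.get("description", ""))
        and str(f.get("severity", "")).upper() != "CRITICAL"
    ]


# Hypothetical findings, invented for illustration only.
sample_findings = [
    {"severity": "High", "description": "Type B gap where a control EXISTS but is not applied"},
    {"severity": "Critical", "description": "Type C gap where the result is ignored"},
    {"severity": "High", "description": "Missing rate limiting on login"},
]

mismatches = self_described_gap_mismatches(sample_findings)
print(len(mismatches))
```

A text match like this only catches findings that label their gap type explicitly; a finding that describes a Type B gap without naming it would still need manual review.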
## Side Benefits
- Per-section CouchDB reports become full audit data, reusable across any
rendering.
- Re-running consolidate at a different threshold is now cheap — no Opus
calls, just Sonnet for synthesis (cached on input hash).
- Removes a whole class of prompt-anchoring bug (any future "X Threshold"
instructions for level/scope/severity that anchor on prominent words).
## Related Follow-ups (separate issues)
- **Audit cache invalidation:** the cache key `f"batch-{i}"` doesn't
incorporate prompt content, so prompt edits don't bust the cache. Adding
`hashlib.sha256(analysis_system_prompt.encode()).hexdigest()[:8]` to the key
would fix it.
- **Empty-bundle stubs in CouchDB:** bundles that find zero matching files
currently write `Error: No files found in namespaces ...` stubs to the reports
namespace. One was observed at `xml_parsing/1.5.1.md` in the task-sdk run.
Bundle agent should log-and-return without writing a CouchDB key for empty
results.
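
The cache-key change from the first follow-up could look roughly like this. The function name `batch_cache_key` and the exact key layout are assumptions; only the `batch-{i}` prefix and the `hashlib.sha256(...).hexdigest()[:8]` expression come from the text above.

```python
import hashlib


def batch_cache_key(i: int, analysis_system_prompt: str) -> str:
    """Build a cache key that changes whenever the prompt text changes."""
    prompt_hash = hashlib.sha256(analysis_system_prompt.encode()).hexdigest()[:8]
    # Assumed layout: the original "batch-{i}" key with the hash appended.
    return f"batch-{i}-{prompt_hash}"


key_before = batch_cache_key(3, "You are an auditor.")
key_after = batch_cache_key(3, "You are an auditor. Apply the gap-type rubric.")
print(key_before != key_after)
```

With a key of this shape, editing the prompt produces new keys for every batch, so stale audit results are not served after a prompt change.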