andrewmusselman opened a new issue, #16:
URL: https://github.com/apache/tooling-agents/issues/16
## Summary
When `asvs_orchestrate` re-audits the same commit but `asvs_discover`
produces a different domain partition than the previous run, per-section
reports end up duplicated across multiple domain folders. The consolidated
report stays correct, because `asvs_consolidate` reads only the current run's
domain directories, but the stale `.md` files written under the previous run's
domain assignment remain in the repo indefinitely: nothing in the pipeline
lists or deletes them.
## Steps to reproduce
1. Run `asvs_orchestrate` against `apache/mahout` at commit `245aad3` with
`discover=true`. Suppose discovery assigns ASVS section 6.1.1 to the
`auth_identity` domain. The pipeline pushes
`ASVS/reports/mahout/245aad3/auth_identity/6.1.1.md`.
2. Re-run `asvs_orchestrate` against the same commit. Discovery is
non-deterministic (Sonnet temperature 0.7) and this time decides 6.1.1 belongs
in `session_management`. The pipeline pushes
`ASVS/reports/mahout/245aad3/session_management/6.1.1.md`.
3. Inspect the repo — both files now exist:
- `ASVS/reports/mahout/245aad3/auth_identity/6.1.1.md` (stale, from run 1)
- `ASVS/reports/mahout/245aad3/session_management/6.1.1.md` (current,
from run 2)
## Why this happens
`asvs_push_github` overwrites individual files in place (GET sha → PUT with
sha = update). It only touches paths it's explicitly told about. Nothing in the
pipeline lists a directory and deletes orphans.
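For reference, this is the Contents API pattern involved. A minimal sketch, assuming `requests`; the helper name, token handling, and branch default are illustrative, not the tool's actual implementation:

```python
# Minimal sketch of the Contents API pattern asvs_push_github relies on
# (GET sha -> PUT with sha). Illustrative only.
import base64
import requests

API = "https://api.github.com"

def put_file(owner, repo, path, text, message, token, branch="main"):
    headers = {"Authorization": f"Bearer {token}",
               "Accept": "application/vnd.github+json"}
    url = f"{API}/repos/{owner}/{repo}/contents/{path}"

    # If the file already exists, fetch its blob SHA so the PUT is an update.
    resp = requests.get(url, headers=headers, params={"ref": branch})
    sha = resp.json().get("sha") if resp.status_code == 200 else None

    body = {"message": message,
            "content": base64.b64encode(text.encode()).decode(),
            "branch": branch}
    if sha:
        body["sha"] = sha  # update in place; without a sha the PUT only creates

    # Note: nothing here lists the directory, so files at paths the caller
    # does not name are left untouched -- which is exactly how orphans survive.
    requests.put(url, headers=headers, json=body).raise_for_status()
```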
`asvs_orchestrate` builds the consolidator's `report_directories` list from
the **current** run's discovery output:
```python
# asvs_orchestrate.py
seen_pass_dirs = set()
for pass_def, _ in work_items:
pn = pass_def.get("name", "unknown")
d = f"{push_directory}/{pn}" if push_directory else pn
if d not in seen_pass_dirs:
report_directories.append(d)
seen_pass_dirs.add(d)
```
So `asvs_consolidate` reads only from the current domain dirs and produces a
correct `consolidated.md`. The stale file is invisible to consolidation but
remains in the repo.
## Impact
- **Repo bloat.** Each re-audit of the same commit can accumulate orphan
files in old domain folders. Over many reruns, the run directory grows
arbitrarily large.
- **QA tooling false negatives.** The README's "check for missing reports"
snippet lists every `.md` file in the run directory and matches by section ID.
With duplicates across two domain folders, the section registers as "present"
in both, even if one copy is stale and contradicts the consolidated report
(see the sketch after this list).
- **Confusion when reading per-section reports directly.** A user clicking
through GitHub to read a per-section finding may land on the orphan and read a
stale audit, with no indication that a newer version exists in a different
folder.
- **Doesn't break consolidation today**, but it's a quiet correctness
hazard. If the orphan happens to contain different findings (different prompt,
different Opus reasoning trace, different domain context), the per-section file
disagrees with the consolidated report.
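To make the false negative concrete, here is a minimal sketch of a presence check of that shape; the README's real snippet may differ, and the paths and section IDs below are placeholders:

```python
# Hypothetical illustration of why a presence check keyed on section ID
# cannot flag duplicates across domain folders.
from collections import Counter
from pathlib import Path

run_dir = Path("ASVS/reports/mahout/245aad3")
expected = {"6.1.1", "6.1.2", "6.2.1"}        # assumed section IDs

found = [p.stem for p in run_dir.rglob("*.md")
         if p.stem.count(".") == 2]           # section reports only

missing = expected - set(found)               # what a presence check reports
duplicated = [s for s, n in Counter(found).items() if n > 1]

print("missing:", missing)         # empty even when 6.1.1.md exists twice
print("duplicated:", duplicated)   # the extra check this issue motivates
```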
## Proposed fix
Add a new orchestrator parameter `cleanStaleReports` (default `"false"` for
safety; opt-in until the cleanup has proven itself on real runs). When
`"true"`, after the audit phase but before consolidation, the orchestrator:
1. Builds the set of expected paths under `outputDirectory/<repo>/<commit>/`
from the current run's `work_items`.
2. Lists the run directory on GitHub via the Trees API (`GET
/repos/{owner}/{repo}/git/trees/{sha}?recursive=1`, filtered to the run path).
3. Identifies any `.md` file under a domain folder whose path isn't in the
expected set AND whose filename matches `^\d+\.\d+\.\d+\.md$` (so we never
accidentally delete `consolidated.md`, `issues.md`, README files, or anything
outside the section-report convention).
4. Deletes each orphan via `DELETE /repos/{owner}/{repo}/contents/{path}`
with the file's SHA.
5. Logs a summary: how many orphans were found and deleted.
If the run aborted or was only partially successful, cleanup should be skipped
automatically even when `cleanStaleReports` is `"true"`; it should only run
after a fully successful audit phase. We should never delete based on a partial
`work_items` list, since a section missing from this run may simply mean its
audit failed, and that must not trigger deletion of the previous run's good
copy.
For the carve-out flow, cleanup runs against **both** repos: the private
repo (full reports) and the public repo (redacted reports), each with its own
token.
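Putting the steps together, a rough sketch of what the cleanup pass could look like, assuming `requests` and the public GitHub REST endpoints named above. The function signature, `dry_run` flag, and commit-message text are illustrative, not a proposed interface:

```python
# Sketch of the proposed cleanup pass. Illustrative only; names and the
# run-directory layout are taken from this issue, not from existing code.
import re
import requests

API = "https://api.github.com"
SECTION_RE = re.compile(r"^\d+\.\d+\.\d+\.md$")

def clean_stale_reports(owner, repo, branch, run_prefix, expected_paths,
                        token, dry_run=False):
    headers = {"Authorization": f"Bearer {token}",
               "Accept": "application/vnd.github+json"}

    # Resolve the branch head, then list the whole tree recursively.
    # Very large trees can come back truncated; a real implementation
    # should detect that and bail out rather than delete on partial data.
    head = requests.get(f"{API}/repos/{owner}/{repo}/branches/{branch}",
                        headers=headers).json()["commit"]["sha"]
    tree = requests.get(f"{API}/repos/{owner}/{repo}/git/trees/{head}",
                        headers=headers,
                        params={"recursive": "1"}).json()["tree"]

    deleted = 0
    for entry in tree:
        path = entry["path"]
        if entry["type"] != "blob" or not path.startswith(run_prefix + "/"):
            continue  # only files under outputDirectory/<repo>/<commit>/
        name = path.rsplit("/", 1)[-1]
        if "/rerun/" in path or not SECTION_RE.match(name):
            continue  # never touch rerun/ results or non-section files
        if path in expected_paths:
            continue  # produced (or re-produced) by the current run
        if not dry_run:
            requests.delete(f"{API}/repos/{owner}/{repo}/contents/{path}",
                            headers=headers,
                            json={"message": f"asvs: remove stale report {name}",
                                  "sha": entry["sha"],
                                  "branch": branch})
        deleted += 1

    print(f"Cleanup: deleted {deleted} orphan section reports "
          f"from {owner}/{repo}:{run_prefix}")
    return deleted
```

In carve-out mode the orchestrator would call this twice, once per repo with the matching token, and skip the call entirely unless every work item succeeded.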
### Behavior matrix
| `cleanStaleReports` | Audit phase status | Action |
|---|---|---|
| `"false"` (default) | any | No cleanup. Existing behavior. Orphans accumulate. |
| `"true"` | All work items succeeded | Delete orphans matching `\d+\.\d+\.\d+\.md` not in expected set |
| `"true"` | Partial failures | Skip cleanup; print warning. Orphans preserved (they may correspond to sections the audit failed to re-audit) |
| `"true"` | All failed | Skip cleanup |
### What is **not** deleted
- `consolidated.md`, `issues.md`, and any `consolidated-*.md` /
`issues-*.md` variants
- Anything not matching the strict `\d+\.\d+\.\d+\.md` section-report pattern
- Anything outside `outputDirectory/<repo>/<commit>/`
- Anything in subdirectories named `rerun/` (preserve manual rerun results —
these are intentionally kept across runs per the README)
- The `<commit>/` directory itself
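Factoring those carve-outs into a single predicate keeps them testable in isolation. A sketch, with names and layout assumptions taken from this issue rather than from existing code:

```python
# Illustrative predicate encoding the "what is not deleted" rules above.
import re

SECTION_RE = re.compile(r"^\d+\.\d+\.\d+\.md$")

def is_deletable_orphan(path, run_prefix, expected_paths):
    """True only for stale section reports under the current run directory."""
    if not path.startswith(run_prefix + "/"):
        return False                   # outside outputDirectory/<repo>/<commit>/
    rel = path[len(run_prefix) + 1:]
    if rel.startswith("rerun/") or "/rerun/" in rel:
        return False                   # manual rerun results are kept
    name = rel.rsplit("/", 1)[-1]
    if not SECTION_RE.match(name):
        return False                   # consolidated*.md, issues*.md, READMEs, ...
    return path not in expected_paths  # the current run's own reports are kept
```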
### Why opt-in
Defaulting to `"false"` lets us roll out the change without surprising
existing automation. Once it's been exercised against a few real runs and we're
confident in the safety bounds, we can flip the default to `"true"` in a
follow-up.
## Alternatives considered
**Always delete the entire run directory before pushing new files.** Brute
force; clean. Downside: if the new audit fails partway through, you've lost the
previous good run with nothing to fall back on. The opt-in surgical approach
above preserves the previous run as a safety net until the new run completes
successfully.
**Detect and warn only.** Add a check that lists orphans and prints a
warning, but doesn't delete them. Punts the problem to the user. Useful as a
debugging aid; not a fix.
**Make discovery deterministic.** Lower discovery temperature to 0, or cache
the discovery output keyed by commit hash so re-audits get the same partition.
Helps for re-audits but not for legitimate "I tweaked the discovery prompt and
want to re-run" cases. And doesn't address the core issue that the pipeline is
silent about leftover files.
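For completeness, the caching variant could be as simple as memoizing the discovery output per commit. Purely illustrative; this does not use the pipeline's actual data store or discovery call:

```python
# Illustrative memoization of discovery output keyed by commit hash, so
# re-audits of the same commit reuse the same domain partition.
import json
from pathlib import Path

def cached_discovery(commit_sha, run_discovery, cache_dir=Path(".asvs-cache")):
    cache_dir.mkdir(exist_ok=True)
    entry = cache_dir / f"discovery-{commit_sha}.json"
    if entry.exists():
        return json.loads(entry.read_text())
    partition = run_discovery(commit_sha)   # the non-deterministic call
    entry.write_text(json.dumps(partition))
    return partition
```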
## Acceptance criteria
- [ ] New `cleanStaleReports` parameter on `asvs_orchestrate` (string,
defaults to `"false"`)
- [ ] When enabled and the audit phase fully succeeded, orphan
section-report files (matching `\d+\.\d+\.\d+\.md`) are deleted from the run
directory
- [ ] `consolidated.md`, `issues.md`, the `<commit>/` dir itself, and
`rerun/` subdirs are never deleted
- [ ] Cleanup is skipped (with a logged warning) when audit phase had
failures
- [ ] In carve-out mode, cleanup applies to both private and public repos
using their respective tokens
- [ ] Cleanup logs a summary line: `Cleanup: deleted N orphan section
reports from <repo>/<dir>`
- [ ] Tested against a re-audit of `apache/mahout` at the same commit, with
discovery seeded to produce a different partition; verified that orphans are
removed and that `consolidated.md` content is unaffected
- [ ] README updated: new param documented in the orchestrate input table;
new troubleshooting note about discovery non-determinism and how to enable
cleanup
## Out of scope for this issue
- Cross-commit cleanup (e.g., deleting old `<commit_hash>/` directories that
are no longer interesting). Different problem, different policy.
- Cleanup of stale entries in the `audit-cache:*` data store namespaces.
Unrelated; those caches are content-derived (file-set hash) and self-invalidate
on real changes.
- Making discovery deterministic. Worth its own discussion, but doesn't
substitute for cleanup.
## Related
- The existing `--consolidate-only` mode in `rerun-sections.sh` doesn't
trigger this issue because it doesn't re-audit anything; it only reads existing
per-section reports. Cleanup should NOT run in `--consolidate-only` mode either.
- README QA section: the "check for missing reports" snippet should be
updated once cleanup ships, since it currently has no way to detect duplicates.