andrewmusselman opened a new issue, #16:
URL: https://github.com/apache/tooling-agents/issues/16

   ## Summary
   
   When `asvs_orchestrate` re-audits the same commit but `asvs_discover` 
produces a different domain partition than the previous run, per-section 
reports end up duplicated across multiple domain folders. The consolidated 
report is correct (consolidate only reads from the current run's domain dirs), 
but stale `.md` files from the previous run's domain assignment remain in the 
repo indefinitely. Nothing in the pipeline lists or deletes them.
   
   ## Steps to reproduce
   
   1. Run `asvs_orchestrate` against `apache/mahout` at commit `245aad3` with 
`discover=true`. Suppose discovery assigns ASVS section 6.1.1 to the 
`auth_identity` domain. The pipeline pushes 
`ASVS/reports/mahout/245aad3/auth_identity/6.1.1.md`.
   2. Re-run `asvs_orchestrate` against the same commit. Discovery is 
non-deterministic (Sonnet temperature 0.7) and this time decides 6.1.1 belongs 
in `session_management`. The pipeline pushes 
`ASVS/reports/mahout/245aad3/session_management/6.1.1.md`.
   3. Inspect the repo — both files now exist:
      - `ASVS/reports/mahout/245aad3/auth_identity/6.1.1.md` (stale, from run 1)
      - `ASVS/reports/mahout/245aad3/session_management/6.1.1.md` (current, 
from run 2)
   
   ## Why this happens
   
   `asvs_push_github` overwrites individual files in place (GET sha → PUT with 
sha = update). It only touches paths it's explicitly told about. Nothing in the 
pipeline lists a directory and deletes orphans.
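   The update-in-place behavior boils down to how the Contents API PUT body is built. A minimal sketch of that payload construction (a hypothetical helper, not the actual `asvs_push_github` code — the real tool presumably does this inline):

   ```python
   import base64

   def build_put_payload(message, content, existing_sha):
       """Body for PUT /repos/{owner}/{repo}/contents/{path}.

       Including "sha" makes the call an update of the existing blob;
       omitting it creates a new file. Either way, only the named path
       is touched -- nothing else in the directory is examined.
       """
       payload = {
           "message": message,
           "content": base64.b64encode(content).decode("ascii"),
       }
       if existing_sha is not None:
           payload["sha"] = existing_sha  # update in place
       return payload
   ```

   The point is that the API has no "sync this directory" semantics: every path is created or updated independently, so a path the current run never mentions is simply never visited.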
   
   `asvs_orchestrate` builds the consolidator's `report_directories` list from 
the **current** run's discovery output:
   
   ```python
   # asvs_orchestrate.py
   seen_pass_dirs = set()
   for pass_def, _ in work_items:
       pn = pass_def.get("name", "unknown")
       # Domain folder for this pass, relative to the push root.
       d = f"{push_directory}/{pn}" if push_directory else pn
       if d not in seen_pass_dirs:  # dedupe while preserving order
           report_directories.append(d)
           seen_pass_dirs.add(d)
   ```
   
   So `asvs_consolidate` reads only from the current domain dirs and produces a 
correct `consolidated.md`. The stale file is invisible to consolidation but 
remains in the repo.
   
   ## Impact
   
   - **Repo bloat.** Each re-audit of the same commit can accumulate orphan 
files in old domain folders. Over many reruns, the run directory grows 
arbitrarily large.
   - **QA tooling false negatives.** The README's "check for missing reports" 
snippet lists every `.md` file in the run directory and matches by section ID. 
With duplicates across two domain folders, both register as "present" — even if 
one is stale and contradicts the consolidated report.
   - **Confusion when reading per-section reports directly.** A user clicking 
through GitHub to read a per-section finding may land on the orphan and read a 
stale audit, with no indication that a newer version exists in a different 
folder.
   - **Doesn't break consolidation today**, but it's a quiet correctness 
hazard. If the orphan happens to contain different findings (different prompt, 
different Opus reasoning trace, different domain context), the per-section file 
disagrees with the consolidated report.
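   The QA gap in the second bullet is easy to see with a small helper. A sketch of a duplicate check the README snippet could grow into (hypothetical helper name; the real snippet only tests presence by section ID):

   ```python
   import re
   from collections import defaultdict

   SECTION_RE = re.compile(r"^\d+\.\d+\.\d+\.md$")

   def find_duplicate_sections(paths):
       """Group section-report paths by filename and return any section
       that appears under more than one domain folder."""
       by_name = defaultdict(list)
       for p in paths:
           name = p.rsplit("/", 1)[-1]
           if SECTION_RE.match(name):
               by_name[name].append(p)
       return {name: ps for name, ps in by_name.items() if len(ps) > 1}
   ```

   A presence-only check passes on the reproduction above; this grouping check would flag `6.1.1.md` as living in two domain folders.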
   
   ## Proposed fix
   
   Add a new orchestrator parameter `cleanStaleReports` (default `"false"` for 
safety; opt-in until proven). When `"true"`, after the audit phase but before 
consolidation, the orchestrator:
   
   1. Builds the set of expected paths under `outputDirectory/<repo>/<commit>/` 
from the current run's `work_items`.
   2. Lists the run directory on GitHub via the Trees API (`GET 
/repos/{owner}/{repo}/git/trees/{sha}?recursive=1`, filtered to the run path).
   3. Identifies any `.md` file under a domain folder whose path isn't in the 
expected set AND whose filename matches `^\d+\.\d+\.\d+\.md$` (so we never 
accidentally delete `consolidated.md`, `issues.md`, README files, or anything 
outside the section-report convention).
   4. Deletes each orphan via `DELETE /repos/{owner}/{repo}/contents/{path}` 
with the file's SHA.
   5. Logs a summary: how many orphans were found and deleted.
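   Steps 1–3 can be sketched as pure functions (the `(pass_def, section_id)` shape of `work_items` is an assumption for illustration; the tree listing is taken as an already-fetched list of paths):

   ```python
   import re

   SECTION_RE = re.compile(r"^\d+\.\d+\.\d+\.md$")  # strict section-report names

   def expected_report_paths(work_items, run_dir):
       # Step 1: expected per-section paths from the current run's partition.
       expected = set()
       for pass_def, section_id in work_items:
           domain = pass_def.get("name", "unknown")
           expected.add(f"{run_dir}/{domain}/{section_id}.md")
       return expected

   def select_orphans(tree_paths, expected, run_dir):
       # Steps 2-3: filter the recursive Trees API listing down to stale
       # section reports inside the run directory. Only filenames matching
       # the strict pattern are candidates, so consolidated.md etc. survive.
       orphans = []
       for path in tree_paths:
           if not path.startswith(run_dir + "/"):
               continue
           name = path.rsplit("/", 1)[-1]
           if SECTION_RE.match(name) and path not in expected:
               orphans.append(path)
       return orphans
   ```

   Step 4 then issues one `DELETE /repos/{owner}/{repo}/contents/{path}` per orphan, each with the blob SHA the tree listing already provided.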
   
   If the run aborted or was partial, cleanup should be skipped automatically even when `cleanStaleReports=true` — only run cleanup after a fully successful audit phase. We should never delete based on a partial `work_items` list: a section missing from this run may simply mean its audit failed, and that must not trigger deletion of last run's good copy.
   
   For the carve-out flow, cleanup runs against **both** repos: the private 
repo (full reports) and the public repo (redacted reports), each with its own 
token.
   
   ### Behavior matrix
   
   | `cleanStaleReports` | Audit phase status | Action |
   |---|---|---|
   | `"false"` (default) | any | No cleanup. Existing behavior. Orphans 
accumulate. |
   | `"true"` | All work items succeeded | Delete orphans matching 
`\d+\.\d+\.\d+\.md` not in expected set |
   | `"true"` | Partial failures | Skip cleanup; print warning. Orphans 
preserved (they may correspond to sections the audit failed to re-audit) |
   | `"true"` | All failed | Skip cleanup |
   
   ### What is **not** deleted
   
   - `consolidated.md`, `issues.md`, and any `consolidated-*.md` / 
`issues-*.md` variants
   - Anything not matching the strict `\d+\.\d+\.\d+\.md` section-report pattern
   - Anything outside `outputDirectory/<repo>/<commit>/`
   - Anything in subdirectories named `rerun/` (preserve manual rerun results — 
these are intentionally kept across runs per the README)
   - The `<commit>/` directory itself
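   The carve-outs above fit in a single predicate. A sketch (hypothetical function; the `PROTECTED_RE` check is redundant with the strict section pattern but cheap defense in depth):

   ```python
   import re

   SECTION_RE = re.compile(r"^\d+\.\d+\.\d+\.md$")
   PROTECTED_RE = re.compile(r"^(consolidated|issues)(-.+)?\.md$")

   def is_deletable(path, run_dir, expected):
       if not path.startswith(run_dir + "/"):
           return False                  # outside the run directory
       rel = path[len(run_dir) + 1:]
       parts = rel.split("/")
       if "rerun" in parts[:-1]:
           return False                  # preserve manual rerun results
       name = parts[-1]
       if PROTECTED_RE.match(name):
           return False                  # consolidated*/issues* variants
       if not SECTION_RE.match(name):
           return False                  # only strict section reports
       return path not in expected       # expected files are never orphans
   ```

   Deletions being file-level (`DELETE .../contents/{path}`) also means the `<commit>/` directory itself can't be removed by this code path; Git drops a directory only when its last file goes, and `consolidated.md` always survives.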
   
   ### Why opt-in
   
   Defaulting to `"false"` lets us roll out the change without surprising 
existing automation. Once it's been exercised against a few real runs and we're 
confident in the safety bounds, we can flip the default to `"true"` in a 
follow-up.
   
   ## Alternatives considered
   
   **Always delete the entire run directory before pushing new files.** Brute 
force; clean. Downside: if the new audit fails partway through, you've lost the 
previous good run with nothing to fall back on. The opt-in surgical approach 
above preserves the previous run as a safety net until the new run completes 
successfully.
   
   **Detect and warn only.** Add a check that lists orphans and prints a 
warning, but doesn't delete them. Punts the problem to the user. Useful as a 
debugging aid; not a fix.
   
   **Make discovery deterministic.** Lower discovery temperature to 0, or cache 
the discovery output keyed by commit hash so re-audits get the same partition. 
Helps for re-audits but not for legitimate "I tweaked the discovery prompt and 
want to re-run" cases. And doesn't address the core issue that the pipeline is 
silent about leftover files.
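   For reference, the caching variant is tiny — a sketch assuming a callable discovery step and a local cache directory (both names hypothetical):

   ```python
   import json
   import tempfile
   from pathlib import Path

   def cached_discovery(commit_sha, cache_dir, run_discovery):
       # Memoize the discovery partition per commit: re-audits of the
       # same commit reuse the first run's domain assignment verbatim.
       cache_file = Path(cache_dir) / f"{commit_sha}.json"
       if cache_file.exists():
           return json.loads(cache_file.read_text())
       partition = run_discovery(commit_sha)
       cache_file.write_text(json.dumps(partition))
       return partition
   ```

   Note this only pins the partition; a prompt tweak would still need a cache bust, which is exactly the "legitimate re-run" case it fails to cover.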
   
   ## Acceptance criteria
   
   - [ ] New `cleanStaleReports` parameter on `asvs_orchestrate` (string, 
defaults to `"false"`)
   - [ ] When enabled and the audit phase fully succeeded, orphan 
section-report files (matching `\d+\.\d+\.\d+\.md`) are deleted from the run 
directory
   - [ ] `consolidated.md`, `issues.md`, the `<commit>/` dir itself, and 
`rerun/` subdirs are never deleted
   - [ ] Cleanup is skipped (with a logged warning) when audit phase had 
failures
   - [ ] In carve-out mode, cleanup applies to both private and public repos 
using their respective tokens
   - [ ] Cleanup logs a summary line: `Cleanup: deleted N orphan section 
reports from <repo>/<dir>`
   - [ ] Tested against a re-audit of `apache/mahout` at the same commit with 
discovery seeded to produce a different partition; verify orphans are removed 
and `consolidated.md` content is unaffected
   - [ ] README updated: new param documented in the orchestrate input table; 
new troubleshooting note about discovery non-determinism and how to enable 
cleanup
   
   ## Out of scope for this issue
   
   - Cross-commit cleanup (e.g., deleting old `<commit_hash>/` directories that 
are no longer interesting). Different problem, different policy.
   - Cleanup of stale entries in the `audit-cache:*` data store namespaces. 
Unrelated; those caches are content-derived (file-set hash) and self-invalidate 
on real changes.
   - Making discovery deterministic. Worth its own discussion, but doesn't 
substitute for cleanup.
   
   ## Related
   
   - The existing `--consolidate-only` mode in `rerun-sections.sh` doesn't 
trigger this issue because it doesn't re-audit anything; it only reads existing 
per-section reports. Cleanup should NOT run in `--consolidate-only` mode either.
   - README QA section: the "check for missing reports" snippet should be 
updated once cleanup ships, since it currently has no way to detect duplicates.

