potiuk opened a new pull request, #922:
URL: https://github.com/apache/nutch/pull/922

   **This is a v0 draft proposal for the Nutch PMC to review — please correct, 
reject, or discuss as needed.** Following up on Lewis's note to go ahead and 
draft the model so it keeps momentum.
   
   **Context.** The ASF Security team is preparing the project for an automated 
agentic security scan we're piloting; the scan runs against a threat model so 
its output is signal rather than noise. Discoverability already landed in #920; 
this PR adds the model content.
   
   **What's in this PR:**
   - **`THREAT_MODEL.md`** (new) — a v0 threat model written from your website 
security model + the codebase, following the [threat-model-producer 
rubric](https://gist.github.com/potiuk/da14a826283038ddfe38cc9fe6310573). It is 
a **strict superset of the website security model** — nothing there is dropped; 
the sections the website page didn't cover (adversary model, enumerated 
properties, known non-findings, triage dispositions) are added and tagged 
*(inferred)* for you to confirm / correct / strike. Draft confidence ~10 
documented / 22 inferred.
   - **`AGENTS.md` + `SECURITY.md`** — re-pointed so the chain resolves 
`AGENTS.md → SECURITY.md → THREAT_MODEL.md`, keeping the website references 
intact.
   
   **The framing to sanity-check:** Nutch fetches and parses untrusted web 
content by design, so crawler "SSRF" (reaching internal/arbitrary URLs) and 
"parses hostile HTML/XML" are expected behavior: scope is set by your URL 
filters, not by Nutch refusing to fetch. The in-model adversary is the 
**malicious crawled-content supplier** (XXE, parser DoS, decompression bombs, 
ReDoS in URL-filter regex). §11a captures those recurring false positives.
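
   To make one of the in-model threats concrete, here is a minimal sketch of a 
decompression-bomb guard. It is illustrative only and is not taken from the 
Nutch codebase: the function name `bounded_inflate` and the `limit` parameter 
are hypothetical, and it uses Python's standard `zlib` module only to show the 
general idea of capping output size while inflating untrusted content.

```python
import zlib


def bounded_inflate(data: bytes, limit: int) -> bytes:
    """Inflate zlib-compressed bytes, refusing to produce more than `limit` bytes.

    Hypothetical helper for illustration; not Nutch's implementation.
    """
    inflater = zlib.decompressobj()
    # Ask for at most limit + 1 bytes so that exceeding the cap is detectable
    # without ever materializing the full (possibly huge) output.
    out = inflater.decompress(data, limit + 1)
    if len(out) > limit or inflater.unconsumed_tail:
        raise ValueError("decompressed size exceeds limit")
    return out


if __name__ == "__main__":
    small = zlib.compress(b"hello")
    print(bounded_inflate(small, 1024))

    bomb = zlib.compress(b"\0" * 10_000_000)  # tiny input, large output
    try:
        bounded_inflate(bomb, 1024)
    except ValueError as exc:
        print("rejected:", exc)
```

   A finding of this shape (hostile content exhausting memory in a parser or 
decompressor) is the kind the malicious crawled-content supplier adversary is 
meant to cover, as opposed to the by-design fetch behavior described above.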
   
   **What we'd need from the PMC:** walk §14 (3 waves) — a one-line confirm / 
correct / strike per question is enough; wave 1 (trusted-environment posture 
incl. `nutch-server`, crawler-SSRF/scope, the crawled-content adversary) shapes 
the rest. §14.7 asks whether this in-repo model or the website page should be 
canonical.
   
   If you'd rather adjust the approach, comment on the PR or close it — 
entirely your call.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.