potiuk opened a new pull request, #922: URL: https://github.com/apache/nutch/pull/922
**This is a v0 draft proposal for the Nutch PMC to review. Please correct, reject, or discuss as needed.**

Following up on Lewis's note to go ahead and draft the model so it keeps momentum.

**Context.** The ASF Security team is preparing the project for an automated agentic security scan we're piloting; the scan runs against a threat model so its output is signal rather than noise. Discoverability already landed in #920; this PR adds the model content.

**What's in this PR:**

- **`THREAT_MODEL.md`** (new): a v0 threat model written from your website security model + the codebase, following the [threat-model-producer rubric](https://gist.github.com/potiuk/da14a826283038ddfe38cc9fe6310573). It is a **strict superset of the website security model**: nothing there is dropped; the sections the website page didn't cover (adversary model, enumerated properties, known non-findings, triage dispositions) are added and tagged *(inferred)* for you to confirm / correct / strike. Draft confidence ~10 documented / 22 inferred.
- **`AGENTS.md` + `SECURITY.md`**: re-pointed so the chain resolves `AGENTS.md → SECURITY.md → THREAT_MODEL.md`, keeping the website references intact.

**The framing to sanity-check:** Nutch fetches + parses untrusted web content by design, so crawler "SSRF" (reaching internal/arbitrary URLs) and "parses hostile HTML/XML" are by-design, scoped by your URL filters, not by Nutch refusing. The in-model adversary is the **malicious crawled-content supplier** (XXE, parser DoS, decompression bombs, ReDoS in URL-filter regex). §11a captures those recurring false positives.

**What we'd need from the PMC:** walk §14 (3 waves). A one-line confirm / correct / strike per question is enough; wave 1 (trusted-environment posture incl. `nutch-server`, crawler-SSRF/scope, the crawled-content adversary) shapes the rest. §14.7 asks whether this in-repo model or the website page should be canonical.
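To make the crawled-content adversary from the framing paragraph concrete, here is a minimal sketch of the kind of XXE hardening such a threat model would look for when XML from a crawled page is parsed. It is not taken from the Nutch codebase or from `THREAT_MODEL.md`; the class name `HardenedXml` and its helper methods are hypothetical, and it assumes the JDK's built-in JAXP parser, which supports the `disallow-doctype-decl` feature.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Nutch.
final class HardenedXml {

    // Builds a DOM parser that refuses DOCTYPE declarations outright,
    // which removes the entry point for external-entity (XXE) payloads.
    static DocumentBuilder newHardenedBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject any document that carries a DOCTYPE declaration.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Enable the JAXP secure-processing limits (entity expansion caps, etc.).
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        return factory.newDocumentBuilder();
    }

    // Returns true if the untrusted XML parses under the hardened settings,
    // false if the parser rejects it (for example because of a DOCTYPE).
    static boolean parses(String untrustedXml) {
        try {
            newHardenedBuilder().parse(new InputSource(new StringReader(untrustedXml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Under these assumptions, `HardenedXml.parses("<a>ok</a>")` accepts the plain document, while a document that declares an external entity in a DOCTYPE is rejected before any entity is resolved. Whether Nutch's parsers are configured this way is exactly the kind of question the PMC review of the model would settle.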
If you'd rather adjust the approach, comment on the PR or close it; entirely your call.

