lewismc commented on PR #922:
URL: https://github.com/apache/nutch/pull/922#issuecomment-4640505123
@sebastian-nagel Should [§8.1](https://github.com/potiuk/nutch/blob/asf-security/threat-model-2026-06-06/THREAT_MODEL.md#8-security-properties-the-project-provides) pay attention to the `fetcher.parse` and `fetcher.store.content` properties? Although mitigated by default, they can also result in corrupted segments in error scenarios. Maybe this is covered within the threat model language, but I thought I'd double check.

I also think that §8.1 **No execution of fetched content** needs more thought. I'm fairly sure CVEs have been registered against Tika for code execution from archive files. I may be wrong but thought I'd raise it as well. **EDIT:** I just noticed this is covered by **Decompression/zip bombs** in §9.

Here's my take on §14:

> Wave 1 — scope & adversary (these shape everything):
> 1. Trusted-environment posture / nutch-server. Proposed: the supported posture is "trusted environment only"; an exposed no-auth nutch-server is OUT-OF-MODEL: non-default-build. Correct? (→ §5a, §3, §11a)

Correct

> 2. Crawler SSRF / scope. Proposed: fetching internal/arbitrary URLs is by-design and controlled by operator URL filters, not a Nutch vulnerability; only escaping the configured scope is in-model. Correct? (→ §9, §11a)

Correct

> 3. Primary adversary = the crawled-content supplier. Proposed: the main in-model attacker is whoever controls fetched content (hostile HTML/XML/feeds/redirects), and parser robustness against it is the core property. Agree? Any other in-model adversary? (→ §7)

Agree. No further comments, for now.

> Wave 2 — properties & parsers:
> 4. Parser hardening (XXE / bombs). Are XML/feed parsers configured against XXE and decompression/entity bombs by default, or is that operator config? (→ §8, §9)

By default, via `http.content.limit` and CVE fixes in the Tika parsers. What about parse-zip?

> 5. Resource line. Where is the line between an in-model parser-DoS on crafted content and an out-of-model "expensive legitimate crawl"?
> (→ §8)

This is operator-dependent and would need to be assessed on a crawl-by-crawl basis. Instrumentation efforts may improve this understanding, and we do have [ErrorTracker](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/metrics/ErrorTracker.java) to assist operators.

> 6. Supported plugin set. Which protocol-* / parse-* / indexer-* plugins are first-class for security vs. contrib/unsupported? (→ §2/§3/§5a)

The Nutch project has no concept of `contrib`. I believe the project considers all _official_ plugins to fall into the first-class bucket. Any other plugins are outside the project's control.

> Wave 3 — meta:
> 7. Canonicalization. This in-repo THREAT_MODEL.md is drafted as a superset of the website security-model page. Proposed: SECURITY.md points here for the full model, and the website page stays the operator-facing how-to. Agree, or should the website page remain canonical with this as a supplement? (→ meta)

Agree, we can do that easily.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

