Re: [PR] Add in-repo threat model and point AGENTS.md/SECURITY.md at it [nutch]

via GitHub Sat, 06 Jun 2026 15:01:40 -0700


lewismc commented on PR #922:
URL: https://github.com/apache/nutch/pull/922#issuecomment-4640505123

@sebastian-nagel
Should attention be paid in
[§8.1](https://github.com/potiuk/nutch/blob/asf-security/threat-model-2026-06-06/THREAT_MODEL.md#8-security-properties-the-project-provides)
to the `fetcher.parse` and `fetcher.store.content` properties? Although mitigated
by default, they can also result in corrupted segments in error scenarios. Maybe
this is covered within the threat model language, but I thought I'd double check.
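
For context, both properties are set in the operator's `conf/nutch-site.xml`. A minimal sketch follows; the values are assumed to mirror the shipped defaults and should be verified against `nutch-default.xml` for the release in use.

```xml
<!-- conf/nutch-site.xml (sketch): values are assumed to match the shipped defaults -->
<configuration>
  <!-- false: parsing runs as a separate step after fetching has finished -->
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>
  <!-- true: the fetcher stores the raw fetched content in the segment -->
  <property>
    <name>fetcher.store.content</name>
    <value>true</value>
  </property>
</configuration>
```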

I also think that §8.1 **No execution of fetched content** needs more
thought. I'm fairly sure CVEs have been registered against Tika for code
execution from archive files. I may be wrong but thought I'd raise it as well.
**EDIT** I just noticed this is covered by **Decompression/zip bombs**
in §9.

Here's my take on §14

> Wave 1 — scope & adversary (these shape everything):
Trusted-environment posture / nutch-server. Proposed: the supported posture
is "trusted environment only"; an exposed no-auth nutch-server is OUT-OF-MODEL:
non-default-build. Correct? (→ §5a, §3, §11a)

Correct

> Crawler SSRF / scope. Proposed: fetching internal/arbitrary URLs is
by-design and controlled by operator URL filters, not a Nutch vulnerability;
only escaping the configured scope is in-model. Correct? (→ §9, §11a)

Correct

> Primary adversary = the crawled-content supplier. Proposed: the main
in-model attacker is whoever controls fetched content (hostile HTML/XML/feeds/
redirects), and parser robustness against it is the core property. Agree? Any
other in-model adversary? (→ §7)

Agree. No further comments, for now.

> Wave 2 — properties & parsers: 4. Parser hardening (XXE / bombs). Are
XML/feed parsers configured against XXE and decompression/entity bombs by
default, or is that operator config? (→ §8, §9)

By default, via `http.content.limit` and CVE fixes in the Tika parsers. What
about parse-zip?
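
As a point of reference, `http.content.limit` is an ordinary Nutch property that operators can override in `conf/nutch-site.xml`. The fragment below is only a sketch: the value shown is illustrative, not a recommendation, and the actual default should be read from `nutch-default.xml`.

```xml
<!-- conf/nutch-site.xml (sketch): the value is illustrative only -->
<configuration>
  <!-- Upper bound, in bytes, on content kept per HTTP download;
       content longer than this is truncated. -->
  <property>
    <name>http.content.limit</name>
    <value>1048576</value>
  </property>
</configuration>
```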

> 5. Resource line. Where is the line between an in-model parser-DoS on
crafted content and an out-of-model "expensive legitimate crawl"? (→ §8)

This is operator dependent and would need to be assessed on a crawl-by-crawl
basis. Instrumentation efforts may improve this understanding and we do have
[ErrorTracker](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/metrics/ErrorTracker.java)
to assist operators.

> 6. Supported plugin set. Which protocol-* / parse-* / indexer-* plugins are
first-class for security vs. contrib/unsupported? (→ §2/§3/§5a)

The Nutch project has no concept of `contrib`. I believe the project
considers all _official_ plugins to fall into the first-class bucket. Any other
plugins are outside the project's control.

> Wave 3 — meta: 7. Canonicalization. This in-repo THREAT_MODEL.md is
drafted as a superset of the website security-model page. Proposed: SECURITY.md
points here for the full model, and the website page stays the operator-facing
how-to. Agree, or should the website page remain canonical with this as a
supplement? (→ meta)

Agree, we can do that easily.

