lewismc commented on PR #922:
URL: https://github.com/apache/nutch/pull/922#issuecomment-4640505123

   @sebastian-nagel
   Should 
[§8.1](https://github.com/potiuk/nutch/blob/asf-security/threat-model-2026-06-06/THREAT_MODEL.md#8-security-properties-the-project-provides)
 also note that the `fetcher.parse` and `fetcher.store.content` properties 
(although mitigated by default) can result in corrupted segments in error 
scenarios? Maybe this is covered by the threat model language, but I thought 
I'd double check.
   
   I also think that §8.1 **No execution of fetched content** needs more 
thought. I'm fairly sure CVEs have been registered against Tika for code 
execution from archive files. I may be wrong but thought I'd raise it as well. 
**EDIT** I just noticed this is covered by the **Decompression/zip bombs** 
item stated in §9.
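
   To illustrate the kind of bound that a decompression/zip bomb mitigation 
implies, here is a minimal sketch that stops inflating an archive once a byte 
budget is exceeded. `BoundedUnzip`, `inflateWithLimit` and `zipOfZeros` are 
hypothetical names made up for this example; this is not how Nutch, `parse-zip` 
or Tika implement their limits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper for illustration only; not part of Nutch or Tika. */
final class BoundedUnzip {
    private BoundedUnzip() {}

    /**
     * Inflates every entry and returns the total uncompressed size,
     * or -1 as soon as the running total exceeds maxBytes.
     */
    static long inflateWithLimit(byte[] zipBytes, long maxBytes) {
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            while (zin.getNextEntry() != null) {
                int n;
                while ((n = zin.read(buf)) != -1) {
                    total += n;
                    if (total > maxBytes) {
                        return -1; // budget exceeded: stop reading immediately
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Builds an in-memory zip holding one entry of `size` zero bytes (demo input). */
    static byte[] zipOfZeros(int size) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bos)) {
            zout.putNextEntry(new ZipEntry("zeros.bin"));
            zout.write(new byte[size]);
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

   The point of the sketch is that the limit applies to the inflated size, 
not to the size of the fetched archive, which can be far smaller.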
   
   Here's my take on §14
   
   > Wave 1 — scope & adversary (these shape everything):
   Trusted-environment posture / nutch-server. Proposed: the supported posture 
is "trusted environment only"; an exposed no-auth nutch-server is OUT-OF-MODEL: 
non-default-build. Correct? (→ §5a, §3, §11a)
   
   Correct
   
   > Crawler SSRF / scope. Proposed: fetching internal/arbitrary URLs is 
by-design and controlled by operator URL filters, not a Nutch vulnerability; 
only escaping the configured scope is in-model. Correct? (→ §9, §11a)
   
   Correct
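
   For operators reading along, this is the sort of check a scope filter 
might apply when internal targets should be excluded. It is a sketch only: 
`InternalAddressCheck` and `isInternalLiteral` are hypothetical names, not an 
existing Nutch URL filter, and the method is meant for IP literals (a host 
name passed here would trigger a DNS lookup).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper for illustration only; not an existing Nutch plugin. */
final class InternalAddressCheck {
    private InternalAddressCheck() {}

    /**
     * Returns true if the given IP literal is loopback, site-local (RFC 1918),
     * link-local or the wildcard address. Unparseable input is treated as
     * internal so that the check fails closed.
     */
    static boolean isInternalLiteral(String ipLiteral) {
        try {
            // For IP literals this only validates the format; no DNS lookup happens.
            InetAddress addr = InetAddress.getByName(ipLiteral);
            return addr.isLoopbackAddress()
                || addr.isSiteLocalAddress()
                || addr.isLinkLocalAddress()
                || addr.isAnyLocalAddress();
        } catch (UnknownHostException e) {
            return true; // fail closed
        }
    }
}
```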
   
   > Primary adversary = the crawled-content supplier. Proposed: the main 
in-model attacker is whoever controls fetched content (hostile HTML/XML/feeds/ 
redirects), and parser robustness against it is the core property. Agree? Any 
other in-model adversary? (→ §7)
   
   Agree. No further comments, for now.
   
   > Wave 2 — properties & parsers: 4. Parser hardening (XXE / bombs). Are 
XML/feed parsers configured against XXE and decompression/entity bombs by 
default, or is that operator config? (→ §8, §9) 
   
   By default, via `http.content.limit` and CVE fixes in the Tika parsers. 
What about `parse-zip`?
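
   For reference, XXE hardening of a plain JAXP parser usually looks 
something like the sketch below. This uses only standard `javax.xml` calls; 
`HardenedXml` and `parses` are hypothetical names, and I am not claiming this 
is how the Nutch or Tika parsers are configured.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

/** Hypothetical helper for illustration only; not Nutch or Tika code. */
final class HardenedXml {
    private HardenedXml() {}

    /** Returns true if the XML parses under the hardened settings, false if rejected. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            // Enables the JDK's built-in processing limits (e.g. entity expansion).
            f.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Rejects any document that carries a DOCTYPE, which rules out
            // external entities and entity-expansion bombs.
            f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            f.setXIncludeAware(false);
            f.setExpandEntityReferences(false);

            DocumentBuilder builder = f.newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // rejected or malformed input
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

   With these settings a document containing a DOCTYPE is rejected outright, 
so the default error handler may log a fatal error to stderr for such input.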
   
   > 5. Resource line. Where is the line between an in-model parser-DoS on 
crafted content and an out-of-model "expensive legitimate crawl"? (→ §8)
   
   This is operator-dependent and would need to be assessed on a crawl-by-crawl 
basis. Instrumentation efforts may improve this understanding, and we do have 
[ErrorTracker](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/metrics/ErrorTracker.java)
 to assist operators. 
   
   > 6. Supported plugin set. Which protocol-* / parse-* / indexer-* plugins are 
first-class for security vs. contrib/unsupported? (→ §2/§3/§5a)
   
   The Nutch project has no concept of `contrib`. I believe the project 
considers all _official_ plugins to fall into the first-class bucket. Any other 
plugins are outside the project's control. 
   
   > Wave 3 — meta: 7. Canonicalization. This in-repo THREAT_MODEL.md is 
drafted as a superset of the website security-model page. Proposed: SECURITY.md 
points here for the full model, and the website page stays the operator-facing 
how-to. Agree, or should the website page remain canonical with this as a 
supplement? (→ meta)
   
   Agree, we can do that easily.

