sebastian-nagel commented on PR #922:
URL: https://github.com/apache/nutch/pull/922#issuecomment-4669330287

   > Is there attention to be paid to 
[§8.1](https://github.com/potiuk/nutch/blob/asf-security/threat-model-2026-06-06/THREAT_MODEL.md#8-security-properties-the-project-provides)
 the `fetcher.parse` and `fetcher.store.content` properties (although mitigated 
by default) can also result in corrupted segments in error scenarios.
   
   - `fetcher.parse`: The fetcher uses the ParseUtil class to let the parser 
run in an ExecutorService with a default timeout of 30 seconds. In this regard, 
there is no difference from parsing the segment's content in a separate job. Of 
course, a parsing fetcher map task may require more resources (mostly RAM, less 
CPU) than a parser map task because it does both fetching and parsing, and also 
because of its multi-threaded implementation.
   - `fetcher.store.content`: By default, `http.content.limit` (resp. 
`ftp.content.limit`) restricts the size of stored content to 1 MiB per 
document. To run a separate parse job, content needs to be stored in the 
segment. Truncated contents, exceeding the content limit, are not parsed by 
default (property `parser.skip.truncated`).
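
   The timeout mechanism described for `fetcher.parse` can be sketched roughly 
as follows. This is a minimal, hypothetical illustration in plain Java, not the 
actual ParseUtil code: the class name `TimedParseSketch`, the method 
`runWithTimeout`, and the millisecond values are assumptions made for the 
example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a time-limited parse call; not Nutch's ParseUtil. */
public class TimedParseSketch {

  /**
   * Runs the task in a separate thread and returns its result,
   * or null if the task fails or does not finish within the timeout.
   */
  public static String runWithTimeout(Callable<String> parseTask, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<String> future = executor.submit(parseTask);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // request interruption of the worker thread
      return null;
    } catch (ExecutionException e) {
      return null; // the task itself threw an exception
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      future.cancel(true);
      return null;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    String fast = runWithTimeout(() -> "parsed", 1000);
    String slow = runWithTimeout(() -> { Thread.sleep(5000); return "too late"; }, 100);
    System.out.println("fast=" + fast + ", slow=" + slow);
  }
}
```

   Note that cancelling the future only interrupts the worker thread. A task 
that ignores interrupts keeps consuming CPU and memory until it finishes on its 
own, so a timeout of this kind bounds the waiting time but not necessarily the 
resource usage.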
   
   In case of an unhandled error, the fetch or parse task should fail, and not 
produce any final output. That is, the segment is incomplete (no fetch or parse 
data), but not "corrupted".
   
   Generally, we might add notes about resource requirements and limitations:
   
   1. Nutch's Fetcher is multi-threaded in nature. It buffers the fetched 
content before it's spilled to disk, caches robots.txt rules, and optionally 
parses the fetched documents. The Fetcher requires sufficient RAM to hold data 
temporarily in memory and sufficient CPU to process the content in parallel 
threads. The resource requirements depend on the configuration and the number 
of assigned fetcher threads.
   
   2. Nutch's processing "pipeline" to fetch, parse and index web content is 
designed to process reasonably sized web documents. By default, the size of a 
single document (HTML, PDF, or any other document format) is limited to 1 MiB. 
Nutch can process larger documents if sufficient RAM is configured. However, 
there are limitations:
      - Nutch buffers documents using a Java byte array which is limited to 
about 2 GiB. This is a hard limitation.
      - Nutch cannot process "indefinite" content streams (internet radio or 
TV). In addition to the content limit, there are strict timeouts to download a 
single document.
      - Usually, it's already troublesome to make Nutch process documents 
exceeding 100 MiB. In other words, Nutch isn't the ideal tool to collect, e.g., 
very large PDF files.
   
   Where should we add these limitations? They also describe what Nutch 
cannot be used for.
   
   
   > What about parse-zip?
   
   It's rarely used, and it's a simple plugin that parses zip files 
recursively. Likely, it's not safe. Again, the 30 second parse timeout protects 
it from the worst outcome. But eventually we need to drop it, fix it, or mark 
it as unsafe, that is, to be used only for trusted content.
   
   Btw., zip bombs also threaten the protocol plugins via `Content-Encoding` 
and `Transfer-Encoding`, at least for the encodings `deflate`, `gzip` and 
`br`. Since the content limit is applied to the decoded/uncompressed content, 
the protocol plugins should be safe. But we should eventually test all protocol 
plugins.
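
   To illustrate why applying the limit to the decoded content helps against 
zip bombs, here is a small sketch. It is not taken from any Nutch protocol 
plugin: the class `BoundedDecodeSketch`, its methods, and the sizes used are 
hypothetical, and only `gzip` is covered because it ships with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: cap the number of decoded bytes, not the compressed size. */
public class BoundedDecodeSketch {

  /** Gzip-compresses the given bytes in memory (used here to build test input). */
  public static byte[] gzip(byte[] raw) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
        gz.write(raw);
      }
      return bos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Decodes gzip data but keeps at most 'limit' decoded bytes, then stops reading. */
  public static byte[] decodeWithLimit(byte[] compressed, int limit) {
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[8192];
      int remaining = limit;
      int n;
      // Never request more than the remaining budget, so memory use stays bounded.
      while (remaining > 0
          && (n = in.read(buf, 0, Math.min(buf.length, remaining))) != -1) {
        out.write(buf, 0, n);
        remaining -= n;
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] bomb = gzip(new byte[16 * 1024 * 1024]);      // 16 MiB of zeros, compresses very well
    byte[] decoded = decodeWithLimit(bomb, 1024 * 1024); // assumed 1 MiB content limit
    System.out.println("compressed=" + bomb.length + " decoded=" + decoded.length);
  }
}
```

   The decoded output never grows beyond the limit, no matter how small the 
compressed input is. Whether each protocol plugin actually behaves this way 
for every encoding is exactly what would need to be tested.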
   
   >> 5. Resource line. Where is the line between an in-model parser-DoS on 
crafted content and an out-of-model "expensive legitimate crawl"? (→ §8)
   
   > This is operator dependent and would need to be assessed on a 
crawl-by-crawl basis.
   
   +1
   
   >> Wave 3 — meta: 7. Canonicalization. This in-repo THREAT_MODEL.md is 
drafted as a superset of the website security-model page. Proposed: SECURITY.md 
points here for the full model, and the website page stays the operator-facing 
how-to. Agree, or should the website page remain canonical with this as a 
supplement? (→ meta)
   
   > Agree, we can do that easily.
   
   +1
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
