sebastian-nagel commented on PR #922: URL: https://github.com/apache/nutch/pull/922#issuecomment-4669330287
> Is there attention to be paid to [§8.1](https://github.com/potiuk/nutch/blob/asf-security/threat-model-2026-06-06/THREAT_MODEL.md#8-security-properties-the-project-provides) the `fetcher.parse` and `fetcher.store.content` properties (although mitigated by default) can also result in corrupted segments in error scenarios.

- `fetcher.parse`: The fetcher uses the ParseUtil class to let the parser run in an ExecutorService with a default timeout of 30 seconds. In this regard, there is no difference from parsing the segment's content in a separate job. Of course, a parsing fetcher map task may require more resources (mostly RAM, less CPU) than a parser map task because it does both fetching and parsing, and also because of its multi-threaded implementation.
- `fetcher.store.content`: By default, `http.content.limit` (resp. `ftp.content.limit`) restricts the size of stored content to 1 MiB per document. To run a separate parse job, content needs to be stored in the segment. Truncated content, i.e. content exceeding the content limit, is not parsed by default (property `parser.skip.truncated`).

In case of an unhandled error, the fetch or parse task should fail and not produce any final output. That is, the segment is incomplete (no fetch or parse data), but not "corrupted".

Generally, we might add notes about resource requirements and limitations:

1. Nutch's Fetcher is multi-threaded in nature. It buffers the fetched content before it's spilled to disk, caches robots.txt rules, and optionally parses the fetched documents. The Fetcher requires sufficient RAM to hold data temporarily in memory and sufficient CPU to process the content in parallel threads. The resource requirements depend on the configuration and the number of assigned fetcher threads.
2. Nutch's processing "pipeline" to fetch, parse and index web content is designed to process reasonable-size web documents. By default, the size of a single document (HTML, PDF, or any other document format) is limited to 1 MiB.
   Nutch can process larger documents if sufficient RAM is configured. However, there are limitations:
   - Nutch buffers documents using a Java byte array, which is limited to about 2 GiB. This is a hard limitation.
   - Nutch cannot process "indefinite" content streams (internet radio or TV). In addition to the content limit, there are strict timeouts to download a single document.
   - Usually, it's already troublesome to make Nutch process documents exceeding 100 MiB. In other words, Nutch isn't the ideal tool to collect, e.g., very large PDF files.

Where should we add these limitations? They also describe what Nutch cannot be used for.

> What about parse-zip?

It's rarely used, and a simple plugin, parsing zip files recursively. Likely, it's not safe. Again, the 30 second parse timeout protects it from the worst outcome. But eventually we need to drop it, fix it, or mark it as unsafe, that is, to be used only for trusted content.

Btw., zip bombs also threaten the protocol plugins via `Content-Encoding` and `Transfer-Encoding`, at least for the encodings `deflate`, `gzip` and `br`. Since the content limit is applied to the decoded/uncompressed content, the protocol plugins should be safe. But we should eventually test all protocol plugins.

>> 5. Resource line. Where is the line between an in-model parser-DoS on crafted content and an out-of-model "expensive legitimate crawl"? (→ §8)
> This is operator dependent and would need to be assessed on a crawl-by-crawl basis.

+1

>> Wave 3 — meta: 7. Canonicalization. This in-repo THREAT_MODEL.md is drafted as a superset of the website security-model page. Proposed: SECURITY.md points here for the full model, and the website page stays the operator-facing how-to. Agree, or should the website page remain canonical with this as a supplement? (→ meta)
> Agree, we can do that easily.

+1

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
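
As an illustration of the parse timeout discussed in the comment above (a parser run in an ExecutorService and abandoned after a fixed time), here is a minimal Java sketch. It is not Nutch's ParseUtil code: the class `ParseTimeoutSketch`, the method `runWithTimeout`, and the fallback value are hypothetical names chosen for this example, and the timeouts are shortened so the demo finishes quickly.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not taken from Nutch: bound the run time of a parse task.
final class ParseTimeoutSketch {

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis.
     * Returns the fallback value if the task times out or throws.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        // Daemon thread, so a parser that ignores interruption cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the document counts as failed
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that finishes well within its time budget.
        String fast = runWithTimeout(() -> "parsed", 5000, "timed out");
        // A task that sleeps far longer than its 100 ms budget.
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000);
            return "parsed";
        }, 100, "timed out");
        System.out.println(fast + " / " + slow);
    }
}
```

Note that cancelling a `Future` only interrupts the worker thread. A parser stuck in a loop that never checks the interrupt flag keeps consuming CPU until the process exits, which is one reason a timeout limits the damage from hostile input but does not make an unsafe parser safe.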

