Re: [PR] Add in-repo threat model and point AGENTS.md/SECURITY.md at it [nutch]

via GitHub Wed, 10 Jun 2026 03:50:30 -0700


sebastian-nagel commented on PR #922:
URL: https://github.com/apache/nutch/pull/922#issuecomment-4669330287

> Is there attention to be paid to
[§8.1](https://github.com/potiuk/nutch/blob/asf-security/threat-model-2026-06-06/THREAT_MODEL.md#8-security-properties-the-project-provides)
the `fetcher.parse` and `fetcher.store.content` properties (although mitigated
by default) can also result in corrupted segments in error scenarios.

- `fetcher.parse`: The fetcher uses the ParseUtil class to run the parser
in an ExecutorService with a default timeout of 30 seconds. In this regard,
there is no difference from parsing the segment's content in a separate job. Of
course, a parsing fetcher map task may require more resources (mostly RAM, less
CPU) than a parser map task because it does both fetching and parsing, and also
because of its multi-threaded implementation.
- `fetcher.store.content`: By default, `http.content.limit` (resp.
`ftp.content.limit`) restricts the size of stored content to 1 MiB per
document. To run a separate parse job, content needs to be stored in the
segment. Truncated content, that is, content exceeding the content limit, is
not parsed by default (property `parser.skip.truncated`).

In case of an unhandled error, the fetch or parse task should fail, and not
produce any final output. That is, the segment is incomplete (no fetch or parse
data), but not "corrupted".
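
As a rough illustration of the timeout mechanism described for `fetcher.parse`
above, here is a minimal sketch of running a parse task in an ExecutorService
and giving up after a deadline. This is not Nutch's actual ParseUtil code: the
class `ParseTimeoutSketch` and the method `parseWithTimeout` are hypothetical
names, and the parse result is simplified to a `String`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch, not the real ParseUtil implementation. */
class ParseTimeoutSketch {

  /**
   * Runs the parse task on a worker thread and waits at most the given
   * timeout. Returns null if the task timed out or threw an exception.
   */
  static String parseWithTimeout(Callable<String> parseTask, long timeout,
      TimeUnit unit) throws InterruptedException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<String> future = executor.submit(parseTask);
      try {
        return future.get(timeout, unit);
      } catch (TimeoutException e) {
        // Interrupts the worker thread. A parser that ignores interrupts
        // keeps consuming CPU and RAM until it finishes on its own.
        future.cancel(true);
        return null;
      } catch (ExecutionException e) {
        // The parser itself failed.
        return null;
      }
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    String fast = parseWithTimeout(() -> "parsed", 1, TimeUnit.SECONDS);
    String slow = parseWithTimeout(() -> {
      Thread.sleep(5_000); // stands in for a parser stuck on crafted content
      return "too late";
    }, 100, TimeUnit.MILLISECONDS);
    System.out.println(fast);
    System.out.println(slow);
  }
}
```

Note that a timeout of this kind bounds how long the caller waits, not
necessarily how long the parser thread runs or how much memory it allocates
before the deadline.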

Generally, we might add notes about resource requirements and limitations:

1. Nutch's Fetcher is multi-threaded in nature. It buffers the fetched
content before it's spilled to disk, caches robots.txt rules, and optionally
parses the fetched documents. The Fetcher requires sufficient RAM to hold data
temporarily in memory and sufficient CPU to process the content in parallel
threads. The resource requirements depend on the configuration and the number
of assigned fetcher threads.

2. Nutch's processing "pipeline" to fetch, parse and index web content is
designed to process reasonable-size web documents. By default, the size of a
single document (HTML, PDF, or any other document format) is limited to 1 MiB.
Nutch can process larger documents if sufficient RAM is configured. However,
there are limitations:
- Nutch buffers documents using a Java byte array which is limited to
about 2 GiB. This is a hard limitation.
- Nutch cannot process "indefinite" content streams (internet radio or
TV). In addition to the content limit, there are strict timeouts to download a
single document.
- In practice, it's already troublesome to get Nutch to process documents
exceeding 100 MiB. In other words, Nutch isn't the ideal tool to collect, e.g.,
very large PDF files.
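
The 2 GiB figure follows from Java arrays being indexed by `int`, so a single
`byte[]` holds at most `Integer.MAX_VALUE` (2^31 - 1) bytes. A minimal sketch
of that arithmetic, with hypothetical names (`ByteArrayLimitSketch`,
`fitsInByteArray`) that are not part of Nutch:

```java
/** Hypothetical sketch of the byte-array size limit, not Nutch code. */
class ByteArrayLimitSketch {

  /** Upper bound on the length of a Java array, expressed in GiB. */
  static double maxArrayGiB() {
    return Integer.MAX_VALUE / (1024.0 * 1024.0 * 1024.0);
  }

  /**
   * True if content of the given length could, in principle, be held in a
   * single byte[]. Some JVMs cap arrays slightly below Integer.MAX_VALUE,
   * and the heap must of course be large enough as well.
   */
  static boolean fitsInByteArray(long contentLength) {
    return contentLength >= 0 && contentLength <= Integer.MAX_VALUE;
  }

  public static void main(String[] args) {
    System.out.println("max array size in GiB: " + maxArrayGiB());
    System.out.println("1 MiB fits: " + fitsInByteArray(1L << 20));
    System.out.println("2 GiB fits: " + fitsInByteArray(1L << 31));
  }
}
```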

Where should we add these limitations? They also describe what Nutch
cannot be used for.

> What about parse-zip?

It's a simple, rarely used plugin that parses zip files recursively.
It's likely not safe. Again, the 30-second parse timeout protects it from the
worst outcome. But eventually we need to drop it, fix it, or mark it as
unsafe, that is, to be used only with trusted content.

Btw., zip bombs also threaten the protocol plugins via `Content-Encoding`
and `Transfer-Encoding`, at least for the encodings `deflate`, `gzip` and
`br`. Since the content limit is applied to the decoded/uncompressed content,
the protocol plugins should be safe. But we should eventually test all protocol
plugins.
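
To illustrate why applying the limit to the decoded content matters, here is a
minimal sketch of gzip decoding that stops once a byte limit is reached, so a
small compressed payload cannot expand without bound. It only uses
`java.util.zip` from the JDK. The class `BoundedGzipSketch` and its methods are
hypothetical and do not reflect how any particular Nutch protocol plugin is
implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch, not code from a Nutch protocol plugin. */
class BoundedGzipSketch {

  /** Decodes gzip data but keeps at most maxBytes of decoded output. */
  static byte[] decodeWithLimit(byte[] compressed, int maxBytes)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192];
    try (InputStream in =
        new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      int remaining = maxBytes;
      while (remaining > 0) {
        // Never request more than the remaining budget of decoded bytes.
        int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
        if (n < 0) {
          break; // end of the compressed stream
        }
        out.write(buffer, 0, n);
        remaining -= n;
      }
    }
    return out.toByteArray();
  }

  /** Helper for the demo: gzip-compresses the given bytes. */
  static byte[] gzip(byte[] raw) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
      gz.write(raw);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    int limit = 1024 * 1024; // 1 MiB, matching the default content limit
    byte[] bomb = gzip(new byte[16 * limit]); // 16 MiB of zeros
    byte[] decoded = decodeWithLimit(bomb, limit);
    System.out.println("compressed bytes: " + bomb.length);
    System.out.println("decoded bytes kept: " + decoded.length);
  }
}
```

A test along these lines, run against each protocol plugin with `deflate`,
`gzip` and `br` payloads, would be one way to verify the assumption above.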

>> 5. Resource line. Where is the line between an in-model parser-DoS on
crafted content and an out-of-model "expensive legitimate crawl"? (→ §8)

> This is operator dependent and would need to be assessed on a
crawl-by-crawl basis.

>> Wave 3 — meta: 7. Canonicalization. This in-repo THREAT_MODEL.md is
drafted as a superset of the website security-model page. Proposed: SECURITY.md
points here for the full model, and the website page stays the operator-facing
how-to. Agree, or should the website page remain canonical with this as a
supplement? (→ meta)

> Agree, we can do that easily.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

