lewismc commented on code in PR #922:
URL: https://github.com/apache/nutch/pull/922#discussion_r3368251330


##########
THREAT_MODEL.md:
##########
@@ -0,0 +1,327 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Apache Nutch — Security Threat Model
+
+## §1 Header
+
+- **Project:** Apache Nutch (core crawler — `apache/nutch`)
+- **Modeled against:** `master` HEAD as of 2026-06-06.
+- **Authors:** ASF Security team (v0 draft, generated via the
+  `threat-model-producer` rubric at the Nutch PMC's request), for the PMC to review.
+- **Status:** **DRAFT v0 — draft-first, not yet maintainer-ratified.** Built as a
+  superset of the project's existing website security model; the sections that
+  were not covered there are *(inferred)* and must be confirmed (see §14).
+- **Version binding:** versioned with the project; a report against release *N*
+  is triaged against the model as it stood at *N*.
+- **Reporting cross-reference:** findings that violate a §8 property should be
+  reported privately per the project's disclosure channel (`[email protected]`);
+  findings under §3 or §9 are closed citing this document.
+- **Provenance legend:** *(documented)* = stated in Nutch's docs/website security
+  model; *(maintainer)* = confirmed by a Nutch PMC member; *(inferred)* = reasoned
+  from docs/code/domain knowledge, not yet confirmed (each has a §14 question).
+- **Draft confidence:** ~10 documented / 0 maintainer / ~22 inferred.
+- **Relationship to the website model:** this document is a strict superset of
+  the security model published at
+  <https://nutch.apache.org/documentation/security/#security-model>; nothing that
+  page asserts is dropped or weakened here. Where this draft adds a section the
+  page did not cover (adversary model, enumerated properties, known non-findings,
+  triage dispositions), it is tagged *(inferred)* pending PMC confirmation.
+
+**What Nutch is.** Apache Nutch is an extensible, Hadoop-based open-source web
+crawler: an operator runs it (in local or distributed/batch mode, or via the
+`nutch-server` REST control API) to fetch, parse, and index web content at
+scale through a plugin architecture (protocol / parse / index / scoring
+plugins; parsing largely via Apache Tika). The defining security fact is that
+**Nutch fetches and parses content from the open, untrusted web by design** —
+the crawled bytes are attacker-controllable input, and the threat model is about
+robustly handling that input and about which controls are the operator's job,
+not about preventing Nutch from reaching or parsing hostile content.
+
+## §2 Scope and intended use
+
+- **Primary intended use** *(documented)*: an operator-deployed crawler run
+  **in a trusted environment**, fetching web content into a crawl store and
+  handing parsed content to an indexer. The website model states Nutch is
+  "designed to operate in trusted environments, either locally or on a Hadoop
+  cluster."
+- **Deployment shape** *(documented/inferred)*: batch crawl jobs (local or on
+  Hadoop), plus an optional `nutch-server` REST API for orchestration.
+- **Caller roles:**
+  - **operator / admin** — trusted; owns seeds, URL filters, plugin config, the
+    Hadoop/host environment, and the `nutch-server` endpoint.
+  - **crawled-content supplier** — the untrusted web: every fetched page,
+    redirect, `robots.txt`, sitemap, feed, and embedded resource is
+    attacker-controllable input. *(inferred — the core in-model adversary.)*
+  - **REST API client** — in the trusted environment per the website model; the
+    legacy REST API provided no authentication. *(documented)*
+
+**Component-family table** *(inferred — confirm in §14)*:
+
+| Family | Entry point | Touches outside process? | In model? |
+| --- | --- | --- | --- |
+| Fetcher / protocol plugins | `protocol-http(client)`, etc. | network (arbitrary URLs) | **yes** — consumes untrusted content |
+| Parser plugins (Tika, HTML, feed, etc.) | `parse-*` | CPU/memory on untrusted bytes | **yes** |
+| URL filtering / normalization | `urlfilter-*`, `urlnormalizer-*` (regex) | — | **yes** (scoping boundary) |
+| Crawl store / DB | local FS / HDFS | disk | **yes** |
+| Indexer plugins | `indexer-solr`, `-elastic`, … | network (backend) | **boundary** — backend is the operator's |
+| `nutch-server` REST API | HTTP control endpoint | network | **yes** |
+| Bundled/contrib plugins, examples | various | varies | **per-plugin** — confirm supported set (§14) |
+
+## §3 Out of scope (explicit non-goals)
+
+- **Defending a Nutch deployment exposed outside a trusted environment** (e.g.
+  an internet-reachable `nutch-server` with no fronting auth). The website model
+  scopes Nutch to trusted environments. *(documented)*
+- **Preventing Nutch from fetching arbitrary or internal URLs.** Reaching URLs
+  is the crawler's function; restricting *which* URLs is the operator's job via
+  URL filters / seed scoping (so crawler "SSRF" is operator-config, see §9/§11a).
+  *(inferred)*
+- **The security of indexer/storage backends** (Solr/Elasticsearch/HDFS) and the
+  Hadoop cluster — Nutch writes to them; it does not own their security.
+  *(inferred)*
+- **Contrib / unsupported plugins / examples** — threat-modeled separately;
+  confirm the supported plugin set in §14. *(inferred)*
+
+## §4 Trust boundaries and data flow
+
+Two boundaries matter, and they are different from a typical service:
+1. **The fetch boundary (primary).** Everything Nutch fetches from the web is
+   untrusted and crosses into the parser/store. The trust transition is
+   "bytes-from-the-internet → parsed structures." *(inferred)*
+2. **The operator/config boundary.** Seeds, URL filters, plugin selection, and
+   the `nutch-server` endpoint are operator-controlled and trusted. *(inferred)*
+
+Data flow: seed URLs (operator) → fetch (untrusted content) → parse (Tika/plugins
+on untrusted bytes) → URL extraction + filtering → crawl DB → index to backend.
+
+**Reachability preconditions per family:**
+- A fetcher/parser finding is in-model when it is reachable **from
+  attacker-controlled fetched content** (a hostile page/feed/redirect) — that is
+  the core in-model surface.
+- A `nutch-server` REST finding is in-model only if it is reachable by a network
+  client the deployment was not supposed to expose (and the website model assumes
+  a trusted environment).
+- A finding that requires control of seeds / URL filters / plugin config is
+  **out of model** (operator-trusted input).
+
+## §5 Assumptions about the environment
+
+- **Trusted operator environment** *(documented)*: Nutch runs where only trusted
+  operators reach its control surfaces; HTTPS / network isolation for any exposed
+  endpoint is the operator's responsibility.
+- **Operator-controlled config** *(inferred)*: seeds, `regex-urlfilter`,
+  normalizers, plugin set, and backend credentials are trusted inputs.
+- **Backends provisioned by the operator** *(inferred)*: Solr/Elastic/HDFS.
+- **What Nutch does to its host** *(inferred — §14)*: opens outbound network
+  connections to arbitrary fetched hosts; reads/writes the crawl store; runs
+  parser libraries over untrusted bytes; with `nutch-server`, opens a listening
+  HTTP port. Not expected to run as root.
+
+## §5a Build-time and configuration variants
+
+Config knobs that move the security envelope *(inferred — confirm in §14)*:
+
+| Knob | Default | Effect | Maintainer stance |
+| --- | --- | --- | --- |
+| `nutch-server` REST API | not started unless launched; **no auth** | An exposed control endpoint with no auth on an untrusted network = unauthenticated control | **?** trusted-env-only posture — §14.1 |

Review Comment:
   Aye, I think we should relax that. I'll investigate. I also want to relax the SonarCloud >80% test coverage requirement per PR; that is unrealistic. I'll address both.
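
   For triage, the reachability preconditions in §4 of the quoted draft can be read as a small decision rule. The following is a minimal Python sketch of that reading and is not part of the PR. The `Finding` class, its field names, and the component labels are hypothetical, and checking the operator-config condition first is an assumption about precedence that the draft does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical record describing a reported issue (names are illustrative)."""
    component: str  # e.g. "fetcher", "parser", "nutch-server"
    reachable_from_fetched_content: bool = False
    reachable_by_unintended_network_client: bool = False
    requires_operator_config_control: bool = False


def in_model(finding: Finding) -> bool:
    """Apply the three §4 reachability preconditions to a finding."""
    # Assumption: control of seeds / URL filters / plugin config is
    # operator-trusted input, so it takes a finding out of model first.
    if finding.requires_operator_config_control:
        return False
    # Fetcher/parser findings count when hostile fetched content can reach them.
    if finding.component in ("fetcher", "parser"):
        return finding.reachable_from_fetched_content
    # REST findings count only when reachable by a client the deployment
    # was not supposed to expose the endpoint to.
    if finding.component == "nutch-server":
        return finding.reachable_by_unintended_network_client
    # Other families have no §4 precondition in the quoted hunk.
    return False
```

   Under this reading, a parser crash triggered by a hostile page is in model, while the same crash triggered only by a malicious plugin configuration is not.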



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
