[
https://issues.apache.org/jira/browse/HADOOP-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091845#comment-18091845
]
ASF GitHub Bot commented on HADOOP-19925:
-----------------------------------------
steveloughran commented on code in PR #8562:
URL: https://github.com/apache/hadoop/pull/8562#discussion_r3481388512
##########
SECURITY.md:
##########
@@ -0,0 +1,566 @@
+SPDX-License-Identifier: Apache-2.0
+
+# Apache Hadoop Security Model
+
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
+NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
+"OPTIONAL" in this document are to be interpreted as described in
+RFC 2119.
+
+This document defines the security model of Apache Hadoop: the deployments it is
+designed to protect, the boundaries it defends, and — equally importantly — the
+things which are *not* vulnerabilities. It exists for human reporters and for
+anyone using automated or AI-assisted tooling to look for security issues.
+
+**TL;DR: Hadoop's security model defends a Kerberos-secured cluster running on a
+trusted operating system, behind a network perimeter, with a valid site
+configuration. Findings which only apply outside that model are bugs, not
+vulnerabilities.**
+
+## Before Filing a Report (Including AI-Assisted Reports)
+
+The deployment Hadoop's security model defends is a **Kerberos-secured cluster**.
+Many findings that look like vulnerabilities in other contexts are not
+vulnerabilities here, because the surrounding deployment is trusted by design.
+
+You *MUST NOT* file a security report for:
+
+- Issues that require the operator to edit their own Hadoop site configuration,
+ place malicious files on their own classpath, or pass malicious arguments to
+ their own command invocation.
+- **Job submission running user-supplied code.** Submitting work to YARN or
+ MapReduce executes the submitter's code as the submitter's identity. That is
+ the product, not a vulnerability. See the threat model below.
+- **Denial of service at scale.** A large Hadoop cluster exists to execute jobs
+ at scale; such a cluster can itself be used to mount distributed attacks, and
+ authenticated users can exhaust resources. Resource exhaustion and performance
+ degradation from legitimate authenticated use are out of scope.
+- Issues that require the attacker to already hold cluster or remote-store
+ credentials, a valid Kerberos principal, or local disk access.
+- Anything against the **default insecure (non-Kerberos) mode**: it is insecure
+ by design (see the deployment model below).
+- **Transitive CVEs** in dependencies Hadoop builds or ships against. See
+ [Third Party Modules](#third-party-modules).
+- Raw **scanner output** (Snyk, Dependabot, Trivy, Zizmor, etc.) without a
+ reproducer against the current `trunk` branch.
+- Theoretical findings ("an attacker who could X might then Y") without a
+ reproduction.
+
+
+A valid report includes:
+
+- The Hadoop version, and ideally the git SHA it was reproduced against.
+- The exact steps, configuration, and commands used to reproduce it.
+- The observed in-scope failure, and what was expected instead.
+- Where a CVE/CVSS score is claimed, the reasoning behind that score.
+
+### For Partly/Fully AI-Generated Reports
+
+AI-assisted reports are accepted **only** if the submitter has verified the
+finding by hand against current source and includes a runnable reproducer.
+
+In addition, the submitter of an AI-generated report:
+
+1. Is REQUIRED to understand what Hadoop is, to understand the claimed vulnerability,
+and to be able to explain it in their own words, including justifying any claimed CVE or CVSS
+scores. If the submitter is unable to do this, then any credit for a resulting
+CVE will be assigned to the AI tool alone, and not to the submitter.
+
+2. MUST declare the AI tool used, and provide the prompt.
+ The prompt is a key part of AI tool reports, and we need to be able to track/replicate these.
+
+*Unverified LLM-generated reports waste maintainer time and will be closed
+without further response.*
+
+
+## Reporting a Vulnerability
+
+Report security vulnerabilities in Apache Hadoop privately to
+**[email protected]**.
+
+* Do not cc: any public mailing list.
+* Do **not** open a public JIRA issue, GitHub
+issue, or pull request for an unfixed vulnerability.
+
+For vulnerabilities in CI pipelines, see
+[Reporting Vulnerabilities in CI Pipelines](#reporting-vulnerabilities-in-ci-pipelines).
+
+See the Apache Software Foundation's
+[guidelines for reporting security issues](https://www.apache.org/security/) for
+the responsible-disclosure process that applies to all ASF projects.
+
+## Third Party Modules
+
+### Reporting a Known CVE in a Hadoop Dependency
+
+Do not report the existence of a published CVE in a Hadoop dependency
+to the security list. Such CVEs are already public and do not need to be treated as
+confidential.
+
+Dependency upgrades that address them are considered improvements to the project, and are managed in
+the project's [issue tracker](https://issues.apache.org/jira/issues/).
+1. Search for any existing issue covering the dependency upgrade.
+2. If it exists, read it, its discussion, the PRs etc, and see what versions
+ it has been merged to.
+3. If it hasn't been merged, look at why and get involved: major work is likely to be
+ needed.
+4. If there isn't an issue, create one and start work on the PR!
+
+Tip: an easy way to check which version of a library ships in the trunk
+release of Hadoop is the [LICENSE-binary](./LICENSE-binary) file.
+
+Please do not send an email listing the CVEs an automated scan
+tool reported and requesting updates, timelines etc.
+Open source development is a community process, and addressing this is done
+in the [developer mailing lists](https://hadoop.apache.org/mailing_lists.html).
+Join the community to help get your needs addressed.
+
+### Providing Advance Warning of a Critical CVE in a Hadoop Dependency
+
+If a team providing a library which Hadoop bundles has a critical CVE which
+a forthcoming fix will correct, they are encouraged to notify the Hadoop security
+list so we can co-ordinate releases.
+
+We treat all such reports as confidential.
+
+### Reporting a Newly-Discovered Vulnerability in a Third-Party Module
+
+Security bugs in third-party modules (the JVM, the Kerberos infrastructure, cloud
+SDKs, connectors, or any other dependency) should be reported to their respective
+maintainers, through their own security-reporting mechanisms, after verifying
+that the issue is in scope of *their* threat model and reproduces against *their*
+current release.
+
+## Supported Versions
+
+Security fixes are made only to the most recent Apache Hadoop release line(s).
+Older release lines are end-of-life and do not receive security updates; the
+remedy for a vulnerability in an old line is to upgrade. Refer to the
+[Apache Hadoop release and download policy](https://hadoop.apache.org/releases.html)
+for which lines are currently maintained. A report MUST be reproducible against a
+maintained release or the current `trunk` branch.
+
+## The Hadoop Threat Model
+
+In the Hadoop threat model there are **trusted elements**. Vulnerabilities that
+require the compromise of these trusted elements are outside the scope of the
+model:
+
+- **Cluster Administrators are trusted.**
+- **DNS is trusted.**
+- **The Kerberos authentication infrastructure is trusted.** Active Directory,
+ FreeIPA, or whichever other Key Distribution Center (KDC) is in use is trusted
+ and required to be well-configured, including synchronized clocks (NTP/chrony)
+ across the KDC, services, and clients, within the Kerberos clock-skew window.
+ Authentication failures caused by clock drift are operational bugs, not
+ vulnerabilities.
+- **The network perimeter is trusted to keep the public internet out, but the
+ wire is not assumed confidential.** The perimeter does not authenticate callers;
+ Kerberos authentication does that at the service level. The perimeter's job is to
+ keep the cluster off the public internet (Hadoop clusters are never web-facing).
+ Within that, Hadoop may run with optional wire encryption (RPC `privacy` QOP;
+ HDFS block-transfer encryption). Running without encryption is by design and not
+ a vulnerability; but when encryption is enabled, a failure to actually protect
+ traffic (no-op encryption, silent downgrade, or MITM bypass) is in scope.
+- **Any hosting cloud or infrastructure provider is trusted, as is the
+ underlying hardware.** This includes the CPU, memory, storage, and network
+ hardware, even on shared/multi-tenant cloud systems where that hardware is
+ physically shared with other tenants. Attacks that require malicious or
+ compromised hardware, hypervisor escape, or cross-tenant side channels
+ (speculative-execution, Rowhammer, and similar) are the responsibility of the
+ hardware and infrastructure provider, and are out of scope.
+- **The underlying operating system is trusted.** Hadoop relies on OS process
+ isolation, file permissions, and (where required) OS-level disk encryption.
+ An attack that first requires the OS to be compromised or misconfigured is out
+ of scope.
+- **Valid site configuration is trusted.** We expect `core-site.xml`,
+ `hdfs-site.xml`, `yarn-site.xml` and the rest of the site configuration to be
+ valid and to be writable only by trusted administrators. If an attacker can
+ manipulate the site configuration, the game is already over — that is out of
+ scope.
+
+Within that model, the boundary Hadoop **defends** is **privilege escalation
+across an authenticated boundary within a Kerberos-secured cluster**.
+Examples of in-scope issues are:
+
+- A user acting as another user, as a service, or as a superuser without the
+ authorization to do so.
+- Bypassing service-level authorization / ACLs
+ (see [Service Level Authorization](hadoop-common-project/hadoop-common/src/site/markdown/ServiceLevelAuth.md)).
+- Forging, leaking, or improperly reusing delegation tokens.
+- Defeating the constraints on proxy/superuser impersonation
+ (see [Proxy user - Superusers Acting On Behalf Of Other Users](hadoop-common-project/hadoop-common/src/site/markdown/Superusers.md)).
+
+Further properties of the model:
+
+- **Hadoop clusters are never web-facing.** They are deployed behind a network
+ perimeter; network rules are expected to keep the cluster off the public
+ internet. The perimeter does not authenticate callers — Kerberos does that at
+ the service level. A report which assumes a cluster is directly exposed to the
+ public internet is not in scope.
+- **Wire encryption is optional and controlled by site configuration.** Network
+ traffic between Hadoop components may or may not be encrypted, depending on the
+ deployment's configuration. The absence of wire encryption when it has not been
+ enabled is not a vulnerability.
+
+Relevant operational security documentation:
+
+- [Hadoop in Secure Mode](hadoop-common-project/hadoop-common/src/site/markdown/SecureMode.md)
+- [Service Level Authorization](hadoop-common-project/hadoop-common/src/site/markdown/ServiceLevelAuth.md)
+- [Authentication for Hadoop HTTP web-consoles](hadoop-common-project/hadoop-common/src/site/markdown/HttpAuthentication.md)
+- [Proxy user - Superusers](hadoop-common-project/hadoop-common/src/site/markdown/Superusers.md)
+- [Credential Provider API](hadoop-common-project/hadoop-common/src/site/markdown/CredentialProviderAPI.md)
+- [YARN Application Security](hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md)
+- [Transparent Encryption in HDFS](hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/TransparentEncryption.md)
+
+## Deployment Threat Model
+
+Hadoop is deployed in a number of ways, with different security boundaries.
+
+### Standalone (Insecure) Mode
+
+In its standalone configuration Hadoop performs no real authentication. Anyone with
+network access to the cluster has full access to its data and can submit work.
+
+This mode is *intended* to run only on a trusted, network-isolated host or
+network. It is insecure **by design**. "The unsecured cluster has no security" is
+not a vulnerability, and arbitrary data access against a non-Kerberos cluster is
+inherent in security being disabled.
+
+It should only be used for standalone development/test environments, with firewalls preventing remote access.
+It can then be used to test Hadoop and applications running on top of it.
+
+
+### Secure (Kerberos) Clusters
+
+This is the deployment the security model defends, as described in
+[The Hadoop threat model](#the-hadoop-threat-model) above: Kerberos
+authentication, service-level authorization, delegation tokens, and constrained
+proxy/superuser impersonation.
+
+It is the expected deployment of production physical clusters.
+1. A trusted Kerberos system is used to authenticate principals.
+2. Services have been issued with credentials (keytabs) in files, secured on the physical hosts via OS file permissions.
+3. Users of the cluster all authenticate with the Kerberos system for their access.
+4. Access to the cluster may be via a proxy mechanism.
+5. HDFS uses Kerberos to authenticate HDFS nodes and services themselves, other cluster services (YARN, Apache ZooKeeper etc) and callers.
+6. HDFS block tokens are issued by the HDFS NameNode to grant data access to authenticated principals;
+ the possessor of a token may access a block of data on a DataNode with the permissions in that token,
+ without the need to supply any further authentication information.
+
+Hadoop services issue _delegation tokens_: an authenticated principal obtains a token directly from a service such as HDFS, Apache HBase, Apache Hive, Apache Knox and more.
+YARN distributes these tokens to an application's containers and renews them on the application's behalf, so tasks can authenticate to those services without holding Kerberos credentials themselves.
+These tokens have a lifecycle independent of the Kerberos credentials:
+* They have a limited lifespan of a number of hours.
+* They can be cancelled: the issuing service MUST then reject requests using them as authentication.
+* They can be renewed: before their lifespan expires the renewer requests the issuing service to extend their lifespan.
+
+The details of these tokens, and of how issuing, cancellation and renewal are managed, are not covered in this document.
+Hadoop and applications MUST safely marshal and store these tokens; if they are published in any form then permissions are being leaked.
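
The token lifecycle described above (limited lifespan, cancellation, renewal by a designated renewer) can be sketched with a toy in-memory model. This is an illustration only and not Hadoop's token implementation: `TokenIssuer`, its method names and the one-hour lifetime are all hypothetical.

```python
import secrets


class TokenIssuer:
    """Toy in-memory token issuer; hypothetical, not Hadoop's API."""

    def __init__(self, lifetime_s):
        self.lifetime_s = lifetime_s  # validity granted per issue or renewal
        self._tokens = {}             # token id -> (owner, renewer, expiry time)

    def issue(self, owner, renewer, now):
        """Issue a token to an already-authenticated owner."""
        token_id = secrets.token_hex(16)
        self._tokens[token_id] = (owner, renewer, now + self.lifetime_s)
        return token_id

    def is_valid(self, token_id, now):
        """A token is valid only if it is known and has not expired."""
        entry = self._tokens.get(token_id)
        return entry is not None and now < entry[2]

    def renew(self, token_id, caller, now):
        """Extend the lifespan; only the designated renewer may do this."""
        if not self.is_valid(token_id, now):
            raise PermissionError("token expired, cancelled or unknown")
        owner, renewer, _ = self._tokens[token_id]
        if caller != renewer:
            raise PermissionError("caller is not the designated renewer")
        expiry = now + self.lifetime_s
        self._tokens[token_id] = (owner, renewer, expiry)
        return expiry

    def cancel(self, token_id):
        """After cancellation the token must be rejected."""
        self._tokens.pop(token_id, None)


if __name__ == "__main__":
    issuer = TokenIssuer(lifetime_s=3600)
    token = issuer.issue("alice", "yarn", now=0)
    issuer.renew(token, "yarn", now=1800)
    issuer.cancel(token)
    print(issuer.is_valid(token, now=1801))
```

Note that the token id itself is the credential in this model, which is why publishing it in any form leaks the permissions it carries.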
+
+### Transient Cloud Deployments
+
+Hadoop is frequently deployed as a transient cluster in a cloud environment:
+
+- Cloud credentials are supplied to the deployment by the hosting infrastructure,
+ for example AWS IAM roles attached to the VMs/containers, or equivalent
+ mechanisms on other clouds. **These supplied credentials, and the access they
+ grant, are the trust boundary.** Using credentials provided to the VM or
+ container the code runs in is not a vulnerability.
+- The cluster is **transient** and typically single-tenant: it is created for a
+ workload and destroyed afterwards.
+- **Network rules prevent access by untrusted principals.** As with on-premises
+ clusters, the deployment is not web-facing; the network perimeter is part of
+ the model.
+
+Hadoop clusters MUST NOT be deployed in the cloud without network rules to isolate them from the public internet.
+
+
+## Data at Rest and Temporary Files
+
+- **Persisting data in encrypted form requires HDFS encryption** (see
+ [Transparent Encryption in HDFS](hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/TransparentEncryption.md)).
+ Where encryption has been configured, a failure of the code to actually
+ encrypt the persisted data **is a vulnerability** and should be reported.
+- **Temporary data** is written to local-filesystem temporary directories. The
+ requirement is that the operating system secures and, where required, encrypts
+ these directories; this is part of the trusted-OS assumption. Within that:
+ - Code that creates temporary files and directories **MUST create them and set
+ their permissions atomically.** Creating a file or directory with permissive
+ defaults and then narrowing the permissions in a later step leaves a window
+ in which another local principal can act on it.
+ - A failure to create files/directories and set their permissions atomically
+ **is an issue** and should be reported.
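
The difference between atomic and two-step creation can be illustrated in a few lines. This is a minimal POSIX-only sketch in Python rather than Hadoop's Java code; `create_private_file` is a hypothetical helper, not an existing Hadoop or library function.

```python
import os
import stat
import tempfile


def create_private_file(directory, name):
    """Create `name` in `directory` with owner-only permissions in one step."""
    path = os.path.join(directory, name)
    # O_EXCL makes the call fail if the path already exists (including a
    # planted symlink); the mode is applied by the same open(2) call that
    # creates the file, so there is no window with wider permissions.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    return path, fd


if __name__ == "__main__":
    # mkdtemp creates the directory accessible only by the creating user.
    scratch = tempfile.mkdtemp()
    path, fd = create_private_file(scratch, "spill.tmp")
    os.write(fd, b"intermediate data")
    os.close(fd)
    print(oct(stat.S_IMODE(os.stat(path).st_mode)))
```

The pattern to avoid is `open(path, "w")` followed by `os.chmod(path, 0o600)`: between the two calls the file exists with whatever permissions the umask allowed.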
+
+## Secrets and Logging
+
+Leaking secrets into logs is [CWE-532: Insertion of Sensitive Information into
+Log File](https://cwe.mitre.org/data/definitions/532.html). The following rules
+apply to Hadoop code:
+
+- Secrets *SHOULD NOT* be logged.
+- **Persistent secrets, long-lived credentials, and encryption secrets (keys,
+ key material, passwords) *MUST NOT* be logged at any level.**
+- **Transient secrets** (for example short-lived tokens) *MUST NOT* be logged at
+ `INFO`, `WARN`, or `ERROR` level, and *SHOULD NOT* be logged at `DEBUG` or
+ `TRACE` level.
+
+Transient secrets are called out specifically because secrets sometimes surface in HTTP/web
+request logs (URLs, headers, query parameters) and are visible
+when third-party components, including JDK classes, are configured to log at TRACE.
+Preventing logging of these is best-effort.
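
One best-effort backstop is a logging filter that redacts credential-like query parameters before a record is emitted. The sketch below uses Python's standard `logging` module purely for illustration; the parameter-name pattern and the `redact` / `RedactingFilter` names are assumptions, and a filter like this supplements, rather than replaces, not logging secrets in the first place.

```python
import logging
import re

# Hypothetical pattern: query parameters whose names end in a
# credential-like word. A real deployment would need its own list.
_SECRET_PARAM = re.compile(
    r"(?i)([\w.-]*(?:token|password|secret|signature))=([^&\s]+)"
)


def redact(text):
    """Replace the value of credential-like query parameters."""
    return _SECRET_PARAM.sub(lambda m: m.group(1) + "=REDACTED", text)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message."""

    def filter(self, record):
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already formatted
        return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("example")
    log.addFilter(RedactingFilter())
    log.debug("GET %s", "/data?op=OPEN&token=abc123&user=bob")
```

Attaching the filter to a handler rather than a single logger would also cover records from third-party loggers that propagate to that handler.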
+
+
+## Development and CI Threat Model
+
+The project is built on developer systems and in CI systems, and **we do care
+about attacks on these.** Development and CI are explicitly in scope.
+
+See [Important Security Information for GitHub Actions](.github/workflow-security.md)
+for the detailed CI/workflow security guidance. In summary:
+
+- All inputs from external pull requests (titles, comments, author metadata, and
+ code) *SHALL* be considered untrusted, and *MUST NOT* be fed directly or
+ indirectly to shell commands without sanitization.
+- Upstream dependencies from non-ASF projects *MAY* be subverted by supply-chain
+ attacks; a cooldown period *MUST* be observed before adopting a new or
+ updated dependency.
+- ASF projects are considered trusted, as their manual release-vote process
+ provides an implicit buffer against package-ecosystem worms, and there's an implicit
+ level of interdependent trust between projects.
+- Maven plugins, and third-party libraries that production code compiles against,
+ execute code on developer and CI systems (the latter during testing). Their
+ security *MUST* be evaluated before adoption.
+- IDE trust mechanisms are sensitive: for example a
+ [VS Code trusted workspace](https://code.visualstudio.com/docs/editing/workspaces/workspace-trust)
+ allows files in the tree (`.env`, `tasks.json`, etc.) to declare executables.
+ PRs that add such code-execution mechanisms *SHALL* be rejected.
+- **CI build output is publicly visible.** Unobfuscated logging of any cloud
+ credentials or other secrets provided to CI runs is therefore in scope.
+- GitHub Actions hardening:
+ - Actions *MUST* be pinned by commit SHA (with the version as a comment so
+ Dependabot can track them), not by tag.
Review Comment:
using the sha lets dependabot manage the version updating; been doing this
with cloudstore and it works well
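
A check along these lines could flag workflow steps that are not pinned by commit SHA. It is a minimal sketch, not part of the PR: `unpinned_actions` is a hypothetical helper, it scans `uses:` lines with a regular expression instead of parsing the YAML, and the action names and SHA in the usage example are placeholders.

```python
import re

# Matches a step's `uses:` key, with or without the leading list dash.
_USES = re.compile(r"^\s*-?\s*uses:\s*(\S+)")
# A reference pinned to a full 40-character commit SHA.
_PINNED = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text):
    """Return `uses:` references that are not pinned to a commit SHA."""
    found = []
    for line in workflow_text.splitlines():
        code = line.split("#", 1)[0]  # drop the trailing version comment
        match = _USES.match(code)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith("./"):
            continue  # local actions live in the repository itself
        if not _PINNED.search(ref):
            found.append(ref)
    return found


if __name__ == "__main__":
    workflow = """
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-java@0123456789abcdef0123456789abcdef01234567 # v4.2.1
"""
    print(unpinned_actions(workflow))
```

The version comment after the SHA is ignored by the check but kept in the workflow file, since that is what lets Dependabot track the version.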
> Create a SECURITY.md file to define the security model for the AI tools
> -----------------------------------------------------------------------
>
> Key: HADOOP-19925
> URL: https://issues.apache.org/jira/browse/HADOOP-19925
> Project: Hadoop Common
> Issue Type: Improvement
> Components: security
> Affects Versions: 3.6.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Labels: pull-request-available
>
> Write a SECURITY.md file to scope AI generated security reports to sensible
> deployments, and also for humans. Base off best work of other projects.
> - explain deployments and their security boundaries (dev, kerberos, isolated
> cloud)
> - only accept security issues against kerberos
> - anything which doesn't lead to privilege escalation is a bug
> - anything which hurts perf is just a bug
> - we expect site config to be valid. If that can be manipulated, game over.
> - job submission is remote code execution so no, you don't get a CVE for that
> I will include dev and CI as targets of attacks and that we do care here.