[ 
https://issues.apache.org/jira/browse/HADOOP-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091793#comment-18091793
 ] 

ASF GitHub Bot commented on HADOOP-19925:
-----------------------------------------

pan3793 commented on code in PR #8562:
URL: https://github.com/apache/hadoop/pull/8562#discussion_r3480360377


##########
SECURITY.md:
##########
@@ -0,0 +1,566 @@
+SPDX-License-Identifier: Apache-2.0
+
+# Apache Hadoop Security Model
+
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
+NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and
+"OPTIONAL" in this document are to be interpreted as described in
+RFC 2119.
+
+This document defines the security model of Apache Hadoop: the deployments it is
+designed to protect, the boundaries it defends, and — equally importantly — the
+things which are *not* vulnerabilities. It exists for human reporters and for
+anyone using automated or AI-assisted tooling to look for security issues.
+
+**TL;DR: Hadoop's security model defends a Kerberos-secured cluster running on a
+trusted operating system, behind a network perimeter, with a valid site
+configuration. Findings which only apply outside that model are bugs, not
+vulnerabilities.**
+
+## Before Filing a Report (Including AI-Assisted Reports)
+
+The deployment Hadoop's security model defends is a **Kerberos-secured cluster**.
+Many findings that look like vulnerabilities in other contexts are not
+vulnerabilities here, because the surrounding deployment is trusted by design.
+
+You *MUST NOT* file a security report for:
+
+- Issues that require the operator to edit their own Hadoop site configuration,
+  place malicious files on their own classpath, or pass malicious arguments to
+  their own command invocation.
+- **Job submission running user-supplied code.** Submitting work to YARN or
+  MapReduce executes the submitter's code as the submitter's identity. That is
+  the product, not a vulnerability. See the threat model below.
+- **Denial of service at scale.** A large Hadoop cluster exists to execute jobs
+  at scale; such a cluster can itself be used to mount distributed attacks, and
+  authenticated users can exhaust resources. Resource exhaustion and performance
+  degradation from legitimate authenticated use are out of scope.
+- Issues that require the attacker to already hold cluster or remote-store
+  credentials, a valid Kerberos principal, or local disk access.
+- Anything against the **default insecure (non-Kerberos) mode** — it is insecure
+  by design (see the deployment model below).
+- **Transitive CVEs** in dependencies Hadoop builds or ships against. See
+  [Third Party Modules](#third-party-modules).
+- Raw **scanner output** (Snyk, Dependabot, Trivy, Zizmor, etc.) without a
+  reproducer against the current `trunk` branch.
+- Theoretical findings ("an attacker who could X might then Y") without a
+  reproduction.
+
+
+A valid report includes:
+
+- The Hadoop version, and ideally the git SHA it was reproduced against.
+- The exact steps, configuration, and commands used to reproduce it.
+- The observed in-scope failure, and what was expected instead.
+- Where a CVE/CVSS score is claimed, the reasoning behind that score.
+
+### For Partly/Fully AI-Generated Reports
+
+AI-assisted reports are accepted **only** if the submitter has verified the
+finding by hand against current source and includes a runnable reproducer.
+
+In addition, the submitter of an AI-generated report is
+
+1. REQUIRED to understand what Hadoop is, to understand the claimed vulnerability,
+and to be able to explain it in their own words — including justifying any claimed CVE or CVSS
+scores. If the submitter is unable to do this, then any credit for a resulting
+CVE will be assigned to the AI tool alone, and not to the submitter.
+
+2. MUST declare the AI tool used, and provide the prompt.
+   The prompt is a key part of AI tool reports, and we need to be able to track/replicate these.
+
+*Unverified LLM-generated reports waste maintainer time and will be closed
+without further response.*
+
+
+## Reporting a Vulnerability
+
+Report security vulnerabilities in Apache Hadoop privately to
+**[email protected]**.
+
+* Do not cc: any public mailing list.
+* Do **not** open a public JIRA issue, GitHub
+issue, or pull request for an unfixed vulnerability.
+
+For vulnerabilities in CI pipelines, see
+[Reporting Vulnerabilities in CI Pipelines](#reporting-vulnerabilities-in-ci-pipelines).
+
+See the Apache Software Foundation's
+[guidelines for reporting security issues](https://www.apache.org/security/) for
+the responsible-disclosure process that applies to all ASF projects.
+
+## Third Party Modules
+
+### Reporting a Known CVE in a Hadoop Dependency
+
+Do not report the existence of a published CVE in a Hadoop dependency
+to the security list. These are published and do not need to be treated as
+confidential.
+
+These are considered improvements in the project, and are managed in
+the project's [issue tracker](https://issues.apache.org/jira/issues/).
+1. Search for any existing issue covering the dependency upgrade.
+2. If it exists, read it, its discussion, the PRs etc, and see what versions
+   it has been merged to.
+3. If it hasn't been merged, look at why and get involved: major work is likely to be
+   needed.
+4. If there isn't an issue, create one and start work on the PR!
+
+Tip: an easy way to check for the version of a library to ship in the trunk
+release of hadoop is the [LICENSE-binary](./LICENSE-binary) file.
+
+Please do not send an email listing the CVEs an automated scan
+tool reported and requesting updates, timelines etc.
+Open source development is a community process, and addressing this is done
+in the [developer mailing lists](https://hadoop.apache.org/mailing_lists.html).
+Join the community to help get your needs addressed.
+
+### Providing Advance Warning of a Critical CVE in a Hadoop Dependency
+
+If a team providing a library which Hadoop bundles has a critical CVE which
+a forthcoming fix will correct, they are encouraged to notify the hadoop security
+list so we can co-ordinate releases.
+
+We treat all such reports as confidential.
+
+### Reporting a Newly-Discovered Vulnerability in a Third-Party Module
+
+Security bugs in third-party modules (the JVM, the Kerberos infrastructure, cloud
+SDKs, connectors, or any other dependency) should be reported to their respective
+maintainers, through their own security-reporting mechanisms — after verifying
+the issue is in scope of *their* threat model and reproduces against *their*
+current release.
+
+## Supported Versions
+
+Security fixes are made only to the most recent Apache Hadoop release line(s).
+Older release lines are end-of-life and do not receive security updates; the
+remedy for a vulnerability in an old line is to upgrade. Refer to the
+[Apache Hadoop release and download policy](https://hadoop.apache.org/releases.html)
+for which lines are currently maintained. A report MUST be reproducible against a
+maintained release or the current `trunk` branch.
+
+## The Hadoop Threat Model
+
+In the Hadoop threat model there are **trusted elements**. Vulnerabilities that
+require the compromise of these trusted elements are outside the scope of the
+model:
+
+- **Cluster Administrators are trusted.**
+- **DNS is trusted.**
+- **The Kerberos authentication infrastructure is trusted.** Active Directory,
+  FreeIPA, or whichever other Key Distribution Center (KDC) is in use is trusted
+  and required to be well-configured — including synchronized clocks (NTP/chrony)
+  across the KDC, services, and clients, within the Kerberos clock-skew window.
+  Authentication failures caused by clock drift are operational bugs, not
+  vulnerabilities.
+- **The network perimeter is trusted to keep the public internet out, but the
+  wire is not assumed confidential.** The perimeter does not authenticate callers —
+  Kerberos authentication does that at the service level; the perimeter's job is to
+  keep the cluster off the public internet (Hadoop clusters are never web-facing).
+  Within that, Hadoop may run with optional wire encryption (RPC `privacy` QOP;
+  HDFS block-transfer encryption). Running without encryption is by design and not
+  a vulnerability; but when encryption is enabled, a failure to actually protect
+  traffic — no-op encryption, silent downgrade, or MITM bypass — is in scope.
+- **Any hosting cloud or infrastructure provider is trusted, as is the
+  underlying hardware.** This includes the CPU, memory, storage, and network
+  hardware, even on shared/multi-tenant cloud systems where that hardware is
+  physically shared with other tenants. Attacks that require malicious or
+  compromised hardware, hypervisor escape, or cross-tenant side channels
+  (speculative-execution, Rowhammer, and similar) are the responsibility of the
+  hardware and infrastructure provider, and are out of scope.
+- **The underlying operating system is trusted.** Hadoop relies on OS process
+  isolation, file permissions, and (where required) OS-level disk encryption.
+  An attack that first requires the OS to be compromised or misconfigured is out
+  of scope.
+- **Valid site configuration is trusted.** We expect `core-site.xml`,
+  `hdfs-site.xml`, `yarn-site.xml` and the rest of the site configuration to be
+  valid and to be writable only by trusted administrators. If an attacker can
+  manipulate the site configuration, the game is already over — that is out of
+  scope.
+
+Within that model, the boundary Hadoop **defends** is **privilege escalation
+across an authenticated boundary within a Kerberos-secured cluster**.
+Examples of in-scope issues are:
+
+- A user acting as another user, as a service, or as a superuser without the
+  authorization to do so.
+- Bypassing service-level authorization / ACLs
+  (see [Service Level Authorization](hadoop-common-project/hadoop-common/src/site/markdown/ServiceLevelAuth.md)).
+- Forging, leaking, or improperly reusing delegation tokens.
+- Defeating the constraints on proxy/superuser impersonation
+  (see [Proxy user - Superusers Acting On Behalf Of Other Users](hadoop-common-project/hadoop-common/src/site/markdown/Superusers.md)).
+
+Further properties of the model:
+
+- **Hadoop clusters are never web-facing.** They are deployed behind a network
+  perimeter; network rules are expected to keep the cluster off the public
+  internet. The perimeter does not authenticate callers — Kerberos does that at
+  the service level. A report which assumes a cluster is directly exposed to the
+  public internet is not in scope.
+- **Wire encryption is optional and controlled by site configuration.** Network
+  traffic between Hadoop components may or may not be encrypted, depending on the
+  deployment's configuration. The absence of wire encryption when it has not been
+  enabled is not a vulnerability.
+
+Relevant operational security documentation:
+
+- [Hadoop in Secure Mode](hadoop-common-project/hadoop-common/src/site/markdown/SecureMode.md)
+- [Service Level Authorization](hadoop-common-project/hadoop-common/src/site/markdown/ServiceLevelAuth.md)
+- [Authentication for Hadoop HTTP web-consoles](hadoop-common-project/hadoop-common/src/site/markdown/HttpAuthentication.md)
+- [Proxy user - Superusers](hadoop-common-project/hadoop-common/src/site/markdown/Superusers.md)
+- [Credential Provider API](hadoop-common-project/hadoop-common/src/site/markdown/CredentialProviderAPI.md)
+- [YARN Application Security](hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md)
+- [Transparent Encryption in HDFS](hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/TransparentEncryption.md)
+
+## Deployment Threat Model
+
+Hadoop is deployed in a number of ways, with different security boundaries.
+
+### Standalone (Insecure) Mode
+
+In its standalone configuration Hadoop performs no real authentication. Anyone with
+network access to the cluster has full access to its data and can submit work.
+
+This mode is *intended* to run only on a trusted, network-isolated host or
+network. It is insecure **by design**. "The unsecured cluster has no security" is
+not a vulnerability, and arbitrary data access against a non-Kerberos cluster is
+inherent in security being disabled.
+
+It should only be used for standalone development/test environments, with firewalls preventing remote access.
+It can then be used to test Hadoop and applications running on top of it.
+
+
+### Secure (Kerberos) Clusters
+
+This is the deployment the security model defends, as described in
+[The Hadoop threat model](#the-hadoop-threat-model) above: Kerberos
+authentication, service-level authorization, delegation tokens, and constrained
+proxy/superuser impersonation.
+
+It is the expected deployment of production physical clusters.
+1. A trusted Kerberos system is used to authenticate principals.
+2. Services have been issued with credentials (keytabs) in files, secured on the physical hosts via OS file permissions.
+3. Users of the cluster all authenticate with the Kerberos system for their access.
+4. Access to the cluster may be via a proxy mechanism.
+5. The HDFS filesystem uses Kerberos to authenticate HDFS nodes and services themselves, other cluster services (YARN, Apache ZooKeeper etc) and callers.
+6. HDFS block tokens are issued by the HDFS Name Node to grant data access to authenticated principals;
+   the possessor of a token may access a block of data on a data node with the permissions in that token,
+   without the need to supply any further authentication information.
+
+Hadoop services issue _delegation tokens_: an authenticated principal obtains a token directly from a service such as HDFS, Apache HBase, Apache Hive, Apache Knox and more.
+YARN distributes these tokens to an application's containers and renews them on the application's behalf, so tasks can authenticate to those services without holding Kerberos credentials themselves.
+These tokens have an independent life from the Kerberos credentials
+* They have a limited lifespan of a number of hours.
+* They can be cancelled: the issuing service MUST then reject requests using them as authentication.
+* They can be renewed: before their lifespan expires the renewer requests the issuing service to extend their lifespan.
+
+The details of these tokens or how issuing, cancellation and renewal are managed are not covered in this document.
+Hadoop and applications MUST safely marshall and store these tokens; if they are published in any form then permissions are being leaked.
+
+### Transient Cloud Deployments
+
+Hadoop is frequently deployed as a transient cluster in a cloud environment:
+
+- Cloud credentials are supplied to the deployment by the hosting infrastructure
+  — for example AWS IAM roles attached to the VMs/containers, or equivalent
+  mechanisms on other clouds. **These supplied credentials, and the access they
+  grant, are the trust boundary.** Using credentials provided to the VM or
+  container the code runs in is not a vulnerability.
+- The cluster is **transient** and typically single-tenant: it is created for a
+  workload and destroyed afterwards.
+- **Network rules prevent access by untrusted principals.** As with on-premises
+  clusters, the deployment is not web-facing; the network perimeter is part of
+  the model.
+
+Hadoop clusters MUST NOT be deployed in cloud without network rules to isolate them from the public internet.
+
+
+## Data at Rest and Temporary Files
+
+- **Persisting data encrypted requires HDFS encryption** (see
+  [Transparent Encryption in HDFS](hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/TransparentEncryption.md)).
+  Where encryption has been configured, a failure of the code to actually
+  encrypt the persisted data **is a vulnerability** and should be reported.
+- **Temporary data** is written to local-filesystem temporary directories. The
+  requirement is that the operating system secures and, where required, encrypts
+  these directories — this is part of the trusted-OS assumption. Within that:
+  - Code that creates temporary files and directories **MUST create them and set
+    their permissions atomically.** Creating a file or directory with permissive
+    defaults and then narrowing the permissions in a later step leaves a window
+    in which another local principal can act on it.
+  - A failure to create files/directories and set their permissions atomically
+    **is an issue** and should be reported.
+
+## Secrets and Logging
+
+Leaking secrets into logs is [CWE-532: Insertion of Sensitive Information into
+Log File](https://cwe.mitre.org/data/definitions/532.html). The following rules
+apply to Hadoop code:
+
+- Secrets *SHOULD NOT* be logged.
+- **Persistent secrets, long-lived credentials, and encryption secrets (keys,
+  key material, passwords) *MUST NOT* be logged at any level.**
+- **Transient secrets** (for example short-lived tokens) *MUST NOT* be logged at
+  `INFO`, `WARN`, or `ERROR` level, and *SHOULD NOT* be logged at `DEBUG` or
+  `TRACE` level.
+
+Transient secrets are called out specifically because secrets sometimes surface in HTTP/web
+request logs (URLs, headers, query parameters) and are visible
+when third-party components including JDK classes are configured to log at TRACE.
+Preventing logging of these is best-effort.
+
+
+## Development and CI Threat Model
+
+The project is built on developer systems and in CI systems, and **we do care
+about attacks on these.** Development and CI are explicitly in scope.
+
+See [Important Security Information for GitHub Actions](.github/workflow-security.md)
+for the detailed CI/workflow security guidance. In summary:
+
+- All inputs from external pull requests — titles, comments, author metadata, and
+  code — *SHALL* be considered untrusted, and *MUST NOT* be fed directly or
+  indirectly to shell commands without sanitization.
+- Upstream dependencies from non-ASF projects *MAY* be subverted by supply-chain
+  attacks; a cooldown period *MUST* be observed before adopting a new or
+  updated dependency.
+- ASF projects are considered trusted, as their manual release-vote process
+  provides an implicit buffer against package-ecosystem worms, and there's an implicit
+  level of interdependent trust between projects.
+- Maven plugins, and third-party libraries that production code compiles against,
+  execute code on developer and CI systems (the latter during testing). Their
+  security *MUST* be evaluated before adoption.
+- IDE trust mechanisms are sensitive: for example a
+  [VS Code trusted workspace](https://code.visualstudio.com/docs/editing/workspaces/workspace-trust)
+  allows files in the tree (`.env`, `tasks.json`, etc.) to declare executables.
+  PRs that add such code-execution mechanisms *SHALL* be rejected.
+- **CI build output is publicly visible.** Unobfuscated logging of any cloud
+  credentials or other secrets provided to CI runs is therefore in scope.
+- GitHub Actions hardening:
+  - Actions *MUST* be pinned by commit SHA (with the version as a comment so
+    Dependabot can track them), not by tag.

Review Comment:
   Is this overly strict? Pinning by tag should be fine, since we are only allowed to use approved Actions: https://github.com/apache/infrastructure-actions
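
For reference, the pinning rule under discussion could be checked mechanically. The sketch below is illustrative only: `is_sha_pinned` is a hypothetical helper, not part of Hadoop or of apache/infrastructure-actions. It assumes references of the form `owner/repo[/path]@ref`, treats only a full 40-character hexadecimal ref as SHA-pinned, and does not handle local (`./`) or `docker://` references.

```python
import re

# Assumption: a reference is "pinned by commit SHA" only if the ref after "@"
# is a full 40-character lowercase hexadecimal commit SHA.
_SHA_PINNED = re.compile(r"^[\w.-]+/[\w./-]+@[0-9a-f]{40}$")


def is_sha_pinned(uses: str) -> bool:
    """Return True if a workflow `uses:` value is pinned to a full commit SHA."""
    # Drop a trailing "# v4"-style version comment before matching.
    ref = uses.split("#", 1)[0].strip()
    return bool(_SHA_PINNED.match(ref))
```

With this helper, `is_sha_pinned("actions/checkout@v4")` is false, while a value such as `actions/checkout@0123456789abcdef0123456789abcdef01234567 # v4` (a made-up placeholder SHA, not a real commit) is accepted. Whether such a check is needed on top of the approved-Actions allow list is the open question in this review comment.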





> Create a SECURITY.md file to define the security model for the AI tools
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-19925
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19925
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: security
>    Affects Versions: 3.6.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>
> Write a SECURITY.md file to scope AI generated security reports to sensible 
> deployments, and also for humans. Base off best work of other projects.
> - explain deployments and their security boundaries (dev, kerberos, isolated 
> cloud)
> - only accept security issues against kerberos
> - anything which doesn't lead to privilege escalation is a bug
> - anything which hurts perf is just a bug
> - we expect site config to be valid. If that can be manipulated, game over.
> - job submission is remote code execution so no, you don't get a CVE for that
> I will include dev and CI as targets of attacks and that we do care here.


