(spark) branch branch-4.0 updated: [SPARK-56998] Add SECURITY.md + AGENTS.md Security section for scan-agent discoverability

gurwls223 Fri, 22 May 2026 13:31:49 -0700

This is an automated email from the ASF dual-hosted git repository.

HyukjinKwon pushed a commit to branch branch-4.0
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-4.0 by this push:
     new c22058115253 [SPARK-56998] Add SECURITY.md + AGENTS.md Security section for scan-agent discoverability
c22058115253 is described below

commit c220581152536939d6de7ba818077aa995629d17
Author: Jarek Potiuk <[email protected]>
AuthorDate: Fri May 22 13:25:05 2026 -0700

    [SPARK-56998] Add SECURITY.md + AGENTS.md Security section for scan-agent discoverability
    
    **This is a proposal for the PMC to review — please correct, reject, or 
discuss as needed.** Nothing here is a requirement; the maintainer is the 
decision-maker.
    
    This adds a `SECURITY.md` to the repo root and a `Security` section to the 
existing `AGENTS.md` so an automated scan agent can mechanically discover the 
project's security model via the conventional `AGENTS.md → SECURITY.md → model 
URL` chain. The chain terminates at the existing 
<https://spark.apache.org/docs/latest/security.html> page — nothing about the 
model content itself changes.
    
    Context: the ASF Security team is preparing the project for an automated 
agentic security scan we're piloting. Such scans refuse to run if the model 
isn't discoverable by that path (refusing upfront beats wasting PMC reviewer 
cycles on a noise-heavy run against an unknown model). Discoverability is the 
one hard gate; everything else is suggestion. The Security team has reached out 
separately on the PMC's private list with the program details; this PR is the 
public-facing repo piece.
    
    The Security team uses [`threat-model-producer`](https://gist.github.com/potiuk/da14a826283038ddfe38cc9fe6310573) as the rubric for what a complete model looks like — but this PR is just the *link*; the existing `security.html` content is accepted as the model.
    
    After this lands on `master`, the same two files would need to be on 
`branch-3.5` for the second scan target — happy to open a cherry-pick PR for 
that, or leave it to the PMC.
    
    Questions / pushback welcome. Happy to adjust the wording or move the 
section if the project has a house style.
    
    Closes #55933 from potiuk/asf-security/discoverability-2026-05-18.
    
    Lead-authored-by: Jarek Potiuk <[email protected]>
    Co-authored-by: Xiao Li <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    (cherry picked from commit 411dedc1aa98a5a1aeacfd9b27487583f18eaab1)
    Signed-off-by: Hyukjin Kwon <[email protected]>
---
 AGENTS.md   | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 SECURITY.md |  13 +++++
 2 files changed, 176 insertions(+)

diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 000000000000..c37d8a130421
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,163 @@
+# Apache Spark
+
+## Pre-flight Checks
+
+Before the first code edit or running test in a session, ensure a clean 
working environment. DO NOT skip these checks:
+
+1. Run `git remote -v` to identify the personal fork and upstream (`apache/spark`). If unclear, ask the user to configure their remotes following the standard convention (`origin` for the fork, `upstream` for `apache/spark`).
+2. If the latest commit on `<upstream>/master` is more than a day old (check with `git log -1 --format="%ci" <upstream>/master`), run `git fetch <upstream> master`.
+3. If there are uncommitted changes (check with `git status`), ask the user to stash them before proceeding.
+4. Switch to the appropriate branch:
+   - **Existing PR**: resolve the PR branch name via `gh api repos/apache/spark/pulls/<number> --jq '.head.ref'`, then look for a local branch matching that name. If found, switch to it and inform the user. If not found, ask whether to fetch it or if there is a local branch under a different name.
+   - **New edits**: ask the user to choose: create a new git worktree from `<upstream>/master` and work from there (recommended), or create and switch to a new branch from `<upstream>/master`.
+   - **Running tests**: use `<upstream>/master`.
+
+## Development Notes
+
+SQL golden file tests are managed by `SQLQueryTestSuite` and its variants. 
Read the class documentation before running or updating these tests. DO NOT 
edit the generated golden files (`.sql.out`) directly. Always regenerate them 
when needed, and carefully review the diff to make sure it's expected.
+
+Spark Connect protocol is defined in proto files under 
`sql/connect/common/src/main/protobuf/`. Read the README there before modifying 
proto definitions.
+
+Avoid introducing non-ASCII characters in code or comments. String literals 
may contain non-ASCII when the content requires it (error messages, test data, 
etc.). Identifiers are ASCII by convention. The common failure mode is 
typographic characters (em-dash, smart quotes, ellipsis, non-breaking space) 
sneaking into comments; scalastyle flags some of these. Spot-check before 
committing: `grep -rn -P "[^\x00-\x7F]" <files>`.
+
+## Scala Test Base Classes
+
+When writing a new Scala test suite, pick the lowest base class that provides 
what the test actually needs. Spark uses the `AnyFunSuite` ScalaTest style 
throughout, so the bases below are the chain to choose from. Each adds 
capability on top of the previous:
+
+    SparkFunSuite                                                           (core)
+      <- PlanTest                                                           (sql/catalyst)
+        <- QueryTest                                                        (sql/core)
+
+| Test scope | Base | Notes |
+|------------|------|-------|
+| Plain JVM/Scala — no Spark SQL | `SparkFunSuite` | `core` utilities, RDD, network, util classes, etc. Adds per-test timeout, `testRetry`, `gridTest`, thread audit, fixed timezone/locale, `withTempDir`, `withLogAppender`, `checkError`. |
+| Catalyst plan tests — no `SparkSession` | `PlanTest` | Adds `comparePlans`, `normalizePlan`, `normalizeExprIds`. For analyzer / optimizer / planner rule tests. |
+| SQL/DataFrame tests — needs a `SparkSession` | `QueryTest` | Adds `checkAnswer`, codegen-on/off helpers. `spark: SparkSession` is abstract and must be supplied by a session-providing trait (see below). |
+
+### Providing a `SparkSession` for `QueryTest`
+
+`QueryTest` declares `spark: SparkSession` abstractly via 
`SparkSessionProvider`, so it cannot be instantiated on its own. A concrete 
suite mixes in one of the session-providing traits below:
+
+    QueryTest                                                               (abstract `spark`)
+      + SharedSparkSession (sql/core)        -> classic in-process `TestSparkSession`
+      + TestHiveSingleton  (sql/hive)        -> Hive-backed `TestHive` session
+
+| Session provider | Module / location | Typical usage |
+|---|---|---|
+| `SharedSparkSession` | `sql/core` | Already extends `QueryTest` for historical reasons, but still mix in `QueryTest` explicitly, e.g. `class X extends QueryTest with SharedSparkSession`. Default for tests under `sql/core`. |
+| `TestHiveSingleton` | `sql/hive` | Mixed in alongside `QueryTest`, e.g. `class X extends QueryTest with TestHiveSingleton`. Used by tests under `sql/hive`. |
+
+## Build and Test
+
+Build and tests can take a long time. If the user explicitly asked to run 
tests, run them. Otherwise (you are running tests on your own to verify a 
change), first ask the user if they have more changes to make.
+
+Prefer SBT over Maven for faster incremental compilation. Module names are 
defined in `project/SparkBuild.scala`.
+
+Compile a single module:
+
+    build/sbt <module>/compile
+
+Compile test code for a single module:
+
+    build/sbt <module>/Test/compile
+
+Run test suites by wildcard or full class name:
+
+    build/sbt '<module>/testOnly *MySuite'
+    build/sbt '<module>/testOnly org.apache.spark.sql.MySuite'
+
+Run test cases matching a substring:
+
+    build/sbt '<module>/testOnly *MySuite -- -z "test name"'
+
+For faster iteration, keep SBT open in interactive mode:
+
+    build/sbt
+    > project <module>
+    > testOnly *MySuite
+
+### PySpark Tests
+
+PySpark tests require building Spark with Hive support first:
+
+    build/sbt -Phive package
+
+Activate the virtual environment specified by the user, or default to `.venv`:
+
+    source <venv>/bin/activate
+
+If the default venv does not exist, create it:
+
+    python3 -m venv .venv
+    source .venv/bin/activate
+    pip install -r dev/requirements.txt
+
+Run a single test suite:
+
+    python/run-tests --testnames pyspark.sql.tests.arrow.test_arrow
+
+Run a single test case:
+
+    python/run-tests --testnames "pyspark.sql.tests.test_catalog CatalogTests.test_current_database"
+
+## Investigating PR CI Failures
+
+Do NOT download full job logs to grep for errors — they are very large and 
slow. Instead, use the test report annotations on the fork.
+
+Step 1 — Get the fork owner and the latest commit SHA of the PR:
+
+    gh api repos/apache/spark/pulls/<PR_NUMBER> --jq '{owner: .head.repo.owner.login, sha: .head.sha}'
+
+Step 2 — Find the "Report test results" check run on the fork's commit:
+
+    gh api repos/<OWNER>/spark/commits/<SHA>/check-runs \
+      --jq '.check_runs[] | select(.name == "Report test results") | {id: .id, annotations: .output.annotations_count}'
+
+Step 3 — Fetch failure annotations:
+
+    gh api repos/<OWNER>/spark/check-runs/<CHECK_RUN_ID>/annotations
+
+Each annotation contains the test class, test name, and failure message.
+
+## Pull Request Workflow
+
+PR title format is `[SPARK-xxxx][COMPONENT] Title`. The component tag is 
derived from the JIRA component name: take the last word and uppercase it (e.g. 
`Project Infra` → `[INFRA]`, `Spark Core` → `[CORE]`, `Structured Streaming` → 
`[STREAMING]`, `SQL` → `[SQL]`).
+
+Infer the PR title from the changes. If no ticket ID is given, create one 
using `dev/create_spark_jira.py`, using the PR title (without the JIRA ID and 
component tag) as the ticket title.
+
+    python3 dev/create_spark_jira.py "<title>" -c <component> { -t <type> | -p <parent-jira-id> }
+
+- **Component** (`-c`): the exact JIRA component name (not the PR title shorthand), e.g. "SQL", "Spark Core", "PySpark", "Connect". Run `python3 dev/create_spark_jira.py --list-components` for the full list.
+- **Issue type** (`-t`): "Bug", "Improvement", "New Feature", "Test", "Documentation", or "Dependency upgrade".
+- **Parent** (`-p`): if the user mentions a parent JIRA ticket (e.g., "this is a subtask of SPARK-12345"), pass it instead of `-t`. The issue type is automatically "Sub-task".
+
+The script sets the latest unreleased version as the default affected version.
+
+After creating a JIRA ticket, print a prominent notice so the user does not 
miss it:
+
+    ============================================================
+    JIRA ticket created: SPARK-XXXXX
+    https://issues.apache.org/jira/browse/SPARK-XXXXX
+
+    Title:              <title>
+    Component(s):       <component>
+    Issue type:         <type>
+    Affected version(s): <version>
+    Priority:           <priority>
+
+    Please review and adjust these fields if needed.
+    ============================================================
+
+Before writing the PR description, read `.github/PULL_REQUEST_TEMPLATE` and 
fill in every section from that file.
+
+DO NOT push to the upstream repo. Always push to the personal fork. Open PRs 
against `master` on the upstream repo.
+
+DO NOT force push or use `--amend` on pushed commits unless the user 
explicitly asks. If the remote branch has new commits, fetch and rebase before 
pushing.
+
+Always get user approval before external operations such as pushing commits, 
creating PRs, or posting comments. Use `gh pr create` to open PRs. If `gh` is 
not installed, generate the GitHub PR URL for the user and recommend installing 
the GitHub CLI.
+
+## Security
+
+Security model: [SECURITY.md](./SECURITY.md)
+
+Agents that scan this repository should consult `SECURITY.md` for the 
project's threat model, in-scope / out-of-scope declarations, and known 
non-findings before reporting issues.
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 000000000000..26b540d81ab5
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,13 @@
+# Security policy
+
+## Reporting a vulnerability
+
+To report a security vulnerability in Apache Spark, follow the ASF security 
process at <https://www.apache.org/security/>. Reports can be sent privately to 
[email protected] or to [email protected].
+
+## Threat model
+
+Apache Spark's security guidance for operators — covering authentication, 
encryption, access control, Kerberos integration, trust boundaries, 
environmental assumptions, security properties provided and not provided, and 
downstream operator responsibilities — is documented at:
+
+<https://spark.apache.org/docs/latest/security.html>
+
+Automated security scanning agents should consult that document for the 
project's in-scope / out-of-scope declarations before reporting issues.
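
Editorial illustration, not part of the commit: the `AGENTS.md -> SECURITY.md -> model URL` chain described in the commit message could be followed mechanically along the lines of the sketch below. This is only a guess at how a scan agent might resolve the chain; the actual ASF tooling is not shown in this message. The function names (`section`, `discover_model_url`) and the `read_file` callback are hypothetical, and the sketch assumes the link sits under a `## Security` heading in `AGENTS.md` and the model URL appears as an autolink under `## Threat model` in `SECURITY.md`, as in the diff above.

```python
import re
from typing import Callable, Optional


def section(markdown: str, heading: str) -> Optional[str]:
    """Return the body of the first '## <heading>' section, or None if absent."""
    pattern = rf"^## {re.escape(heading)}[ \t]*\n(.*?)(?=^## |\Z)"
    match = re.search(pattern, markdown, re.S | re.M)
    return match.group(1) if match else None


def first_markdown_link(text: str) -> Optional[str]:
    """Return the target of the first inline link of the form [label](target)."""
    match = re.search(r"\[[^\]]*\]\(([^)\s]+)\)", text)
    return match.group(1) if match else None


def first_autolink(text: str) -> Optional[str]:
    """Return the first autolink of the form <http(s)://...>."""
    match = re.search(r"<(https?://[^>\s]+)>", text)
    return match.group(1) if match else None


def discover_model_url(
    agents_md: str, read_file: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Follow AGENTS.md -> SECURITY.md -> model URL; None if any hop is missing.

    `read_file` is a hypothetical callback mapping a repo-relative path
    (e.g. './SECURITY.md') to file contents, or None if the file is absent.
    """
    security = section(agents_md, "Security")
    if security is None:
        return None
    policy_path = first_markdown_link(security)
    if policy_path is None:
        return None
    policy = read_file(policy_path)
    if policy is None:
        return None
    # Restrict the search to the threat-model section so that the reporting
    # URL earlier in SECURITY.md is not mistaken for the model URL.
    threat_model = section(policy, "Threat model")
    if threat_model is None:
        return None
    return first_autolink(threat_model)
```

Returning `None` at the first missing hop mirrors the "refuse upfront" behaviour the commit message describes: a caller can treat `None` as "model not discoverable" and stop before scanning.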


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
