This is an automated email from the ASF dual-hosted git repository.
allisonwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new d8f2f95a2bb2 [SPARK-53666][DOCS] Add script to generate llms.txt file for Spark main website
d8f2f95a2bb2 is described below
commit d8f2f95a2bb276a51d2b4efcd5f243e9c28c7b97
Author: Allison Wang <[email protected]>
AuthorDate: Mon Nov 3 15:10:24 2025 -0800
[SPARK-53666][DOCS] Add script to generate llms.txt file for Spark main website
### What changes were proposed in this pull request?
This PR adds an initial script to generate an llms.txt file for the Spark main documentation website.
Note that the **API Docs** entries should point to their own llms.txt files once those are available (for example, https://spark.apache.org/docs/latest/api/python/llms.txt).
Here is the llms.txt file currently generated by this script:
```md
# Apache Spark
> Apache Spark™ is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
Documentation home: https://spark.apache.org/docs/latest/
## Programming Guides
- [Quick Start](https://spark.apache.org/docs/latest/quick-start.html)
- [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [Spark SQL, Datasets, and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
- [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html)
- [MLlib](https://spark.apache.org/docs/latest/ml-guide.html)
- [GraphX](https://spark.apache.org/docs/latest/graphx-programming-guide.html)
- [SparkR](https://spark.apache.org/docs/latest/sparkr.html)
- [PySpark](https://spark.apache.org/docs/latest/api/python/getting_started/index.html)
- [Spark SQL CLI](https://spark.apache.org/docs/latest/sql-distributed-sql-engine-spark-sql-cli.html)
## API Docs
- [Spark Python API](https://spark.apache.org/docs/latest/api/python/index.html)
- [Spark Scala API](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html)
- [Spark Java API](https://spark.apache.org/docs/latest/api/java/index.html)
- [Spark R API](https://spark.apache.org/docs/latest/api/R/index.html)
- [Spark SQL Built-in Functions](https://spark.apache.org/docs/latest/api/sql/index.html)
## Deployment Guides
- [Cluster Overview](https://spark.apache.org/docs/latest/cluster-overview.html)
- [Submitting Applications](https://spark.apache.org/docs/latest/submitting-applications.html)
- [Standalone Deploy Mode](https://spark.apache.org/docs/latest/spark-standalone.html)
- [YARN](https://spark.apache.org/docs/latest/running-on-yarn.html)
- [Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html)
## Other Documents
- [Configuration](https://spark.apache.org/docs/latest/configuration.html)
- [Monitoring](https://spark.apache.org/docs/latest/monitoring.html)
- [Web UI](https://spark.apache.org/docs/latest/web-ui.html)
- [Tuning Guide](https://spark.apache.org/docs/latest/tuning.html)
- [Job Scheduling](https://spark.apache.org/docs/latest/job-scheduling.html)
- [Security](https://spark.apache.org/docs/latest/security.html)
- [Hardware Provisioning](https://spark.apache.org/docs/latest/hardware-provisioning.html)
- [Cloud Infrastructures](https://spark.apache.org/docs/latest/cloud-integration.html)
- [Migration Guide](https://spark.apache.org/docs/latest/migration-guide.html)
- [Building Spark](https://spark.apache.org/docs/latest/building-spark.html)
## External Resources
- [Apache Spark Home](https://spark.apache.org/)
- [Downloads](https://spark.apache.org/downloads.html)
- [GitHub Repository](https://github.com/apache/spark)
- [Issue Tracker (JIRA)](https://issues.apache.org/jira/projects/SPARK)
- [Mailing Lists](https://spark.apache.org/mailing-lists.html)
- [Community](https://spark.apache.org/community.html)
- [Contributing](https://spark.apache.org/contributing.html)
```
### Why are the changes needed?
To improve the documentation by making it easier for LLM tools to consume.
### Does this PR introduce _any_ user-facing change?
No. This PR alone does not add the newly generated llms.txt file to the website.
### How was this patch tested?
By manually running the script locally, as sketched below.
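For reference, a local run looks roughly like this (the flags mirror the script's `argparse` defaults in the diff below; the script resolves `--docs-path` and `--output` against the repository root):
```sh
# Run from a Spark checkout; the script resolves both paths against the
# repository root (two levels above dev/create-release/).
python dev/create-release/generate-llms-txt.py \
    --docs-path docs \
    --output llms.txt \
    --version latest
```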
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #52412 from allisonwang-db/SPARK-53666-llms-txt.
Authored-by: Allison Wang <[email protected]>
Signed-off-by: Allison Wang <[email protected]>
---
dev/create-release/generate-llms-txt.py | 206 ++++++++++++++++++++++++++++++++
dev/create-release/release-build.sh | 8 ++
2 files changed, 214 insertions(+)
diff --git a/dev/create-release/generate-llms-txt.py b/dev/create-release/generate-llms-txt.py
new file mode 100755
index 000000000000..604d3f559a49
--- /dev/null
+++ b/dev/create-release/generate-llms-txt.py
@@ -0,0 +1,206 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# This script generates llms.txt file for Apache Spark documentation
+
+import sys
+import argparse
+from pathlib import Path
+
+
+def generate_llms_txt(docs_path: Path, output_path: Path, version: str = "latest") -> None:
+ """
+    Generate the llms.txt file for Apache Spark documentation with hardcoded categories.
+ """
+ content = []
+ content.append("# Apache Spark")
+ content.append("")
+    content.append(
+        "> Apache Spark™ is a unified analytics engine for large-scale data processing. "
+        "It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine "
+        "that supports general execution graphs. It also supports a rich set of higher-level "
+        "tools including Spark SQL for SQL and structured data processing, MLlib for machine "
+        "learning, GraphX for graph processing, and Structured Streaming for incremental "
+        "computation and stream processing."
+    )
+ content.append("")
+
+ doc_home_url = f"https://spark.apache.org/docs/{version}/"
+ content.append(f"Documentation home: {doc_home_url}")
+ content.append("")
+
+ content.append("## Programming Guides")
+ content.append("")
+    programming_guides = [
+        ("Quick Start", f"https://spark.apache.org/docs/{version}/quick-start.html"),
+        (
+            "RDD Programming Guide",
+            f"https://spark.apache.org/docs/{version}/rdd-programming-guide.html",
+        ),
+        (
+            "Spark SQL, Datasets, and DataFrames",
+            f"https://spark.apache.org/docs/{version}/sql-programming-guide.html",
+        ),
+        (
+            "Structured Streaming",
+            f"https://spark.apache.org/docs/{version}/structured-streaming-programming-guide.html",
+        ),
+        (
+            "Spark Streaming",
+            f"https://spark.apache.org/docs/{version}/streaming-programming-guide.html",
+        ),
+        ("MLlib", f"https://spark.apache.org/docs/{version}/ml-guide.html"),
+        ("GraphX", f"https://spark.apache.org/docs/{version}/graphx-programming-guide.html"),
+        ("SparkR", f"https://spark.apache.org/docs/{version}/sparkr.html"),
+        (
+            "PySpark",
+            f"https://spark.apache.org/docs/{version}/api/python/getting_started/index.html",
+        ),
+        (
+            "Spark SQL CLI",
+            f"https://spark.apache.org/docs/{version}/"
+            f"sql-distributed-sql-engine-spark-sql-cli.html",
+        ),
+    ]
+ for title, url in programming_guides:
+ content.append(f"- [{title}]({url})")
+ content.append("")
+
+ content.append("## API Docs")
+ content.append("")
+ # TODO: Update API docs to point to their own llms.txt files once available
+ # e.g., https://spark.apache.org/docs/{version}/api/python/llms.txt
+    api_docs = [
+        ("Spark Python API", f"https://spark.apache.org/docs/{version}/api/python/index.html"),
+        (
+            "Spark Scala API",
+            f"https://spark.apache.org/docs/{version}/api/scala/org/apache/spark/index.html",
+        ),
+        ("Spark Java API", f"https://spark.apache.org/docs/{version}/api/java/index.html"),
+        ("Spark R API", f"https://spark.apache.org/docs/{version}/api/R/index.html"),
+        (
+            "Spark SQL Built-in Functions",
+            f"https://spark.apache.org/docs/{version}/api/sql/index.html",
+        ),
+    ]
+ for title, url in api_docs:
+ content.append(f"- [{title}]({url})")
+ content.append("")
+
+ content.append("## Deployment Guides")
+ content.append("")
+    deployment_guides = [
+        ("Cluster Overview", f"https://spark.apache.org/docs/{version}/cluster-overview.html"),
+        (
+            "Submitting Applications",
+            f"https://spark.apache.org/docs/{version}/submitting-applications.html",
+        ),
+        (
+            "Standalone Deploy Mode",
+            f"https://spark.apache.org/docs/{version}/spark-standalone.html",
+        ),
+        ("YARN", f"https://spark.apache.org/docs/{version}/running-on-yarn.html"),
+        ("Kubernetes", f"https://spark.apache.org/docs/{version}/running-on-kubernetes.html"),
+    ]
+ for title, url in deployment_guides:
+ content.append(f"- [{title}]({url})")
+ content.append("")
+
+ content.append("## Other Documents")
+ content.append("")
+    other_docs = [
+        ("Configuration", f"https://spark.apache.org/docs/{version}/configuration.html"),
+        ("Monitoring", f"https://spark.apache.org/docs/{version}/monitoring.html"),
+        ("Web UI", f"https://spark.apache.org/docs/{version}/web-ui.html"),
+        ("Tuning Guide", f"https://spark.apache.org/docs/{version}/tuning.html"),
+        ("Job Scheduling", f"https://spark.apache.org/docs/{version}/job-scheduling.html"),
+        ("Security", f"https://spark.apache.org/docs/{version}/security.html"),
+        (
+            "Hardware Provisioning",
+            f"https://spark.apache.org/docs/{version}/hardware-provisioning.html",
+        ),
+        (
+            "Cloud Infrastructures",
+            f"https://spark.apache.org/docs/{version}/cloud-integration.html",
+        ),
+        ("Migration Guide", f"https://spark.apache.org/docs/{version}/migration-guide.html"),
+        ("Building Spark", f"https://spark.apache.org/docs/{version}/building-spark.html"),
+    ]
+ for title, url in other_docs:
+ content.append(f"- [{title}]({url})")
+ content.append("")
+
+ content.append("## External Resources")
+ content.append("")
+ content.append("- [Apache Spark Home](https://spark.apache.org/)")
+ content.append("- [Downloads](https://spark.apache.org/downloads.html)")
+ content.append("- [GitHub Repository](https://github.com/apache/spark)")
+ content.append("- [Issue Tracker
(JIRA)](https://issues.apache.org/jira/projects/SPARK)")
+ content.append("- [Mailing
Lists](https://spark.apache.org/mailing-lists.html)")
+ content.append("- [Community](https://spark.apache.org/community.html)")
+ content.append("-
[Contributing](https://spark.apache.org/contributing.html)")
+ content.append("")
+
+ with open(output_path, "w", encoding="utf-8") as f:
+ f.write("\n".join(content))
+
+ print(f"Generated {output_path}")
+
+    total_docs = len(programming_guides) + len(api_docs) + len(deployment_guides) + len(other_docs)
+ sections_count = 5
+
+ print(f"Total documentation pages indexed: {total_docs}")
+ print(f"Sections: {sections_count}")
+
+
+def main():
+ parser = argparse.ArgumentParser(
+ description="Generate llms.txt file for Apache Spark documentation"
+ )
+ parser.add_argument(
+ "--docs-path", type=str, default="docs", help="Path to the docs
directory (default: docs)"
+ )
+ parser.add_argument(
+ "--output", type=str, default="llms.txt", help="Output file path
(default: llms.txt)"
+ )
+ parser.add_argument(
+ "--version",
+ type=str,
+ default="latest",
+ help="Spark documentation version (default: latest)",
+ )
+
+ args = parser.parse_args()
+
+ # Convert to Path objects
+ script_dir = Path(__file__).parent
+    project_root = script_dir.parent.parent  # Go up two levels from dev/create-release/
+ docs_path = project_root / args.docs_path
+ output_path = project_root / args.output
+
+ # Check if docs directory exists
+ if not docs_path.exists():
+ print(f"Error: Documentation directory '{docs_path}' does not exist")
+ sys.exit(1)
+
+ # Generate the llms.txt file
+ generate_llms_txt(docs_path, output_path, args.version)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh
index b984876f4164..73ea87cfe6ac 100755
--- a/dev/create-release/release-build.sh
+++ b/dev/create-release/release-build.sh
@@ -833,6 +833,14 @@ if [[ "$1" == "docs" ]]; then
fi
bundle install
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" bundle exec jekyll build
+
+ # Generate llms.txt for LLM consumption
+ echo "Generating llms.txt..."
+ python "$SELF/generate-llms-txt.py" \
+ --docs-path . \
+ --output _site/llms.txt \
+ --version "$SPARK_VERSION"
+
cd ..
cd ..
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]