Thank you, Allison and Hyukjin.

IIUC, this proposal is not about a single file. The SPIP already lists multiple files, which may double the size of our documentation and website (or more in the worst case) because they simply duplicate the content. If we start using AI tools to generate these llms.txt files, they could become much bigger than the original.

*** From SPIP ***
- [PySpark (Python)](https://spark.apache.org/docs/latest/api/python/llms.txt)
- [Scala](https://spark.apache.org/docs/latest/api/scala/llms.txt)
- [4.0.0 docs hub](https://archive.apache.org/dist/spark/docs/4.0.0/llms.txt)
***

Since the Apache Spark 4.1.0-preview1 documentation is 1.2GB, could you propose a systematic way to always keep the total size of the newly added llms.txt files under 10MB, Allison? If we don't have full control over this, the duplication will break the ASF Spark website like last year. We already had to archive old Spark documents from the original website location to "https://archive.apache.org/dist/spark/" due to the CI outage.

$ du -h 4.1.0-preview1 | tail -n1
1.2G 4.1.0-preview1

The bottom line is that we need a clear hard limit for this newly proposed duplication for machine-friendly metadata. If we have a systematic way to keep the upper bound below 10MB in total per Spark version (now and forever), it sounds like a good addition.

Thanks,
Dongjoon.

On Tue, Sep 9, 2025 at 7:19 PM Allison Wang <allisonw...@apache.org> wrote:

> Yes, that’s right. It’s essentially just one markdown file to start with,
> and we can add more later for language or version specific files if needed.
>
> On Tue, Sep 9, 2025 at 4:32 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>
>> so it's basically adding one text file for llm, right? I think it's a
>> good idea.
>>
>> On Tue, 9 Sept 2025 at 10:22, Allison Wang <allisonw...@apache.org>
>> wrote:
>>
>>> Hi all,
>>>
>>> I’d like to propose adding llms.txt files to the Spark documentation.
>>>
>>> As more users rely on AI-assisted tools and LLMs to learn, write Spark
>>> code, and troubleshoot issues, it’s increasingly important that these tools
>>> point back to the up-to-date official documentation. This will help
>>> improve code generation quality and make new Spark features easier to
>>> discover. The emerging llms.txt convention <https://llmstxt.org/>
>>> provides a lightweight way to curate LLM-friendly manifests of key
>>> documentation links.
>>>
>>> Would love to hear your feedback!
>>> SPIP:
>>> https://docs.google.com/document/d/1tRYdNTrIs8-JTgDthQ-7kcxEG7S91mNUVmUOfevW-cE/edit?tab=t.0#heading=h.wq8o4rl94dvr
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-53528
>>>
>>> Thanks,
>>> Allison
>>>
>>
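
For illustration only, here is a minimal sketch of how the hard limit requested in this thread might be checked as a step after the docs are generated. It is not part of the SPIP or the Spark build. The function names, the `llms*.txt` file pattern, and the reading of "10MB" as 10 * 1024 * 1024 bytes are all assumptions.

```python
from pathlib import Path

# Assumed budget: "under 10MB per Spark version", read here as 10 MiB.
MAX_TOTAL_BYTES = 10 * 1024 * 1024


def llms_txt_total_bytes(docs_root: Path) -> int:
    """Sum the sizes of all llms*.txt files under one version's docs tree.

    The glob pattern is an assumption; it matches llms.txt and any
    similarly prefixed variants at every directory depth.
    """
    return sum(p.stat().st_size for p in docs_root.rglob("llms*.txt") if p.is_file())


def check_llms_txt_budget(docs_root: Path, limit: int = MAX_TOTAL_BYTES) -> None:
    """Raise RuntimeError unless the total stays strictly under the limit."""
    total = llms_txt_total_bytes(docs_root)
    if total >= limit:
        raise RuntimeError(
            f"llms.txt files under {docs_root} total {total} bytes, "
            f"which is not under the {limit}-byte limit"
        )
```

A release script could, for example, call `check_llms_txt_budget(Path("4.1.0-preview1"))` after the docs are generated and fail the build when the limit is reached. Running the check per version directory mirrors the "per Spark version in total" wording above.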