The llms.txt protocol is not a standard yet.

"*To clarify, llms.txt is not meant to be a duplication of the full documentation.*"

Some sites, such as the Model Context Protocol (MCP) site <https://modelcontextprotocol.io/tutorials/building-mcp-with-llms>, put their full web pages into the llms file: https://modelcontextprotocol.io/llms-full.txt
https://modelcontextprotocol.io/tutorials/building-mcp-with-llms

On Wed, Sep 10, 2025 at 22:27, Allison Wang <allison.w...@databricks.com.invalid> wrote:

> Thanks Dongjoon for raising these concerns. I agree with your point that
> it’s worth making the lightweight manifest scope explicit in the SPIP so we
> have a systematic guarantee it stays small (under 10MB).
>
> To clarify, llms.txt is not meant to be a duplication of the full
> documentation. Instead, it acts more like an index or table of contents
> page: a small, curated manifest that points to existing canonical docs.
> The intent is to help AI-assisted tools and LLMs discover the right entry
> points, not to repackage the entire documentation set.
>
> For example this DuckDB's llms.txt
> <https://duckdb.org/docs/stable/llms.txt> file is around 30KB in
> size. Spark’s manifests will likely be a bit larger given the broader scope
> of APIs and documentation, but they should still remain lightweight
> link-only markdown files and well under the 10MB limit, even across
> multiple versions and language scopes.
>
> On Wed, Sep 10, 2025 at 8:47 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> This should just be a llm-facing index page of Spark docs? Given the
>> amount of APIs Spark provides today, I think this index page should be
>> useful to humans as well.
>>
>> On Wed, Sep 10, 2025 at 10:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Thank you, Allison and Hyukjin.
>>>
>>> IIUC, this proposal is not about a single file. SPIP already exposes
>>> multiple files which may increase our documentation and website size twice
>>> (or more in the worst case) because it's simply a duplication of the
>>> content. If we start to use AI tools to generate these LLMS.txt files, it
>>> could be much bigger than the original.
>>>
>>> *** From SPIP ***
>>> - [PySpark (Python)](https://spark.apache.org/docs/latest/api/python/llms.txt)
>>> - [Scala](https://spark.apache.org/docs/latest/api/scala/llms.txt)
>>> - [4.0.0 docs hub](https://archive.apache.org/dist/spark/docs/4.0.0/llms.txt)
>>> ***
>>>
>>> Since the size of Apache Spark 4.1.0-preview1 documentation is 1.2GB,
>>> could you propose to limit the total size of newly added llms.txt files
>>> under 10MB always systematically, Allison? If we don't have full
>>> controllability, this duplication will break the ASF Spark website like
>>> last year. We already inevitably archived old Spark documents from the
>>> original website location to "https://archive.apache.org/dist/spark/"
>>> due to the CI outage.
>>>
>>> $ du -h 4.1.0-preview1 | tail -n1
>>> 1.2G 4.1.0-preview1
>>>
>>> The bottom line is that we need to have a clear hard limit for this
>>> newly proposed duplication for machine-friendly metadata. If we have a
>>> systematic way to control the upper bound which is less than 10MB per Spark
>>> version in total (now and forever), it sounds like a good addition.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Sep 9, 2025 at 7:19 PM Allison Wang <allisonw...@apache.org>
>>> wrote:
>>>
>>>> Yes, that’s right. It’s essentially just one markdown file to start
>>>> with, and we can add more later for language or version specific files if
>>>> needed.
>>>>
>>>> On Tue, Sep 9, 2025 at 4:32 PM Hyukjin Kwon <gurwls...@apache.org>
>>>> wrote:
>>>>
>>>>> so it's basically adding one text file for llm, right? I think it's a
>>>>> good idea.
>>>>>
>>>>> On Tue, 9 Sept 2025 at 10:22, Allison Wang <allisonw...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I’d like to propose adding llms.txt files to the Spark documentation.
>>>>>>
>>>>>> As more users rely on AI-assisted tools and LLMs to learn, write
>>>>>> Spark code, and troubleshoot issues, it’s increasingly important that
>>>>>> these tools point back to the up-to-date official documentation. This
>>>>>> will help improve code generation quality and make new Spark features
>>>>>> easier to discover. The emerging llms.txt convention
>>>>>> <https://llmstxt.org/> provides a lightweight way to curate
>>>>>> LLM-friendly manifests of key documentation links.
>>>>>>
>>>>>> Would love to hear your feedback!
>>>>>> SPIP:
>>>>>> https://docs.google.com/document/d/1tRYdNTrIs8-JTgDthQ-7kcxEG7S91mNUVmUOfevW-cE/edit?tab=t.0#heading=h.wq8o4rl94dvr
>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-53528
>>>>>>
>>>>>> Thanks,
>>>>>> Allison
>>>>>>
>>>>>

--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297