llms.txt is not a formal standard yet.

Allison wrote: "*To clarify, llms.txt is not meant to be a duplication of the
full documentation.*"
Some sites do put their full content in an llms file, though. The Model
Context Protocol (MCP) site
<https://modelcontextprotocol.io/tutorials/building-mcp-with-llms>, for
example, publishes its full web pages in one:
https://modelcontextprotocol.io/llms-full.txt
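
A rough sketch of how the size budget discussed in the quoted thread below
could be checked against a built docs tree. Everything here is an assumption
for illustration, not something the SPIP specifies: the function names are
hypothetical, the "llms*.txt" pattern (chosen so that llms-full.txt style
files count too), and reading "10MB" as 10 MiB.

```python
from pathlib import Path

# Assumption: the thread's "10MB" budget is read as 10 MiB, per docs tree.
LIMIT_BYTES = 10 * 1024 * 1024


def llms_manifest_bytes(docs_root):
    """Sum the sizes of all llms*.txt files found under docs_root."""
    return sum(
        p.stat().st_size
        for p in Path(docs_root).rglob("llms*.txt")
        if p.is_file()
    )


def within_budget(docs_root, limit=LIMIT_BYTES):
    """Return True if the llms manifests together stay within the limit."""
    return llms_manifest_bytes(docs_root) <= limit
```

A docs build could call within_budget() on the directory for one Spark
version and fail the build when it returns False.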

On Wed, Sep 10, 2025 at 22:27, Allison Wang
<allison.w...@databricks.com.invalid> wrote:

> Thanks Dongjoon for raising these concerns. I agree with your point that
> it’s worth making the lightweight manifest scope explicit in the SPIP so we
> have a systematic guarantee it stays small (under 10MB).
>
> To clarify, llms.txt is not meant to be a duplication of the full
> documentation. Instead, it acts more like an index or table of contents
> page: a small, curated manifest that points to existing canonical docs.
> The intent is to help AI-assisted tools and LLMs discover the right entry
> points, not to repackage the entire documentation set.
>
> For example, DuckDB's llms.txt
> <https://duckdb.org/docs/stable/llms.txt> file is around 30KB in
> size. Spark’s manifests will likely be a bit larger given the broader scope
> of APIs and documentation, but they should still remain lightweight
> link-only markdown files and well under the 10MB limit, even across
> multiple versions and language scopes.
>
> On Wed, Sep 10, 2025 at 8:47 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> This should just be an LLM-facing index page of Spark docs? Given the
>> number of APIs Spark provides today, I think this index page should be
>> useful to humans as well.
>>
>> On Wed, Sep 10, 2025 at 10:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Thank you, Allison and Hyukjin.
>>>
>>> IIUC, this proposal is not about a single file. The SPIP already lists
>>> multiple files, which may double our documentation and website size
>>> (or more in the worst case) because it's simply a duplication of the
>>> content. If we start to use AI tools to generate these llms.txt files, they
>>> could be much bigger than the original.
>>>
>>> *** From SPIP ***
>>> - [PySpark (Python)](
>>> https://spark.apache.org/docs/latest/api/python/llms.txt)
>>> - [Scala](https://spark.apache.org/docs/latest/api/scala/llms.txt)
>>> - [4.0.0 docs hub](
>>> https://archive.apache.org/dist/spark/docs/4.0.0/llms.txt)
>>> ***
>>>
>>> Since the size of the Apache Spark 4.1.0-preview1 documentation is 1.2GB,
>>> could you propose a systematic way to always keep the total size of the
>>> newly added llms.txt files under 10MB, Allison? If we don't have full
>>> control, this duplication will break the ASF Spark website like
>>> last year. We already had to archive old Spark documents from the
>>> original website location to "https://archive.apache.org/dist/spark/"
>>> due to the CI outage.
>>>
>>> $ du -h 4.1.0-preview1 | tail -n1
>>> 1.2G 4.1.0-preview1
>>>
>>> The bottom line is that we need to have a clear hard limit for this
>>> newly proposed duplication for machine-friendly metadata. If we have a
>>> systematic way to control the upper bound which is less than 10MB per Spark
>>> version in total (now and forever), it sounds like a good addition.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Sep 9, 2025 at 7:19 PM Allison Wang <allisonw...@apache.org>
>>> wrote:
>>>
>>>> Yes, that’s right. It’s essentially just one markdown file to start
>>>> with, and we can add more later for language- or version-specific files if
>>>> needed.
>>>>
>>>> On Tue, Sep 9, 2025 at 4:32 PM Hyukjin Kwon <gurwls...@apache.org>
>>>> wrote:
>>>>
>>>>> So it's basically adding one text file for LLMs, right? I think it's a
>>>>> good idea.
>>>>>
>>>>> On Tue, 9 Sept 2025 at 10:22, Allison Wang <allisonw...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I’d like to propose adding llms.txt files to the Spark documentation.
>>>>>>
>>>>>> As more users rely on AI-assisted tools and LLMs to learn, write
>>>>>> Spark code, and troubleshoot issues, it’s increasingly important that
>>>>>> these tools point back to the up-to-date official documentation. This will
>>>>>> help improve code generation quality and make new Spark features easier
>>>>>> to discover. The emerging llms.txt convention <https://llmstxt.org/>
>>>>>> provides a lightweight way to curate LLM-friendly manifests of key
>>>>>> documentation links.
>>>>>>
>>>>>> Would love to hear your feedback!
>>>>>> SPIP:
>>>>>> https://docs.google.com/document/d/1tRYdNTrIs8-JTgDthQ-7kcxEG7S91mNUVmUOfevW-cE/edit?tab=t.0#heading=h.wq8o4rl94dvr
>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-53528
>>>>>>
>>>>>> Thanks,
>>>>>> Allison
>>>>>>
>>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
