Thanks, Dongjoon, for raising these concerns. I agree it's worth making the
lightweight manifest scope explicit in the SPIP so we have a systematic
guarantee that it stays small (under 10MB).
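One way to make that guarantee systematic rather than aspirational would be a small check in the docs build or CI. The sketch below is purely illustrative; the function name, the docs path in the usage comment, and the exact cap are assumptions for discussion, not part of the SPIP:

```shell
#!/bin/sh
# Hypothetical CI guard (names and paths are illustrative): fail the docs
# build if the combined size of all llms.txt files under a docs tree
# exceeds the 10MB cap discussed in this thread.
check_llms_size() {
  dir="$1"
  limit=10485760  # 10MB in bytes
  # Concatenate every llms.txt under $dir and count the total bytes;
  # tr strips the leading padding some wc implementations emit.
  total=$(find "$dir" -name 'llms.txt' -exec cat {} + 2>/dev/null | wc -c | tr -d ' ')
  if [ "$total" -gt "$limit" ]; then
    echo "llms.txt total ${total} bytes exceeds ${limit}-byte cap" >&2
    return 1
  fi
  echo "llms.txt total: ${total} bytes (limit ${limit})"
}

# Example (path is an assumption): check_llms_size docs/_site
```

Something like this could run per release so the bound holds for every version going forward, not just the initial files.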

To clarify, llms.txt is not meant to be a duplication of the full
documentation. Instead, it acts more like an index or table of contents page:
a small, curated manifest that points to existing canonical docs. The
intent is to help AI-assisted tools and LLMs discover the right entry
points, not to repackage the entire documentation set.

For example, DuckDB's llms.txt <https://duckdb.org/docs/stable/llms.txt>
is around 30KB. Spark's manifests will likely be somewhat larger given the
broader scope of APIs and documentation, but they should still remain
lightweight, link-only markdown files well under the 10MB limit, even
across multiple versions and language scopes.
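For concreteness, a link-only manifest in the style described at llmstxt.org might look like the sketch below. The section headings and the particular links chosen are illustrative, not taken from the SPIP:

```markdown
# Apache Spark

> Unified engine for large-scale data analytics.

## Documentation

- [Spark Overview](https://spark.apache.org/docs/latest/index.html): entry point for the latest docs
- [SQL Programming Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html): DataFrame and SQL APIs
- [PySpark API Reference](https://spark.apache.org/docs/latest/api/python/reference/index.html): Python API docs
```

Each entry is one line pointing at an existing canonical page, which is why the file stays in the tens of kilobytes rather than duplicating any documentation content.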

On Wed, Sep 10, 2025 at 8:47 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> This should just be a llm-facing index page of Spark docs? Given the
> amount of APIs Spark provides today, I think this index page should be
> useful to humans as well.
>
> On Wed, Sep 10, 2025 at 10:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Thank you, Allison and Hyukjin.
>>
>> IIUC, this proposal is not about a single file. The SPIP already exposes
>> multiple files, which may double our documentation and website size (or
>> more in the worst case) because it's simply a duplication of the content.
>> If we start to use AI tools to generate these llms.txt files, they could
>> be much bigger than the originals.
>>
>> *** From SPIP ***
>> - [PySpark (Python)](
>> https://spark.apache.org/docs/latest/api/python/llms.txt)
>> - [Scala](https://spark.apache.org/docs/latest/api/scala/llms.txt)
>> - [4.0.0 docs hub](
>> https://archive.apache.org/dist/spark/docs/4.0.0/llms.txt)
>> ***
>>
>> Since the size of the Apache Spark 4.1.0-preview1 documentation is 1.2GB,
>> could you propose a systematic way to always keep the total size of the
>> newly added llms.txt files under 10MB, Allison? If we don't have full
>> control, this duplication will break the ASF Spark website like last
>> year. We already inevitably archived old Spark documents from the
>> original website location to "https://archive.apache.org/dist/spark/"
>> due to the CI outage.
>>
>> $ du -h 4.1.0-preview1 | tail -n1
>> 1.2G 4.1.0-preview1
>>
>> The bottom line is that we need a clear hard limit for this newly
>> proposed duplication of machine-friendly metadata. If we have a systematic
>> way to enforce an upper bound of less than 10MB per Spark version in
>> total (now and forever), it sounds like a good addition.
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Tue, Sep 9, 2025 at 7:19 PM Allison Wang <allisonw...@apache.org>
>> wrote:
>>
>>> Yes, that’s right. It’s essentially just one markdown file to start
>>> with, and we can add more language- or version-specific files later if
>>> needed.
>>>
>>> On Tue, Sep 9, 2025 at 4:32 PM Hyukjin Kwon <gurwls...@apache.org>
>>> wrote:
>>>
>>>> so it's basically adding one text file for llm, right? I think it's a
>>>> good idea.
>>>>
>>>> On Tue, 9 Sept 2025 at 10:22, Allison Wang <allisonw...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I’d like to propose adding llms.txt files to the Spark documentation.
>>>>>
>>>>> As more users rely on AI-assisted tools and LLMs to learn, write Spark
>>>>> code, and troubleshoot issues, it’s increasingly important that these 
>>>>> tools
>>>>> point back to the up-to-date official documentation. This will help
>>>>> improve code generation quality and make new Spark features easier to
>>>>> discover. The emerging llms.txt convention <https://llmstxt.org/>
>>>>> provides a lightweight way to curate LLM-friendly manifests of key
>>>>> documentation links.
>>>>>
>>>>> Would love to hear your feedback!
>>>>> SPIP:
>>>>> https://docs.google.com/document/d/1tRYdNTrIs8-JTgDthQ-7kcxEG7S91mNUVmUOfevW-cE/edit?tab=t.0#heading=h.wq8o4rl94dvr
>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-53528
>>>>>
>>>>> Thanks,
>>>>> Allison
>>>>>
>>>>
