+1 - I think it would be super helpful to split this single long doc.

Couple of points that might be useful:
- If we could create an individual page per operator with some concrete
examples, that would be great
- As Jungtaek mentioned, separating the API reference from the conceptual
portions might also be helpful
- If we could keep related streaming docs (state data source reader, Kafka
connector, etc.) easy to reference and following a similar pattern, that
would also be great

As Jungtaek mentioned, this doesn't have to be part of a single change.
We can split these changes over time.

Thanks,
Anish

On Sun, Aug 25, 2024 at 6:17 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Sorry for the late reply. I strongly support the initial move of
> splitting the extremely long single-page guide doc into multiple pages.
>
> Anyone who has looked at the SS guide doc page would agree that no one
> can simply say "RTFM" to someone learning SS, or use the page as a
> reference. The Chrome plugin "The Read Time" tells me that *the SS guide
> page for Spark 3.5.2 has 22655 words and takes 100 mins to read*. (They
> will still need to read the SS + Kafka guide page as well in many cases.)
>
> Given the learning curve for newcomers, the reading time is probably
> 1.5x ~ 2x for them (and even that may not be enough). Some content is
> conceptual while other content is almost purely reference, so it's
> definitely helpful to start splitting the page into multiple pages. It'd
> be ideal if we could classify the conceptual + quick start content vs.
> reference content and place them properly so that users can read at their
> own pace and according to their needs, but it doesn't need to be done at
> once.
>
> On Fri, Aug 23, 2024 at 10:54 PM Neil Ramaswamy <n...@ramaswamy.org>
> wrote:
>
>> Since it's been over 72 hours with no objections, I'm going to make a PR
>> with this change. If you have any specific opinions, we can discuss them on
>> GitHub.
>>
>> Neil
>>
>> On Tue, Aug 20, 2024 at 12:11 AM Neil Ramaswamy <n...@ramaswamy.org>
>> wrote:
>>
>>> Hi all,
>>>
>>> A few months ago, I started a thread about migrating our programming
>>> guides to be versionless. I had a POC, and the mostly-positive reception on
>>> the thread encouraged me to implement it for real.
>>>
>>> I did that recently here
>>> <https://github.com/neilramaswamy/spark-website/pull/2>, but there were
>>> a few critical issues: some guides (like MLlib) reference code examples in
>>> the apache/spark repo itself, and the SQL reference directly references the
>>> generated API reference using a Jekyll Liquid tag called include_api_gen. I
>>> think these are non-starters unless there is significant community interest.
>>>
>>> One of the motivations for versionless guides was to be able to quickly
>>> iterate to avoid large, SEO-impacting changes. However, with the challenge
>>> that versionless poses, I think it's better to just break apart the large
>>> guides, like the Structured Streaming one, and just hope that they rank
>>> well in Spark 4.0.0+.
>>>
>>> To that end, I've broken apart the Structured Streaming Programming
>>> Guide—it now resembles the MLlib and SQL reference guides. Critically, I
>>> have not changed *any* content. This work should make it easier for us
>>> to better paginate and structure our Structured Streaming docs in the
>>> future, which will make it easier for our users to consume. This is
>>> especially important because similar tools like Flink do a much nicer job
>>> of organizing content.
>>>
>>> You can view the changes on my personal site here
>>> <https://nr-spark-site.vercel.app/streaming/index.html>, and you can
>>> see the code changes here
>>> <https://github.com/neilramaswamy/nr-spark/pull/6>. Please let me know
>>> what you think; if there's no major objection, I will create a ticket and
>>> submit the PR.
>>>
>>> Best,
>>> Neil
>>>
>>
