Awesome, I'm glad that others think this is a good idea. If you have a chance, I'd love a review of the initial PR that splits the guide apart: <https://github.com/apache/spark/pull/47864>.
And I have plans to break up the guide and refine the content even further; my personal Structured Streaming docs site <https://structured-streaming.vercel.app/> is what I'd like our official docs to head towards. There are operator-specific pages, dedicated pages for configs like output mode/triggers, e2e examples, etc.

Neil

On Sun, Aug 25, 2024 at 7:27 PM Anish Shrigondekar <anish.shrigonde...@databricks.com> wrote:

> +1 - I think it would be super helpful to split this single long doc.
>
> Couple of points that might be useful:
> - If we could create an individual page per operator with some concrete
> examples, that would be great
> - As Jungtaek mentioned, separating the API reference from the conceptual
> portions might also be helpful
> - If we could keep allied streaming docs (state data source reader, kafka
> connector etc) referenced easily and also following a similar pattern, that
> would also be great
>
> As Jungtaek mentioned, this doesn't have to be as part of a single change.
> We can split these changes over time.
>
> Thanks,
> Anish
>
> On Sun, Aug 25, 2024 at 6:17 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> Sorry for the late reply. I'm strongly supportive about the initial
>> movement, splitting a crazy long one page of guide doc into multiple pages.
>>
>> If anyone ever looks at the SS guide doc page, they would agree that
>> anyone can't simply say "RTFM" to learn about SS or use the page as
>> reference. The Chrome plugin "The Read Time" tells me that *the SS guide
>> page for Spark 3.5.2 has 22655 words and takes 100 mins to read*. (They
>> will still need to read the SS + Kafka guide page as well in many cases.)
>>
>> Given the characteristic of context (there is a learning curve for
>> newcomers), it's probably 1.5x ~ 2x (it may not even be sufficient) for
>> newcomers and some contents are conceptual vs some other contents are
>> almost for reference, so it's definitely helpful to start splitting the
>> page into multiple pages. It'd be ideal if we could classify the
>> conceptual + quick start content vs reference context and place them
>> properly so that users could have their own pace and needs, but it doesn't
>> need to be done at once.
>>
>> On Fri, Aug 23, 2024 at 10:54 PM Neil Ramaswamy <n...@ramaswamy.org>
>> wrote:
>>
>>> Since it's been over 72 hours with no objections, I'm going to make a PR
>>> with this change. If you have any specific opinions, we can discuss them on
>>> GitHub.
>>>
>>> Neil
>>>
>>> On Tue, Aug 20, 2024 at 12:11 AM Neil Ramaswamy <n...@ramaswamy.org>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> A few months ago, I started a thread about migrating our programming
>>>> guides to be versionless. I had a POC, and the mostly-positive reception on
>>>> the thread encouraged me to implement it for real.
>>>>
>>>> I did that recently here
>>>> <https://github.com/neilramaswamy/spark-website/pull/2>, but there
>>>> were a few critical issues: some guides (like MLlib) reference code
>>>> examples in the apache/spark repo itself, and the SQL reference directly
>>>> references the generated API reference using a Jekyll Liquid tag called
>>>> include_api_gen. I think these are non-starters unless there is significant
>>>> community interest.
>>>>
>>>> One of the motivations for versionless guides was to be able to quickly
>>>> iterate to avoid large, SEO-impacting changes. However, with the challenge
>>>> that versionless poses, I think it's better to just break apart the large
>>>> guides, like the Structured Streaming one, and just hope that they rank
>>>> well in Spark 4.0.0+.
>>>>
>>>> To that end, I've broken apart the Structured Streaming Programming
>>>> Guide; it now resembles the MLlib and SQL reference guides. Critically, I
>>>> have not changed *any* content. This work should make it easier for us
>>>> to better paginate and structure our Structured Streaming docs in the
>>>> future, which will make it easier for our users to consume. This is
>>>> especially important because similar tools like Flink do a much nicer job
>>>> of organizing content.
>>>>
>>>> You can view the changes on my personal site here
>>>> <https://nr-spark-site.vercel.app/streaming/index.html>, and you can
>>>> see the code changes here
>>>> <https://github.com/neilramaswamy/nr-spark/pull/6>. Please let me know
>>>> what you think; if there's no major objection, I will create a ticket and
>>>> submit the PR.
>>>>
>>>> Best,
>>>> Neil