+1 - I think it would be super helpful to split this single long doc.

A couple of points that might be useful:
- If we could create an individual page per operator with some concrete examples, that would be great
- As Jungtaek mentioned, separating the API reference from the conceptual portions might also be helpful
- If we could keep allied streaming docs (state data source reader, Kafka connector, etc.) easily referenced and following a similar pattern, that would also be great

As Jungtaek mentioned, this doesn't have to be part of a single change. We can split these changes over time.

Thanks,
Anish

On Sun, Aug 25, 2024 at 6:17 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Sorry for the late reply. I'm strongly supportive about the initial
> movement, splitting a crazy long one page of guide doc into multiple pages.
>
> If anyone ever looks at the SS guide doc page, they would agree that
> anyone can't simply say "RTFM" to learn about SS or use the page as
> reference. The Chrome plugin "The Read Time" tells me that *the SS guide
> page for Spark 3.5.2 has 22655 words and takes 100 mins to read*. (They
> will still need to read the SS + Kafka guide page as well in many cases.)
>
> Given the characteristic of context (there is a learning curve for
> newcomers), it's probably 1.5x ~ 2x (it may not even be sufficient) for
> newcomers and some contents are conceptual vs some other contents are
> almost for reference, so it's definitely helpful to start splitting the
> page into multiple pages. It'd be ideal if we could classify the
> conceptual + quick start content vs reference context and place them
> properly so that users could have their own pace and needs, but it doesn't
> need to be done at once.
>
> On Fri, Aug 23, 2024 at 10:54 PM Neil Ramaswamy <n...@ramaswamy.org>
> wrote:
>
>> Since it's been over 72 hours with no objections, I'm going to make a PR
>> with this change. If you have any specific opinions, we can discuss them on
>> GitHub.
>>
>> Neil
>>
>> On Tue, Aug 20, 2024 at 12:11 AM Neil Ramaswamy <n...@ramaswamy.org>
>> wrote:
>>
>>> Hi all,
>>>
>>> A few months ago, I started a thread about migrating our programming
>>> guides to be versionless. I had a POC, and the mostly-positive reception on
>>> the thread encouraged me to implement it for real.
>>>
>>> I did that recently here
>>> <https://github.com/neilramaswamy/spark-website/pull/2>, but there were
>>> a few critical issues: some guides (like MLlib) reference code examples in
>>> the apache/spark repo itself, and the SQL reference directly references the
>>> generated API reference using a Jekyll Liquid tag called include_api_gen. I
>>> think these are non-starters unless there is significant community interest.
>>>
>>> One of the motivations for versionless guides was to be able to quickly
>>> iterate to avoid large, SEO-impacting changes. However, with the challenge
>>> that versionless poses, I think it's better to just break apart the large
>>> guides, like the Structured Streaming one, and just hope that they rank
>>> well in Spark 4.0.0+.
>>>
>>> To that end, I've broken apart the Structured Streaming Programming
>>> Guide; it now resembles the MLlib and SQL reference guides. Critically, I
>>> have not changed *any* content. This work should make it easier for us
>>> to better paginate and structure our Structured Streaming docs in the
>>> future, which will make it easier for our users to consume. This is
>>> especially important because similar tools like Flink do a much nicer job
>>> of organizing content.
>>>
>>> You can view the changes on my personal site here
>>> <https://nr-spark-site.vercel.app/streaming/index.html>, and you can
>>> see the code changes here
>>> <https://github.com/neilramaswamy/nr-spark/pull/6>. Please let me know
>>> what you think; if there's no major objection, I will create a ticket and
>>> submit the PR.
>>>
>>> Best,
>>> Neil
>>>
>>