It’s broken again…
> On Apr 3, 2026, at 12:18 PM, Nicholas Chammas <[email protected]> wrote:
>
> Thanks for fixing this. I can confirm it’s working from my side.
>
> Looks like we need some kind of alert on Algolia's crawl status
> <https://www.algolia.com/doc/tools/crawler/troubleshooting/crawl-status>. If
> there’s a way a non-committer can help with this, let me know.
>
>
>> On Apr 3, 2026, at 1:39 AM, Gengliang Wang <[email protected]> wrote:
>>
>> Hi Nicholas,
>>
>> The crawler configuration was not updated after the Spark 4.1.1 release, as
>> documented in the release process
>> <https://spark.apache.org/release-process.html>. I've fixed it.
>>
>> A unit test isn't really feasible here since the doc search is powered by
>> Algolia, but we could set up an Algolia monitoring alert to catch this
>> proactively. I'll look into it when I have the bandwidth.
>>
>> Gengliang
>>
>> On Wed, Apr 1, 2026 at 3:09 PM Nicholas Chammas <[email protected]> wrote:
>>> It’s broken again. This is the third breakage I am reporting in the past
>>> couple of years.
>>>
>>> Is there some sort of alert or CI test we could set up to catch or prevent
>>> this going forward?
>>>
>>>
>>>> On Dec 21, 2025, at 1:35 PM, Gengliang Wang <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>>
>>>> The crawler issue has been identified and fixed.
>>>>
>>>> The root cause was that the crawler fails when the latest result
>>>> contains less than 90% of the previous result. Increasing the
>>>> `maxLostRecordsPercentage` threshold resolves the issue.
>>>>
>>>> https://www.algolia.com/doc/tools/crawler/apis/configuration/safety-checks
>>>>
>>>>
>>>> On Wed, Dec 17, 2025 at 10:03 PM Xiao Li <[email protected]> wrote:
>>>>> Thanks for reporting it! Will take a look
>>>>>
>>>>> On Fri, Dec 5, 2025 at 04:19, Nicholas Chammas <[email protected]> wrote:
>>>>>> Bueller?
>>>>>>
>>>>>> Is anyone on this list able to fix the crawler?
>>>>>>
>>>>>>
>>>>>>> On Dec 1, 2025, at 12:19 PM, Nicholas Chammas <[email protected]> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> This seems to be happening again.
>>>>>>>
>>>>>>> Perhaps we should add a new test (but where, I wonder?) to ensure that
>>>>>>> Algolia search doesn’t break without us knowing.
>>>>>>>
>>>>>>> Nick
>>>>>>>
>>>>>>>
>>>>>>>> On Dec 11, 2023, at 5:02 AM, Gengliang Wang <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Nick,
>>>>>>>>
>>>>>>>> Thank you for reporting the issue with our web crawler.
>>>>>>>>
>>>>>>>> I've found that the issue was due to a change (specifically, pull
>>>>>>>> request #40269 <https://github.com/apache/spark/pull/40269>) in the
>>>>>>>> website's HTML structure, where the JavaScript selector
>>>>>>>> ".container-wrapper" is now ".container". I've updated the crawler
>>>>>>>> accordingly, and it's working properly now.
>>>>>>>>
>>>>>>>> Gengliang
>>>>>>>>
>>>>>>>> On Sun, Dec 10, 2023 at 8:15 AM Nicholas Chammas <[email protected]> wrote:
>>>>>>>>> Pinging Gengliang and Xiao about this, per these docs
>>>>>>>>> <https://github.com/apache/spark-website/blob/0ceaaaf528ec1d0201e1eab1288f37cce607268b/release-process.md#update-the-configuration-of-algolia-crawler>.
>>>>>>>>>
>>>>>>>>> It looks like to fix this problem you need access to the Algolia
>>>>>>>>> Crawler Admin Console.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Dec 5, 2023, at 11:28 AM, Nicholas Chammas <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Should I report this instead on Jira? Apologies if the dev list is
>>>>>>>>>> not the right place.
>>>>>>>>>>
>>>>>>>>>> Search on the website appears to be broken. For example, here is a
>>>>>>>>>> search for “analyze”:
>>>>>>>>>>
>>>>>>>>>> <Image 12-5-23 at 11.26 AM.jpeg>
>>>>>>>>>>
>>>>>>>>>> And here is the same search using DDG
>>>>>>>>>> <https://duckduckgo.com/?q=site:https://spark.apache.org/docs/latest/+analyze&t=osx&ia=web>.
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>
>
