Ah; that makes total sense; thanks.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Sun, Feb 21, 2021 at 12:06 PM Ilan Ginzburg <ilans...@gmail.com> wrote:

> Searching in my Jenkins folder for failures of this test (label:jenkins
> "FAILED: org.apache.solr.cloud.OverseerStatusTest.test") 26 emails match.
> Searching for all Jenkins master builds emails since the first failure
> email found above (2 days ago), I see 40 messages.
> 26 over 40 is not far from the expected 50% failure rate.
> I believe the ratio in the graph you sent David (currently at 5.7%) is
> averaged over a week, and includes failures from all branches (did some
> other stats on Jenkins emails that tend to confirm this assumption).
>
> On Sun, Feb 21, 2021 at 10:53 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
>> Yes Marcus this is the commit.
>>
>> David I would have expected 50% failures, as 50% of the runs use
>> distributed updates. I’ll try to understand better as I fix the issue.
>>
>> Ilan
>>
>> On Sun 21 Feb 2021 at 06:17, David Smiley <dsmi...@apache.org> wrote:
>>
>>> Interesting. Do you have a guess as to why the failures there are ~5%
>>> and not 100% reproducible?
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Sat, Feb 20, 2021 at 6:41 PM Ilan Ginzburg <ilans...@gmail.com>
>>> wrote:
>>>
>>>> Indeed the issue is due to my changes.
>>>>
>>>> In OverseerStatusCmd I've skipped some stat collection when running in
>>>> distributed cluster state updates mode because I thought these were only
>>>> stats related to cluster state updates.
>>>> Obviously that was too aggressive and some of the stats are related to
>>>> the Collection API.
>>>>
>>>> I will make sure to skip returning only the stats that are related to
>>>> cluster state updater and restore returning Collection API stats (when
>>>> running in distributed cluster updates mode, otherwise all stats are
>>>> returned).
>>>>
>>>> Tomorrow...
>>>>
>>>> Ilan
>>>>
>>>> On Sun, Feb 21, 2021 at 12:22 AM Ilan Ginzburg <ilans...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thank you David for reporting this.
>>>>>
>>>>> Seems due to my recent changes. I reproduce the failure locally and
>>>>> will look at this tomorrow.
>>>>>
>>>>> With the distributed cluster state updates I've introduced a
>>>>> randomization for using either Overseer based cluster state updates or
>>>>> distributed cluster state updates in tests. This failure seems to happen
>>>>> in the distributed state update case. I suspect it is due to Overseer
>>>>> returning fewer stats than expected by the test (which is expected:
>>>>> Overseer cannot return stats about cluster state updates if it does not
>>>>> handle cluster state updates).
>>>>>
>>>>> The following line in the logs tells that the run is using distributed
>>>>> cluster state:
>>>>> 972874 INFO (jetty-launcher-8973-thread-2) [ ]
>>>>> o.a.s.c.DistributedClusterStateUpdater Creating
>>>>> DistributedClusterStateUpdater with useDistributedStateUpdate=true. Solr
>>>>> will be using distributed cluster state updates.
>>>>>
>>>>> Ilan
>>>>>
>>>>>
>>>>> On Sat, Feb 20, 2021 at 3:00 PM David Smiley <dsmi...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I encountered a failure from OverseerStatusTest locally. According
>>>>>> to our test failure trends, this guy only just recently started failing
>>>>>> ~4-5% of the time, but previously was fine. Only master branch.
>>>>>>
>>>>>>
>>>>>> http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.OverseerStatusTest.test
>>>>>>
>>>>>> ~ David Smiley
>>>>>> Apache Lucene/Solr Search Developer
>>>>>> http://www.linkedin.com/in/davidwsmiley
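
For readers of the thread, the fix Ilan describes above (skip only the stats tied to the cluster state updater when distributed updates are on, and keep returning Collection API stats in both modes) might look roughly like the sketch below. It is an illustration in Python, not the actual OverseerStatusCmd code: the function name, the flag name, and every stat key are hypothetical.

```python
# Hypothetical stat keys: illustrative only, not the keys Solr actually returns.
CLUSTER_STATE_UPDATER_STATS = {"state_update_queue_size", "state_update_operations"}


def overseer_status(all_stats, use_distributed_state_update):
    """Return the stats to report for the given cluster state update mode."""
    if not use_distributed_state_update:
        # Overseer handles cluster state updates itself: report everything.
        return dict(all_stats)
    # Distributed mode: Overseer does not handle cluster state updates, so drop
    # only those stats and keep the rest (Collection API stats included).
    return {
        name: value
        for name, value in all_stats.items()
        if name not in CLUSTER_STATE_UPDATER_STATS
    }


# Made-up sample values, used only to exercise both modes.
stats = {
    "state_update_queue_size": 0,
    "state_update_operations": 12,
    "collection_queue_size": 1,
    "collection_operations": 3,
}
distributed = overseer_status(stats, use_distributed_state_update=True)
overseer_based = overseer_status(stats, use_distributed_state_update=False)
```

In this sketch the earlier, too-aggressive behaviour would correspond to dropping the collection_* entries as well in distributed mode, which is the kind of missing stat the test would notice.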