Searching my Jenkins folder for failures of this test (label:jenkins "FAILED: org.apache.solr.cloud.OverseerStatusTest.test"), 26 emails match. Searching for all Jenkins master build emails since the first failure email found above (2 days ago), I see 40 messages. 26 out of 40 (65%) is not far from the expected 50% failure rate. I believe the ratio in the graph you sent, David (currently at 5.7%), is averaged over a week and includes failures from all branches (I did some other stats on Jenkins emails that tend to confirm this assumption).
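As a rough sanity check of the numbers above, here is a minimal Python sketch. It assumes each of the 40 emails corresponds to exactly one independent master run and that the randomization picks distributed cluster state updates with probability 0.5; both are simplifying assumptions, and the helper names are hypothetical.

```python
from math import comb


def failure_rate(failures: int, runs: int) -> float:
    """Observed fraction of failing runs."""
    return failures / runs


def binomial_tail_at_least(k: int, n: int, p: float) -> float:
    """Probability of seeing at least k failures in n independent runs,
    when each run fails with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Counts taken from the email search described above.
failures, runs = 26, 40

observed = failure_rate(failures, runs)
# Assumed per-run failure probability: 50% of runs use distributed updates.
tail = binomial_tail_at_least(failures, runs, 0.5)

print(f"observed failure rate: {observed:.1%}")
print(f"P(at least {failures} failures in {runs} runs at 50%): {tail:.3f}")
```

The tail probability only indicates how surprising 26 failures would be under the 50% assumption; it says nothing about the weekly, all-branch average shown in the graph.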
On Sun, Feb 21, 2021 at 10:53 AM Ilan Ginzburg <[email protected]> wrote:

> Yes Marcus this is the commit.
>
> David I would have expected 50% failures, as 50% of the runs use
> distributed updates. I’ll try to understand better as I fix the issue.
>
> Ilan
>
> On Sun 21 Feb 2021 at 06:17, David Smiley <[email protected]> wrote:
>
>> Interesting. Do you have a guess as to why the failures there are ~5%
>> and not 100% reproducible?
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Sat, Feb 20, 2021 at 6:41 PM Ilan Ginzburg <[email protected]> wrote:
>>
>>> Indeed the issue is due to my changes.
>>>
>>> In OverseerStatusCmd I've skipped some stat collection when running in
>>> distributed cluster state updates mode because I thought these were only
>>> stats related to cluster state updates.
>>> Obviously that was too aggressive and some of the stats are related to
>>> the Collection API.
>>>
>>> I will make sure to skip returning only the stats that are related to
>>> cluster state updater and restore returning collection api stats (when
>>> running in distributed cluster updates mode, otherwise all stats are
>>> returned).
>>>
>>> Tomorrow...
>>>
>>> Ilan
>>>
>>> On Sun, Feb 21, 2021 at 12:22 AM Ilan Ginzburg <[email protected]>
>>> wrote:
>>>
>>>> Thank you David for reporting this.
>>>>
>>>> Seems due to my recent changes. I reproduce the failure locally and
>>>> will look at this tomorrow.
>>>>
>>>> With the distributed cluster state updates I've introduced a
>>>> randomization for using either Overseer based cluster state updates or
>>>> distributed cluster state updates in tests. This failure seems to happen in
>>>> the distributed state update case. I suspect it is due to Overseer
>>>> returning less stats than expected by the test (which is expected: Overseer
>>>> cannot return stats about cluster state updates if it does not handle
>>>> cluster state updates).
>>>>
>>>> The following line in the logs tells that the run is using distributed
>>>> cluster state:
>>>> 972874 INFO (jetty-launcher-8973-thread-2) [ ]
>>>> o.a.s.c.DistributedClusterStateUpdater Creating
>>>> DistributedClusterStateUpdater with useDistributedStateUpdate=true. Solr
>>>> will be using distributed cluster state updates.
>>>>
>>>> Ilan
>>>>
>>>>
>>>> On Sat, Feb 20, 2021 at 3:00 PM David Smiley <[email protected]>
>>>> wrote:
>>>>
>>>>> I encountered a failure from OverseerStatusTest locally. According to
>>>>> our test failure trends, this guy only just recently started failing ~4-5%
>>>>> of the time, but previously was fine. Only master branch.
>>>>>
>>>>>
>>>>> http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.OverseerStatusTest.test
>>>>>
>>>>> ~ David Smiley
>>>>> Apache Lucene/Solr Search Developer
>>>>> http://www.linkedin.com/in/davidwsmiley
>>>>>
>>>>
