I agree, Michael. We should add more functional validation to the benchmarks now. That is the lesson from this episode.
On Tue, 17 Jan, 2023, 11:13 pm Michael Gibney (Jira), <j...@apache.org> wrote:

>
> [
> https://issues.apache.org/jira/browse/SOLR-16622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677868#comment-17677868
> ]
>
> Michael Gibney commented on SOLR-16622:
> ---------------------------------------
>
> Thanks for this extra context, it's really helpful.
>
> {quote}this just shows that our testing is inadequate at the moment{quote}
>
> That makes sense broadly, IMO with some caveats (below). To state the
> obvious: these are basically integration tests, and by nature are going to
> be difficult to reproduce reliably, no matter how we proceed.
>
> On the one hand I agree it is fair to characterize this particular case as
> a functional regression -- on the other hand "our testing is inadequate"
> could easily be read as suggesting that existing unit tests and bats
> integration tests should do a better job of covering these types of issues,
> which I think would be misleading given the inherent challenges involved
> with regularly running integration tests. Really, the existing test suite
> is simply not designed to catch these kinds of "integration test" issues,
> and even "bats" integration tests would be difficult to adapt to serve the
> purpose of catching issues that only crop up when running at scale.
>
> "Straw man" argument: we could just lean in to periodic benchmarks helping
> to catch these types of issues. The overhead of running integration tests
> at scale would be significant. Even if the original intention of periodic
> benchmarks is to evaluate performance, it may be ok (not really a problem)
> that we end up catching some "integration test"-style issues as a
> consequence. (to be clear, I'm kinda just thinking out loud; neither
> assuming you agree nor disagree, Ishan!).
>
>
> > Replicas don't come up active after node restart
> > ------------------------------------------------
> >
> >                 Key: SOLR-16622
> >                 URL: https://issues.apache.org/jira/browse/SOLR-16622
> >             Project: Solr
> >          Issue Type: Bug
> >      Security Level: Public(Default Security Level. Issues are Public)
> >            Reporter: Ishan Chattopadhyaya
> >            Priority: Major
> >             Fix For: 9.1.1
> >
> >         Attachments: Screenshot from 2023-01-17 15-03-05.png
> >
> >
> > While benchmarking for performance, we saw a sharp change in the graphs:
> > https://issues.apache.org/jira/browse/SOLR-16525?focusedCommentId=17676725&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17676725
> > Turns out there was a commit (SOLR-16414) that escaped all testing and
> > caused a regression where restarted nodes didn't have the replicas coming
> > up as active.
> > This affects 9.1 release, so opening a new JIRA issue to track it.
> > Here's how to reproduce it:
> > {code}
> > git clone https://github.com/fullstorydev/solr-bench
> > cd solr-bench
> > # prerequisites on ubuntu:
> > sudo apt install openjdk-11-jdk
> > sudo apt install wget unzip zip ant ivy lsof git netcat make maven jq
> > # this is a patch to comment out the cleanup/final shutdown
> > wget https://termbin.com/yuu95
> > git apply yuu95
> > mvn clean compile assembly:single
> > ./cleanup.sh && ./stress.sh -c aa4f3d98ab19c201e7f3c74cd14c99174148616d suites/stress-facets-local.json
> > {code}
> > If the 95th percentile is <10 or so, we have a problem. It should be >300
> > or so. Since, we disabled cleanup, we can hit http://localhost:50000/solr/
> > to open Solr UI. In this case, I see that querying to the ecommerce-events
> > collection shows shard2 is down.
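
One shape the functional validation could take, as a minimal sketch rather than anything solr-bench does today: after the restart step, call the Collections API CLUSTERSTATUS action and fail the run if any replica is not active. The helper names below are hypothetical, and the default base URL is just the local port from the reproduction steps above.

```python
import json
import urllib.request


def inactive_replicas(cluster_status):
    """Return (collection, shard, replica, state) for every replica whose
    state is not 'active' in a parsed CLUSTERSTATUS response."""
    problems = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state")
                if state != "active":
                    problems.append((coll_name, shard_name, replica_name, state))
    return problems


def fetch_cluster_status(base_url="http://localhost:50000/solr"):
    """Fetch CLUSTERSTATUS as a dict. The base URL is an assumption taken
    from the local reproduction; adjust it for other setups."""
    url = base_url + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

A benchmark task could call `inactive_replicas(fetch_cluster_status())` once the restarted nodes are expected to be back and mark the run as failed if the list is non-empty. In practice it would probably need to poll with a timeout, since replicas can legitimately spend some time recovering after a restart.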