Hi Mark, I've started glancing at the repo, and some of the issues you are addressing here will make things a lot more stable under high load. I'll look at it in a little more detail in the coming days.
The key would be how to isolate the work into discrete chunks that we can then make JIRAs for. SOLR-12405 is the first thing that caught my eye: it's an isolated JIRA and can be tackled without the HTTP/2 client, etc.

On Wed, May 30, 2018 at 4:13 PM, Mark Miller <[email protected]> wrote:

> Some of the fallout of this should be huge improvements to our tests. Right now, some of them take so long because no one even notices when they have done things to make the situation even worse, and it's hard to monitor resource usage as we develop with it already fairly unbounded.
>
> On master right now, on a lucky run (no tlog replica type for sure), BasicDistributedZkTest takes 76 seconds on my 6-core machine from 2012. Depending on how hard test injection hits, I've seen a few minutes and anywhere in between.
>
> Setting the tlog replica issue aside (I've disabled it for the moment, but I have fixed that issue by changing how distrib commits work), on the starburst branch, resource usage with multiple parallel tests running is going to be much, much better. For single cloud tests, performance is mostly about removing naive polling and carefree resource usage. The branch has big improvements for single and parallel tests already.
>
> I don't know how much is left to fix, but already, on starburst, BasicDistributedZkTest takes 45 seconds vs. master's best case of 76.
>
> - Mark
>
> On Wed, May 30, 2018 at 1:52 PM Mark Miller <[email protected]> wrote:
>
>> I've always said I wanted to focus on performance and scale for SolrCloud, but for a long time that really just involved focusing on stability.
>>
>> Now things have started to get pretty stable. Some things that made me cringe about SolrCloud no longer do in 7.3/7.4.
>>
>> Weeks back I found myself yet again looking for spurious, ugly issues around fragile connections that cause recovery headaches and random request failures. Again I made a change that should bring big improvements. Like many times before.
>>
>> I've had just about enough of that. Just about enough of broken connection reuse. Just about enough of countless wasteful threads and connections lurking and creaking all over. Just about enough of poor single-update performance and weaknesses in batch updates. Just about enough of the painful ConcurrentUpdateSolrClient.
>>
>> So much inefficiency hiding in plain sight. Stuff I always thought we would overcome, but always far enough in the distance to keep me from feeling bad that I didn't know quite how we would get there. Solr was a container-agnostic web application before Solr 5, for god's sake. Even relatively simple changes like upgrading our http client from version 3 to 4 were a huge amount of work for very incremental improvements.
>>
>> If I'm going to be excited about this system after all these years, all of that has to change.
>>
>> I started looking into HTTP/2 and a new HttpClient that can do non-blocking, async IO requests.
>>
>> I thought upgrading Apache HttpClient from 3 to 4 was long, tedious, and difficult. Going to a completely different client has made me reconsider that. I did a lot of the work, but a good amount remains (security, finishing SSL, tuning ...).
>>
>> I wrote a new Http2SolrClient that can replace HttpSolrClient and plug into CloudSolrClient and LBHttpSolrClient. I added some early async APIs. Non-blocking async IO is about as oversold as "schemaless", but it's a great tool to have available as well.
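
To make the non-blocking part concrete: the thread doesn't show what the new client's API looks like, so the sketch below only illustrates the general shape of an async HTTP/2 request. It assumes Jetty's 9.4-era HttpClient as the underlying client (an assumption of this sketch, not something the email states), and the URL and class name are made up for illustration.

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.client.api.Result;
    import org.eclipse.jetty.client.util.BufferingResponseListener;
    import org.eclipse.jetty.http2.client.HTTP2Client;
    import org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2;

    public class AsyncHttp2Sketch {
      public static void main(String[] args) throws Exception {
        // One HTTP/2-capable client, shared and reused; requests are multiplexed
        // over a single connection instead of opening a socket per request.
        HttpClient client =
            new HttpClient(new HttpClientTransportOverHTTP2(new HTTP2Client()), null);
        client.start();

        // Send without blocking the calling thread; the listener fires once the
        // whole response has been buffered.
        client.newRequest("http://localhost:8983/solr/collection1/select?q=*:*")
            .send(new BufferingResponseListener() {
              @Override
              public void onComplete(Result result) {
                if (result.isSucceeded()) {
                  System.out.println(getContentAsString());
                } else {
                  result.getFailure().printStackTrace();
                }
              }
            });

        // ... the calling thread is free to do other work here ...
        Thread.sleep(2000); // crude wait for the demo; a real caller would track completion
        client.stop();
      }
    }

The property that matters is that the calling thread never parks waiting on a response, and many in-flight requests can share one multiplexed connection.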
>>
>> I'm now working in a much more efficient world, aiming for 1 connection per CoreContainer per remote destination. Connections are no longer fragile. The transfer protocol is no longer text based.
>>
>> Yonik should be pleased with the state of reordered updates from leader to replica.
>>
>> I replaced our CUSC usage for distributing updates with Http2SolrClient and async calls.
>>
>> I played with optionally using the async calls in the HttpShardHandler as well.
>>
>> I replaced all HttpSolrClient usage with Http2SolrClient.
>>
>> I started to get control of threads. I had control of connections.
>>
>> I added early, efficient external request throttling.
>>
>> I started tuning resource pools.
>>
>> I started removing sleep polling loops. They are horrible and they slow tests down especially; we already have a replacement we are hardly using.
>>
>> I did some other related stuff. I'm just fixing the main things I hate along these communication/resource-usage/scale/perf themes.
>>
>> I'm calling this whole effort Star Burst: https://github.com/markrmiller/starburst
>>
>> I've done a ton, mostly very late at night. It's not all perfect yet, and some of it may be exploratory. There is a lot to do to wrap it up with a bow. This touches a lot of spots; our surface area of features is just huge now.
>>
>> Basically I have a high-performance Solr fork at the moment (only set up for tests, not for actually running standalone Solr). I don't know how or when (or, to be completely honest, if) it comes home. I'm going to do what I can, but it's likely to require more than me to be successful in a reasonable time frame.
>>
>> I have a couple of JIRA issues open for HTTP/2 and the new SolrClient.
>>
>> Mark
>>
>>
>> --
>> - Mark
>> about.me/markrmiller
>>
> --
> - Mark
> about.me/markrmiller
>
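
The "early efficient external request throttling" isn't described in any detail in the thread, so the following is only a generic illustration of the idea, not the branch's code: bound the number of concurrent external requests with a semaphore so a load spike can't exhaust threads and connections. The class and method names are invented for the sketch.

    import java.util.concurrent.Callable;
    import java.util.concurrent.Semaphore;

    /** Conceptual sketch only: cap concurrent external requests. */
    public class RequestThrottle {
      private final Semaphore permits;

      public RequestThrottle(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
      }

      public <T> T withPermit(Callable<T> request) throws Exception {
        permits.acquire();          // block while the node is saturated
        try {
          return request.call();
        } finally {
          permits.release();        // always free the permit
        }
      }
    }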
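
On the sleep-polling loops: the "replacement we are hardly using" is presumably the collection-state watcher API that SolrJ already ships (ZkStateReader.waitForState with a CollectionStatePredicate). If that reading is right, a test that used to spin on Thread.sleep can block on a ZooKeeper notification instead, roughly like this (timeout and predicate are illustrative):

    import java.util.concurrent.TimeUnit;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class WaitForActiveSketch {
      // Instead of: while (!allReplicasActive(...)) Thread.sleep(500);
      // block until a cluster-state change satisfies the predicate.
      static void waitForAllReplicasActive(ZkStateReader reader, String collection)
          throws Exception {
        reader.waitForState(collection, 30, TimeUnit.SECONDS,
            (liveNodes, collectionState) -> {
              if (collectionState == null) return false;
              for (Replica replica : collectionState.getReplicas()) {
                if (replica.getState() != Replica.State.ACTIVE
                    || !liveNodes.contains(replica.getNodeName())) {
                  return false;
                }
              }
              return true;
            });
      }
    }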
