One side effect of this should be huge improvements to our tests. Right now, some of them take as long as they do because no one notices when a change makes the situation even worse, and it's hard to monitor resource usage during development when it's already fairly unbounded.

On master right now, on a lucky run (no tlog replica type for sure), BasicDistributedZkTest takes 76 seconds on my 6 core machine from 2012. Depending on how hard test injection hits, I've seen it take a few minutes, and anything in between.

Setting the tlog replica issue aside (I've disabled it for the moment, but I have fixed that issue by changing how distrib commits work), on the starburst branch, resource usage with multiple parallel tests running is going to be much, much better. For single cloud tests, performance is mostly about removing naive polling and carefree resource usage.

The branch has big improvements for single and parallel tests already. I don't know how much there is left to fix, but already, on starburst, BasicDistributedZkTest takes 45 seconds vs master's 76 best case.

- Mark

On Wed, May 30, 2018 at 1:52 PM Mark Miller <[email protected]> wrote:

> I've always said I wanted to focus on performance and scale for SolrCloud,
> but for a long time that really just involved focusing on stability.
>
> Now things have started to get pretty stable. Some things that made me
> cringe about SolrCloud no longer do in 7.3/7.4.
>
> Weeks back I found myself yet again looking for spurious, ugly issues
> around fragile connections that cause recovery headaches and random request
> fails. Again I made a change that should bring big improvements. Like many
> times before.
>
> I've had just about enough of that. Just about enough of broken connection
> reuse. Just about enough of countless wasteful threads and connections
> lurking and creaking all over. Just about enough of poor single update
> performance and weaknesses in batch updates. Just about enough of the
> painful ConcurrentUpdateSolrClient.
>
> So much inefficiency hiding in plain sight. Stuff I always thought we
> would overcome, but always far enough in the distance to keep me from
> feeling bad that I didn't know quite how we would get there.
> Solr was a container agnostic web application before Solr 5 for god's
> sake. Even relatively simple changes like upgrading our http client from
> version 3 to 4 was a huge amount of work for very incremental
> improvements.
>
> If I'm going to be excited about this system after all these years all of
> that has to change.
>
> I started looking into using a HTTP/2 and a new HttpClient that can do non
> blocking IO async requests.
>
> I thought upgrading Apache HttpClient from 3 to 4 was long, tedious, and
> difficult. Going to a fully different client has made me reconsider that. I
> did a lot of the work, but a good amount remains (security, finish SSL,
> tuning ...).
>
> I wrote a new Http2SolrClient that can replace HttpSolrClient and plug
> into CloudSolrClient and LBHttpSolrClient. I added some early async APIs.
> Non blocking IO async is about as oversold as "schemaless", but it's a
> great tool to have available as well.
>
> I'm now working in a much more efficient world, aiming for 1 connection
> per CoreContainer per remote destination. Connections are no longer
> fragile. The transfer protocol is no longer text based.
>
> Yonik should be pleased with the state of reordered updates from leader to
> replica.
>
> I replaced our CUSC usage for distributing updates with Http2SolrClient
> and async calls.
>
> I played with optionally using the async calls in the HttpShardHandler as
> well.
>
> I replaced all HttpSolrClient usage with Http2SolrClient.
>
> I started to get control of threads. I had control of connections.
>
> I added early efficient external request throttling.
>
> I started tuning resource pools.
>
> I started removing sleep polling loops. They are horrible and slow tests
> especially, we already have a replacement we are hardly using.
>
> I did some other related stuff. I'm just fixing the main things I hate
> along these communication/resource-usage/scale/perf themes.
>
> I'm calling this whole effort Star Burst:
> https://github.com/markrmiller/starburst
>
> I've done a ton. Mostly very late at night, it's not all perfect yet, some
> of it may be exploratory. There is a lot to do to wrap it up with a bow.
> This touches a lot of spots, our surface area of features is just huge now.
>
> Basically I have a high performance Solr fork at the moment (only setup
> for tests, not actually running stand alone Solr). I don't know how or when
> (or to be completely honest, if) it comes home. I'm going to do what I can,
> but it's likely to require more than me to be successful in a reasonable
> time frame.
>
> I have a couple JIRA issues open for HTTP/2 and the new SolrClient.
>
> Mark
>
>
> --
> - Mark
> about.me/markrmiller
>
--
- Mark
about.me/markrmiller
