I've always said I wanted to focus on performance and scale for SolrCloud, but for a long time that really just involved focusing on stability.
Now things have started to get pretty stable. Some things that made me cringe about SolrCloud no longer do in 7.3/7.4. Weeks back, I found myself yet again looking for spurious, ugly issues around fragile connections that cause recovery headaches and random request failures. Again I made a change that should bring big improvements, like so many times before. I've had just about enough of that. Just about enough of broken connection reuse. Just about enough of countless wasteful threads and connections lurking and creaking all over. Just about enough of poor single-update performance and weaknesses in batch updates. Just about enough of the painful ConcurrentUpdateSolrClient. So much inefficiency hiding in plain sight.
Stuff I always thought we would overcome, but that was always far enough off in the distance to keep me from feeling bad that I didn't know quite how we would get there. Solr was a container-agnostic web application before Solr 5, for god's sake. Even a relatively simple change like upgrading our HTTP client from version 3 to 4 was a huge amount of work for very incremental improvements. If I'm going to be excited about this system after all these years, all of that has to change.
I started looking into using HTTP/2 and a new HttpClient that can make non-blocking, async IO requests. I thought upgrading Apache HttpClient from 3 to 4 was long, tedious, and difficult; going to a completely different client has made me reconsider that. I did a lot of the work, but a good amount remains (security, finishing SSL, tuning, etc.). I wrote a new Http2SolrClient that can replace HttpSolrClient and plug into CloudSolrClient and LBHttpSolrClient. I added some early async APIs. Non-blocking async IO is about as oversold as "schemaless", but it's still a great tool to have available. I'm now working in a much more efficient world, aiming for one connection per CoreContainer per remote destination. Connections are no longer fragile. The transfer protocol is no longer text-based.
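To make "one shared async client, many requests in flight" a bit more concrete, here is a minimal sketch assuming JDK 11+. It uses the JDK's built-in java.net.http.HttpClient only because it needs no extra dependencies; it is not Http2SolrClient and only stands in for whatever client that wraps. The class AsyncFanOut, its methods, the local echo server, and the /replicaN/update paths are all hypothetical stand-ins for a leader sending requests to several replicas.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical example class; not part of Solr or the starburst repo.
public class AsyncFanOut {

  // One shared client for the whole process (think "per CoreContainer") instead of a
  // client and connection pool per caller. Over HTTP/2 it can multiplex concurrent
  // requests to the same host:port on one connection; it falls back to HTTP/1.1
  // when the server doesn't negotiate HTTP/2.
  static final HttpClient CLIENT =
      HttpClient.newBuilder().version(HttpClient.Version.HTTP_2).build();

  // Dispatches one async request per target before waiting on any of them.
  // No thread is tied up per in-flight request; only the caller blocks, in join().
  static List<String> fanOut(List<URI> targets) {
    List<CompletableFuture<String>> pending = targets.stream()
        .map(uri -> HttpRequest.newBuilder(uri).GET().build())
        .map(req -> CLIENT.sendAsync(req, HttpResponse.BodyHandlers.ofString())
            .thenApply(resp -> resp.statusCode() + " " + resp.body()))
        .collect(Collectors.toList());
    // Join in list order so results line up with targets.
    return pending.stream().map(CompletableFuture::join).collect(Collectors.toList());
  }

  // Tiny local HTTP/1.1 server that echoes the request path, so the sketch runs without Solr.
  static HttpServer startEchoServer() {
    try {
      HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
      server.createContext("/", exchange -> {
        byte[] body = exchange.getRequestURI().getPath().getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
          out.write(body);
        }
      });
      server.start();
      return server;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    HttpServer server = startEchoServer();
    try {
      String base = "http://127.0.0.1:" + server.getAddress().getPort();
      List<URI> replicas = List.of(
          URI.create(base + "/replica1/update"),
          URI.create(base + "/replica2/update"),
          URI.create(base + "/replica3/update"));
      fanOut(replicas).forEach(System.out::println);
    } finally {
      server.stop(0);
    }
  }
}
```

Running main should print one `200 /replicaN/update` line per target, in order. The point of the shape is that all requests are dispatched before the first join, so the number of threads doesn't grow with the number of outstanding requests; against a server that actually speaks HTTP/2, those requests can share a single connection per destination, which is the property the post is after. The local test server here only speaks HTTP/1.1, so it exercises the async fan-out, not the multiplexing.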
Yonik should be pleased with the state of reordered updates from leader to replica. I replaced our CUSC (ConcurrentUpdateSolrClient) usage for distributing updates with Http2SolrClient and async calls. I played with optionally using the async calls in the HttpShardHandler as well. I replaced all HttpSolrClient usage with Http2SolrClient. I started to get control of threads, and I got control of connections. I added early, efficient throttling of external requests. I started tuning resource pools. I started removing sleep-polling loops. They are horrible and slow things down, tests especially, and we already have a replacement that we are hardly using. I did some other related stuff.
I'm just fixing the main things I hate along these communication/resource-usage/scale/perf themes. I'm calling this whole effort Star Burst: https://github.com/markrmiller/starburst
I've done a ton, mostly very late at night. It's not all perfect yet, and some of it may be exploratory. There is a lot left to do to wrap it up with a bow. This touches a lot of spots; our surface area of features is just huge now. Basically, I have a high-performance Solr fork at the moment (only set up for tests, not actually running standalone Solr). I don't know how or when (or, to be completely honest, if) it comes home. I'm going to do what I can, but it's likely to require more than me to be successful in a reasonable time frame. I have a couple of JIRA issues open for HTTP/2 and the new SolrClient.
Mark
-- 
- Mark
about.me/markrmiller
