Solr Star Burst - SolrCloud Performance / Scale

Mark Miller Wed, 30 May 2018 11:53:05 -0700

I've always said I wanted to focus on performance and scale for SolrCloud,
but for a long time that really just involved focusing on stability.


Now things have started to get pretty stable. Some things that made me
cringe about SolrCloud no longer do in 7.3/7.4.

Weeks back I found myself yet again looking for spurious, ugly issues
around fragile connections that cause recovery headaches and random request
failures. Again I made a change that should bring big improvements. Like many
times before.

I've had just about enough of that. Just about enough of broken connection
reuse. Just about enough of countless wasteful threads and connections
lurking and creaking all over. Just about enough of poor single update
performance and weaknesses in batch updates. Just about enough of the
painful ConcurrentUpdateSolrClient.

So much inefficiency hiding in plain sight. Stuff I always thought we would
overcome, but always far enough in the distance to keep me from feeling bad
that I didn't know quite how we would get there. Solr was a container
agnostic web application before Solr 5 for god's sake. Even a relatively
simple change like upgrading our HTTP client from version 3 to 4 was a
huge amount of work for very incremental improvements.

If I'm going to be excited about this system after all these years all of
that has to change.

I started looking into using HTTP/2 and a new HttpClient that can do
non-blocking IO async requests.

I thought upgrading Apache HttpClient from 3 to 4 was long, tedious, and
difficult. Going to a fully different client has made me reconsider that. I
did a lot of the work, but a good amount remains (security, finish SSL,
tuning ...).

I wrote a new Http2SolrClient that can replace HttpSolrClient and plug into
CloudSolrClient and LBHttpSolrClient. I added some early async APIs.
Non-blocking IO async is about as oversold as "schemaless", but it's a
great tool to have available as well.
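
To make the async idea concrete, here is a rough sketch of fanning one
request out to several replicas and collecting the results when they
complete. It uses only the JDK. The class and method names (AsyncFanOutSketch,
fakeRequest, sendToAll) are hypothetical and are not the Http2SolrClient API,
and the thread pool only stands in for real non-blocking IO.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Hypothetical sketch: illustrative names, not the Http2SolrClient API.
class AsyncFanOutSketch {

  // Stand-in for a remote call. A real client would do non-blocking IO here.
  static String fakeRequest(String replica) {
    return "ok:" + replica;
  }

  // Start one request per replica without waiting on each in turn,
  // then gather the results (in replica order) once all have completed.
  static List<String> sendToAll(List<String> replicas, ExecutorService pool) {
    List<CompletableFuture<String>> futures = replicas.stream()
        .map(r -> CompletableFuture.supplyAsync(() -> fakeRequest(r), pool))
        .collect(Collectors.toList());
    return futures.stream()
        .map(CompletableFuture::join)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      System.out.println(sendToAll(Arrays.asList("replica1", "replica2"), pool));
    } finally {
      pool.shutdown();
    }
  }
}
```

The caller only blocks once, at the join, rather than once per replica.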

I'm now working in a much more efficient world, aiming for 1 connection per
CoreContainer per remote destination. Connections are no longer fragile.
The transfer protocol is no longer text based.

Yonik should be pleased with the state of reordered updates from leader to
replica.

I replaced our CUSC usage for distributing updates with Http2SolrClient and
async calls.

I played with optionally using the async calls in the HttpShardHandler as
well.

I replaced all HttpSolrClient usage with Http2SolrClient.

I started to get control of threads. I had control of connections.

I added early efficient external request throttling.
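
One simple way to throttle external requests is to cap the number in
flight and turn away the excess right away instead of queueing it. The
sketch below assumes that approach and uses a plain JDK Semaphore. The
class name and the "busy" response are hypothetical, and this is not
necessarily how the throttling in the fork works.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch of in-flight request throttling, JDK only.
class RequestThrottleSketch {
  private final Semaphore permits;

  RequestThrottleSketch(int maxInFlight) {
    this.permits = new Semaphore(maxInFlight);
  }

  // Runs the handler if a permit is free. Otherwise returns busyResponse
  // immediately, without blocking a thread while waiting for capacity.
  <T> T handle(Supplier<T> handler, T busyResponse) {
    if (!permits.tryAcquire()) {
      return busyResponse;
    }
    try {
      return handler.get();
    } finally {
      permits.release(); // always give the permit back, even on failure
    }
  }

  public static void main(String[] args) {
    RequestThrottleSketch throttle = new RequestThrottleSketch(1);
    System.out.println(throttle.handle(() -> "handled", "busy"));
  }
}
```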

I started tuning resource pools.

I started removing sleep polling loops. They are horrible, and they slow
tests especially. We already have a replacement that we are hardly using.
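
For anyone unfamiliar with the pattern, here is a generic sketch of the
difference between sleep polling and waiting on a notification. It uses a
JDK CountDownLatch purely for illustration. The class and method names are
hypothetical, and this is not the existing Solr replacement mentioned above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: sleep polling versus waiting on a notification.
class WaitWithoutPollingSketch {
  private volatile boolean ready = false;
  private final CountDownLatch readyLatch = new CountDownLatch(1);

  void markReady() {
    ready = true;
    readyLatch.countDown();
  }

  // Sleep polling: wakes up on a fixed interval, so the caller can lose up
  // to sleepMs after the condition becomes true, on every single wait.
  boolean waitByPolling(long timeoutMs, long sleepMs) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!ready) {
      if (System.nanoTime() - deadline >= 0) return false;
      try {
        Thread.sleep(sleepMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  // Notification: returns as soon as markReady() is called, or false on
  // timeout or interruption.
  boolean waitByNotification(long timeoutMs) {
    try {
      return readyLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

In a test suite with thousands of such waits, the time lost to the sleep
interval adds up quickly.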

I did some other related stuff. I'm just fixing the main things I hate
along these communication/resource-usage/scale/perf themes.

I'm calling this whole effort Star Burst:
https://github.com/markrmiller/starburst

I've done a ton, mostly very late at night. It's not all perfect yet, and
some of it may be exploratory. There is a lot to do to wrap it up with a
bow. This touches a lot of spots; our surface area of features is just huge
now.

Basically I have a high performance Solr fork at the moment (only set up for
tests, not actually running standalone Solr). I don't know how or when (or
to be completely honest, if) it comes home. I'm going to do what I can, but
it's likely to require more than me to be successful in a reasonable time
frame.

I have a couple of JIRA issues open for HTTP/2 and the new SolrClient.

Mark


-- 
- Mark
about.me/markrmiller
