Hi Mark, I've started glancing at the repo, and some of the issues you are addressing here will make things a lot more stable under high load. I'll look at it in a little more detail in the coming days.
The key would be how to isolate the work into discrete chunks that we can then make JIRAs for. SOLR-12405 is the first thing that caught my eye: it's an isolated JIRA and can be tackled without the HTTP/2 client, etc.

On Wed, May 30, 2018 at 4:13 PM, Mark Miller <[email protected]> wrote:

> Some of the fallout of this should be huge improvements to our tests. Right now, some of them take so long because no one even notices when they have done things to make the situation even worse, and it's hard to monitor resource usage as we develop with it already fairly unbounded.
>
> On master right now, on a lucky run (no tlog replica type for sure), BasicDistributedZkTest takes 76 seconds on my 6-core machine from 2012. Depending on how hard test injection hits, I've seen a few minutes and anywhere in between.
>
> Setting the tlog replica issue aside (I've disabled it for the moment, but I have fixed that issue by changing how distrib commits work), on the starburst branch, resource usage with multiple parallel tests running is going to be much, much better. For single cloud tests, performance is mostly about removing naive polling and carefree resource usage. The branch has big improvements for single and parallel tests already.
>
> I don't know how much is left to fix, but already, on starburst, BasicDistributedZkTest takes 45 seconds vs. master's best case of 76.
>
> - Mark
>
> On Wed, May 30, 2018 at 1:52 PM Mark Miller <[email protected]> wrote:
>
>> I've always said I wanted to focus on performance and scale for SolrCloud, but for a long time that really just involved focusing on stability.
>>
>> Now things have started to get pretty stable. Some things that made me cringe about SolrCloud no longer do in 7.3/7.4.
>>
>> Weeks back I found myself yet again looking for spurious, ugly issues around fragile connections that cause recovery headaches and random request failures. Again I made a change that should bring big improvements. Like many times before.
>>
>> I've had just about enough of that. Just about enough of broken connection reuse. Just about enough of countless wasteful threads and connections lurking and creaking all over. Just about enough of poor single-update performance and weaknesses in batch updates. Just about enough of the painful ConcurrentUpdateSolrClient.
>>
>> So much inefficiency hiding in plain sight. Stuff I always thought we would overcome, but always far enough in the distance to keep me from feeling bad that I didn't know quite how we would get there. Solr was a container-agnostic web application before Solr 5, for god's sake. Even relatively simple changes like upgrading our http client from version 3 to 4 were a huge amount of work for very incremental improvements.
>>
>> If I'm going to be excited about this system after all these years, all of that has to change.
>>
>> I started looking into HTTP/2 and a new HttpClient that can do non-blocking, async IO requests.
>>
>> I thought upgrading Apache HttpClient from 3 to 4 was long, tedious, and difficult. Going to a completely different client has made me reconsider that. I did a lot of the work, but a good amount remains (security, finishing SSL, tuning ...).
>>
>> I wrote a new Http2SolrClient that can replace HttpSolrClient and plug into CloudSolrClient and LBHttpSolrClient. I added some early async APIs. Non-blocking async IO is about as oversold as "schemaless", but it's a great tool to have available as well.
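
To make the non-blocking part concrete: the thread doesn't show what the new client's API looks like, so the sketch below only illustrates the general shape of an async HTTP/2 request. It assumes Jetty's 9.4-era HttpClient as the underlying client (an assumption of this sketch, not something the email states), and the URL and class name are made up for illustration.

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.client.api.Result;
    import org.eclipse.jetty.client.util.BufferingResponseListener;
    import org.eclipse.jetty.http2.client.HTTP2Client;
    import org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2;

    public class AsyncHttp2Sketch {
      public static void main(String[] args) throws Exception {
        // One HTTP/2-capable client, shared and reused; requests are multiplexed
        // over a single connection instead of opening a socket per request.
        HttpClient client =
            new HttpClient(new HttpClientTransportOverHTTP2(new HTTP2Client()), null);
        client.start();

        // Send without blocking the calling thread; the listener fires once the
        // whole response has been buffered.
        client.newRequest("http://localhost:8983/solr/collection1/select?q=*:*")
            .send(new BufferingResponseListener() {
              @Override
              public void onComplete(Result result) {
                if (result.isSucceeded()) {
                  System.out.println(getContentAsString());
                } else {
                  result.getFailure().printStackTrace();
                }
              }
            });

        // ... the calling thread is free to do other work here ...
        Thread.sleep(2000); // crude wait for the demo; a real caller would track completion
        client.stop();
      }
    }

The property that matters is that the calling thread never parks waiting on a response, and many in-flight requests can share one multiplexed connection.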
>>
>> I'm now working in a much more efficient world, aiming for 1 connection per CoreContainer per remote destination. Connections are no longer fragile. The transfer protocol is no longer text based.
>>
>> Yonik should be pleased with the state of reordered updates from leader to replica.
>>
>> I replaced our CUSC usage for distributing updates with Http2SolrClient and async calls.
>>
>> I played with optionally using the async calls in the HttpShardHandler as well.
>>
>> I replaced all HttpSolrClient usage with Http2SolrClient.
>>
>> I started to get control of threads. I had control of connections.
>>
>> I added early, efficient external request throttling.
>>
>> I started tuning resource pools.
>>
>> I started removing sleep polling loops. They are horrible and they slow tests down especially; we already have a replacement we are hardly using.
>>
>> I did some other related stuff. I'm just fixing the main things I hate along these communication/resource-usage/scale/perf themes.
>>
>> I'm calling this whole effort Star Burst: https://github.com/markrmiller/starburst
>>
>> I've done a ton, mostly very late at night. It's not all perfect yet, and some of it may be exploratory. There is a lot to do to wrap it up with a bow. This touches a lot of spots; our surface area of features is just huge now.
>>
>> Basically I have a high-performance Solr fork at the moment (only set up for tests, not for actually running standalone Solr). I don't know how or when (or, to be completely honest, if) it comes home. I'm going to do what I can, but it's likely to require more than me to be successful in a reasonable time frame.
>>
>> I have a couple of JIRA issues open for HTTP/2 and the new SolrClient.
>>
>> Mark
>>
>>
>> --
>> - Mark
>> about.me/markrmiller
>>
> --
> - Mark
> about.me/markrmiller
>
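
The "early efficient external request throttling" isn't described in any detail in the thread, so the following is only a generic illustration of the idea, not the branch's code: bound the number of concurrent external requests with a semaphore so a load spike can't exhaust threads and connections. The class and method names are invented for the sketch.

    import java.util.concurrent.Callable;
    import java.util.concurrent.Semaphore;

    /** Conceptual sketch only: cap concurrent external requests. */
    public class RequestThrottle {
      private final Semaphore permits;

      public RequestThrottle(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
      }

      public <T> T withPermit(Callable<T> request) throws Exception {
        permits.acquire();          // block while the node is saturated
        try {
          return request.call();
        } finally {
          permits.release();        // always free the permit
        }
      }
    }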
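
On the sleep-polling loops: the "replacement we are hardly using" is presumably the collection-state watcher API that SolrJ already ships (ZkStateReader.waitForState with a CollectionStatePredicate). If that reading is right, a test that used to spin on Thread.sleep can block on a ZooKeeper notification instead, roughly like this (timeout and predicate are illustrative):

    import java.util.concurrent.TimeUnit;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class WaitForActiveSketch {
      // Instead of: while (!allReplicasActive(...)) Thread.sleep(500);
      // block until a cluster-state change satisfies the predicate.
      static void waitForAllReplicasActive(ZkStateReader reader, String collection)
          throws Exception {
        reader.waitForState(collection, 30, TimeUnit.SECONDS,
            (liveNodes, collectionState) -> {
              if (collectionState == null) return false;
              for (Replica replica : collectionState.getReplicas()) {
                if (replica.getState() != Replica.State.ACTIVE
                    || !liveNodes.contains(replica.getNodeName())) {
                  return false;
                }
              }
              return true;
            });
      }
    }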
