[ 
https://issues.apache.org/jira/browse/SOLR-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001429#comment-15001429
 ] 

Yonik Seeley commented on SOLR-6406:
------------------------------------

I was analyzing another "shards-out-of-sync" failure on trunk.
It looks like certain updates are just not being forwarded from the leader 
to a certain replica.

Working theory: the max connections per host of the HttpClient is being hit, 
starving updates from certain update threads.
This could account for why shutdownNow on the update executor service is having 
such an impact.  In an orderly shutdown, all scheduled jobs will still be run 
(I think), which means that connections will be released, and the updates that 
were being starved will get to proceed.  But it's for exactly this reason that 
we should probably keep the shutdownNow... it mimics much better what will 
happen in real-world situations when a node goes down.
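
To illustrate the difference between the two shutdown modes, here is a minimal sketch using only java.util.concurrent. It is not Solr's update executor: the class name ShutdownDemo, the single worker thread and the latch standing in for a request that holds a connection are all assumptions made for the example. It shows that shutdown() still runs jobs that were already queued, while shutdownNow() interrupts the running job and hands back the queued ones without running them.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownDemo {

    /** Returns {queued jobs that ran, queued jobs dropped by shutdownNow()}. */
    public static int[] run(boolean abrupt) throws InterruptedException {
        // Hypothetical stand-in for the update executor: one worker thread.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger ran = new AtomicInteger();

        // Occupies the only worker, like a request holding a pooled connection.
        executor.execute(() -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                // shutdownNow() interrupts the running job.
                Thread.currentThread().interrupt();
            }
        });
        started.await();

        // Three jobs queue up behind the blocked one.
        for (int i = 0; i < 3; i++) {
            executor.execute(ran::incrementAndGet);
        }

        int dropped = 0;
        if (abrupt) {
            // Queued jobs are removed from the queue and returned, never run.
            List<Runnable> neverStarted = executor.shutdownNow();
            dropped = neverStarted.size();
        } else {
            // No new jobs are accepted, but queued jobs still run.
            executor.shutdown();
        }
        release.countDown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
        return new int[] {ran.get(), dropped};
    }

    public static void main(String[] args) throws InterruptedException {
        int[] orderly = run(false);
        int[] abrupt = run(true);
        System.out.println("shutdown():    ran=" + orderly[0] + " dropped=" + orderly[1]);
        System.out.println("shutdownNow(): ran=" + abrupt[0] + " dropped=" + abrupt[1]);
    }
}
```

In the orderly case the three queued jobs complete once the blocked job is released. In the abrupt case none of them run, which is closer to what happens when a node actually dies.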

From this, it looks like max connections per host is 20:

{code}
13404 INFO  
(TEST-HdfsChaosMonkeyNothingIsSafeTest.test-seed#[A22375CC545D2B82]) [    ] 
o.a.s.h.c.HttpShardHandlerFactory created with socketTimeout : 90000,urlScheme 
: ,connTimeout : 15000,maxConnectionsPerHost : 20,maxConnections : 
10000,corePoolSize : 0,maximumPoolSize : 2147483647,maxThreadIdleTime : 
5,sizeOfQueue : -1,fairnessPolicy : false,useRetries : false,
{code}

The test used 12 nodes (and 2 shards)... increasing the chance of hitting the 
max connections (since all nodes run on the same host).
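
As a toy model of the suspected starvation (not HttpClient's actual pooling code), the sketch below uses a Semaphore with 20 permits as a stand-in for the per-host limit. The class name PerHostLimitDemo and the 10 ms wait are assumptions made for illustration. If no holder ever releases its connection, every request beyond the 20th fails to get one.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class PerHostLimitDemo {
    // Value taken from the HttpShardHandlerFactory log line quoted above.
    static final int MAX_CONNECTIONS_PER_HOST = 20;

    /**
     * Returns how many of the given requests obtain a connection when
     * no connection is ever released (the starvation scenario).
     */
    public static int served(int requests) throws InterruptedException {
        // Hypothetical stand-in for the per-host connection pool.
        Semaphore pool = new Semaphore(MAX_CONNECTIONS_PER_HOST);
        int served = 0;
        for (int i = 0; i < requests; i++) {
            // Wait briefly for a free connection, then give up.
            if (pool.tryAcquire(10, TimeUnit.MILLISECONDS)) {
                served++;
            }
        }
        return served;
    }

    public static void main(String[] args) throws InterruptedException {
        int requests = 25;
        int ok = served(requests);
        System.out.println("served=" + ok + " starved=" + (requests - ok));
    }
}
```

With 25 requests and nothing released, 20 are served and 5 are starved. A real pool would block those 5 until a connection is returned or a timeout fires.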


> ConcurrentUpdateSolrServer hang in blockUntilFinished.
> ------------------------------------------------------
>
>                 Key: SOLR-6406
>                 URL: https://issues.apache.org/jira/browse/SOLR-6406
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Miller
>            Assignee: Yonik Seeley
>             Fix For: 5.4, Trunk
>
>         Attachments: CPU Sampling.png, SOLR-6406.patch, SOLR-6406.patch, 
> SOLR-6406.patch
>
>
> Not sure what is causing this, but SOLR-6136 may have taken us a step back 
> here. I see this problem occasionally pop up in ChaosMonkeyNothingIsSafeTest 
> now - test fails because of a thread leak, thread leak is due to a 
> ConcurrentUpdateSolrServer hang in blockUntilFinished. Only started popping 
> up recently.



