[
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461686#comment-16461686
]
Mark Miller commented on SOLR-11881:
------------------------------------
Yeah, now I remember why forward has a retry and why it was so high - same
issue, to survive chaos monkey tests, even when you run them longer and if you
run them over and over. So at least for the forwarding, I wouldn't lower it
much without good confidence with beasting chaos monkey tests running a good
amount of time (default test run times are somewhat low).
Basically, update forwarding to the leader allows the cloud client to fall back
to sending to non leaders and get held up rather than having those updates fail
and forcing the user to resolve it. Perhaps the client should just block
updates itself for a while waiting to see a leader - but then it has to have
kind of special logic - right now even a php client could take advantage of
this by just falling back to sending updates to non leaders while failover
happens.
I have no problem with updates from leader to replica retrying less.
> Connection Reset Causing LIR
> ----------------------------
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Varun Thacker
> Assignee: Varun Thacker
> Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch,
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX
> r:core_node56 collection_shardX_replicaY]
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy
> working SolrCloud cluster, even a rare response like this from a replica can
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET
> requests. Updates are POST requests
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since
> they have been versioned.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]