[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464502#comment-16464502 ]
Varun Thacker commented on SOLR-11881:
--------------------------------------

So recently I've been seeing this problem in this form:

* The replica gets a ReadPendingException from Jetty:
{code:java}
date time WARN [qtp768306356-580185] ? (:) - java.nio.channels.ReadPendingException: null
	at org.eclipse.jetty.io.FillInterest.register(FillInterest.java:58) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
	at org.eclipse.jetty.io.AbstractEndPoint.fillInterested(AbstractEndPoint.java:353) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
{code}
* The leader keeps waiting until the socket timeout, then gets a socket timeout exception and puts the replica into recovery.

So I took Tomás' latest patch and added SocketTimeoutException to the {{isRetriableException}} check.

Q: Which exceptions should we retry on? Currently in the patch we have SocketException / NoHttpResponseException.

Once I added SocketTimeoutException as a retriable exception, I set the socket timeout to 100ms and sent updates to manually test whether Solr retries correctly. To my surprise I was never able to hit a socket timeout exception. After some debugging, here's why. In ConcurrentUpdateSolrClient we do this:
{code:java}
org.apache.http.client.config.RequestConfig.Builder requestConfigBuilder = HttpClientUtil.createDefaultRequestConfigBuilder();
if (soTimeout != null) {
  requestConfigBuilder.setSocketTimeout(soTimeout);
}
if (connectionTimeout != null) {
  requestConfigBuilder.setConnectTimeout(connectionTimeout);
}
method.setConfig(requestConfigBuilder.build());
{code}
So createDefaultRequestConfigBuilder doesn't respect the timeout set in solr.xml and uses a default of 60 seconds. I debugged the code, and if we simply remove these lines then the http client will use the default RequestConfig, which Solr creates with the settings specified in solr.xml.

Mark: Do you remember the motivation for overriding the defaults from the update shard handler's HttpClient and explicitly specifying a RequestConfig in CUSC? Happy to track this in a separate Jira.
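For reference, the expanded check could look roughly like this. This is only a sketch (hypothetical class name), not the exact code from the attached patch:
{code:java}
import java.net.SocketException;
import java.net.SocketTimeoutException;
import org.apache.http.NoHttpResponseException;

// Sketch only: mirrors the retriable-exception set discussed above.
public class RetriableExceptionCheck {
  // SocketTimeoutException extends InterruptedIOException rather than SocketException,
  // so it is not covered by an "instanceof SocketException" check and needs its own entry.
  static boolean isRetriableException(Throwable t) {
    return t instanceof SocketException
        || t instanceof SocketTimeoutException
        || t instanceof NoHttpResponseException;
  }
}
{code}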
> Connection Reset Causing LIR
> ----------------------------
>
>                 Key: SOLR-11881
>                 URL: https://issues.apache.org/jira/browse/SOLR-11881
>             Project: Solr
>          Issue Type: Bug
>   Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>            Priority: Major
>         Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, SOLR-11881.patch
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update gets a connection reset like this, the leader will initiate LIR:
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX r:core_node56 collection_shardX_replicaY] o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> 	at java.net.SocketInputStream.read(SocketInputStream.java:210)
> 	at java.net.SocketInputStream.read(SocketInputStream.java:141)
> 	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> 	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> 	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> 	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> 	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> 	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> 	at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> 	at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> 	at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> 	at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> 	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> 	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> 	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> 	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> 	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> 	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> 	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931, Mark says: "On a heavy working SolrCloud cluster, even a rare response like this from a replica can cause a recovery and heavy cluster disruption."
> Looking at SOLR-6931, we added an http retry handler there, but we only retry GET requests, and updates are POST requests ({{ConcurrentUpdateSolrClient#sendUpdateStream}}).
> Update requests between the leader and the replica should be retry-able since they have been versioned.
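As background on the SOLR-6931 reference above: an HttpClient retry handler that only replays GETs has roughly this shape, which is why POSTed leader -> replica updates are never retried by it. This is an illustrative sketch with a hypothetical class name and an assumed retry limit, not the actual Solr handler:
{code:java}
import java.io.IOException;
import org.apache.http.HttpRequest;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.protocol.HttpContext;

// Illustrative sketch: a retry handler that replays only idempotent GET requests.
public class GetOnlyRetryHandler implements HttpRequestRetryHandler {
  private static final int MAX_RETRIES = 3; // assumed limit for the sketch

  @Override
  public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
    if (executionCount > MAX_RETRIES) {
      return false;
    }
    HttpRequest request = HttpClientContext.adapt(context).getRequest();
    // Update requests are POSTs, so they fall through here and are not retried.
    return "GET".equalsIgnoreCase(request.getRequestLine().getMethod());
  }
}
{code}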