[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569247#comment-16569247 ]

Tomás Fernández Löbbe commented on SOLR-11881:

Uploaded a new patch. In the latest patch I'm calling {{blockAndDoRetries}} before distributing the DBQ. I also reduced the number of retries on standard requests from 5 to 3 (I did some experimentation and saw that the majority of requests either succeed or fail after the first couple of requests). I'll do another check of the ChaosMonkey tests and commit this if I see no errors.

> Connection Reset Causing LIR
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Varun Thacker
> Assignee: Varun Thacker
> Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, SOLR-11881.patch, SOLR-11881.patch
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update gets a connection like this, the leader will initiate LIR:
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX r:core_node56 collection_shardX_replicaY] o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
>     at java.net.SocketInputStream.read(SocketInputStream.java:210)
>     at java.net.SocketInputStream.read(SocketInputStream.java:141)
>     at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
>     at sun.security.ssl.InputRecord.read(InputRecord.java:503)
>     at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
>     at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
>     at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
>     at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
>     at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
>     at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
>     at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
>     at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
>     at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
>     at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
>     at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
>     at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy working SolrCloud cluster, even a rare response like this from a replica can cause a recovery and heavy cluster disruption".
> Looking at SOLR-6931, we added an HTTP retry handler, but we only retry on GET requests. Updates are POST requests ({{ConcurrentUpdateSolrClient#sendUpdateStream}}).
> Update requests between the leader and replica should be retry-able since they have been versioned.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
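The retry idea discussed in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patch: {{UpdateRetrySketch}}, {{UpdateAttempt}} and {{sendWithRetries}} are hypothetical names, and the limit of 3 simply mirrors the retry count mentioned in the comment. It assumes resending is safe because leader -> replica updates are versioned.

```java
import java.io.IOException;
import java.net.SocketException;

/** Hypothetical helper, not part of Solr: retries an update when the connection is reset. */
final class UpdateRetrySketch {
    private UpdateRetrySketch() {}

    /** One attempt at sending an update to a replica (illustrative interface). */
    interface UpdateAttempt {
        void send() throws IOException;
    }

    /**
     * Sends the update, retrying up to maxRetries times on SocketException.
     * Returns the number of retries that were needed. When the retries are
     * exhausted the last exception is rethrown, so the caller can decide
     * whether to put the replica into recovery.
     */
    static int sendWithRetries(UpdateAttempt attempt, int maxRetries) throws IOException {
        int retries = 0;
        while (true) {
            try {
                attempt.send();
                return retries;
            } catch (SocketException e) {
                if (retries >= maxRetries) {
                    throw e; // out of retries: surface the failure
                }
                retries++; // assumed safe to resend because the update is versioned
            }
        }
    }
}
```

A caller would wrap the actual POST in an {{UpdateAttempt}} and only fall back to recovery when {{sendWithRetries}} throws.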
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472501#comment-16472501 ]

Mark Miller commented on SOLR-11881:

Yeah, it's versioned; that is being counted on in SOLR-12305 as well. My bigger concern with retries would be stuff we didn't think of - I know it's much trickier than the other distributed commands in terms of what ramifications changes have.

bq. and then call the {{cmdDistrib.blockAndDoRetries();}}

Why don't we call that first? Isn't this just the same case as a commit? Commit has to blockAndDoRetries first, to make sure it applies to all previous updates.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472495#comment-16472495 ]

Tomás Fernández Löbbe commented on SOLR-11881:

I was looking into DBQ a bit more. It is actually versioned (which I didn't know), so in theory we could retry it too. But there is another issue with DBQs. In the {{doDeleteByQuery}} method we have this comment:
{noformat}
// NONE: we are the first to receive this deleteByQuery
//       - it must be forwarded to the leader of every shard
// TO:   we are a leader receiving a forwarded deleteByQuery... we must:
//       - block all updates (use VersionInfo)
//       - flush *all* updates going to our replicas
//       - forward the DBQ to our replicas and wait for the response
//       - log + execute the local DBQ
// FROM: we are a replica receiving a DBQ from our leader
//       - log + execute the local DBQ
{noformat}
However, that's not what the code is doing now, not in the leader at least. We block, run locally (like we do with other operations), unblock, then we send the DBQ to followers by calling {{cmdDistrib.distribDelete(cmd, replicas, params, *false*, rollupReplicationTracker, leaderReplicationTracker);}}, and then call {{cmdDistrib.blockAndDoRetries();}}.

The problem with that is that inside the cmdDistrib things can be reordered (even more so now that we are adding retries to updates). The DBQ needs to be the last request to be executed, otherwise it can miss docs. I think that call to {{cmdDistrib.distribDelete}} needs to be {{synchronous=true}}; that way we'll flush (and retry) all updates before sending the DBQ, then send the DBQ and flush, and then continue.

I'll try to work on a test for that, but some feedback would be great. [~ysee...@gmail.com]
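The ordering concern in this comment can be illustrated with a toy model. This is a sketch under assumptions, not Solr code: {{Replica}} and {{Distributor}} are hypothetical stand-ins, the method names are borrowed from the comment only as labels, and "flush" here just means delivering buffered operations in order.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not Solr code: pending updates are flushed before a delete-by-query is sent. */
final class DbqOrderingSketch {
    private DbqOrderingSketch() {}

    /** Stand-in for a replica: records operations in the order they arrive. */
    static final class Replica {
        final List<String> applied = new ArrayList<>();
    }

    /** Stand-in for the command distributor: buffers operations until they are flushed. */
    static final class Distributor {
        private final Replica replica;
        private final List<String> pending = new ArrayList<>();

        Distributor(Replica replica) {
            this.replica = replica;
        }

        /** Buffers an add; it is not delivered until the next flush. */
        void distribAdd(String docId) {
            pending.add("add:" + docId);
        }

        /** Plays the role of blockAndDoRetries: delivers everything still buffered. */
        void blockAndDoRetries() {
            replica.applied.addAll(pending);
            pending.clear();
        }

        /** Flush first, then send the DBQ, then flush again so the DBQ itself is delivered. */
        void distribDeleteByQuery(String query) {
            blockAndDoRetries();
            pending.add("dbq:" + query);
            blockAndDoRetries();
        }
    }
}
```

With this ordering the replica always sees every earlier add before the delete-by-query, which is the property the comment says the DBQ depends on.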
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466856#comment-16466856 ]

Mark Miller commented on SOLR-11881:

Patch looks nice [~tomasflobbe] - very clean.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466841#comment-16466841 ]

Mark Miller commented on SOLR-11881:

Yeah, I'd be hesitant to add retries to DBQ without input from [~ysee...@gmail.com].
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466833#comment-16466833 ]

Tomás Fernández Löbbe commented on SOLR-11881:

I updated the CR with a new patch. Added a test for minRf, but this is more deeply tested in ReplicationFactorTest (that test now takes longer because of the retries; I'm thinking of either making the wait time configurable or modifying it for test purposes only). ReplicationFactorTest is marked as {{@BadApple}} pointing to SOLR-6944; this retry logic will probably fix that one. I haven't seen failures of that test so far.

There is one nocommit in the code: I'm wondering if we want to keep the retries for DBQs. I'm thinking of setting the retry count for DBQs to 0, since those are not versioned AFAIK.

Another thing I noticed is that we sleep after each error retried (so if we need to retry two requests to two hosts, we sleep before the first request, and sleep again before the second one). This seems odd; I think we want to sleep once before retrying a batch of errors. I won't be changing this here though, I'll create a new Jira for that.

I'll be running some tests with the current patch. Feel free to review and let me know if you have any thoughts.
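The "sleep once per batch of errors" idea from this comment could look roughly like the sketch below. It is an assumption-laden illustration, not the behaviour of the patch: {{BatchRetrySketch}} and {{retryInBatches}} are hypothetical names, and the pause is injected as a callback so the sketch can be exercised without really sleeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;
import java.util.function.Predicate;

/** Hypothetical sketch, not Solr code: pause once per retry round instead of once per failed request. */
final class BatchRetrySketch {
    private BatchRetrySketch() {}

    /**
     * Retries the failed requests in rounds. Each round pauses once, then
     * resends every request that is still failing. Returns the requests
     * that are still failing after maxRetries rounds.
     */
    static <T> List<T> retryInBatches(List<T> failed, Predicate<T> resend,
                                      int maxRetries, long pauseMillis, LongConsumer sleeper) {
        List<T> remaining = new ArrayList<>(failed);
        for (int round = 0; round < maxRetries && !remaining.isEmpty(); round++) {
            sleeper.accept(pauseMillis); // one pause for the whole batch
            List<T> stillFailing = new ArrayList<>();
            for (T request : remaining) {
                if (!resend.test(request)) {
                    stillFailing.add(request);
                }
            }
            remaining = stillFailing;
        }
        return remaining;
    }
}
```

With two failed requests to two hosts that both succeed on the first retry, the sleeper callback runs once rather than twice.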
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464563#comment-16464563 ]

Mark Miller commented on SOLR-11881:

bq. if we don't set the timeout on the HttpPost request by setting a request config

Cool - the tricky part is, if only one of the two properties is overridden on the client itself on the fly, we still want to pick up the default for the other one.
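The "override one timeout, keep the default for the other" case described above can be sketched like this. It is a plain-Java illustration under assumptions, not the SolrJ or HttpClient API: {{TimeoutDefaultsSketch}}, {{Timeouts}} and {{resolve}} are hypothetical names, and a {{null}} override is assumed to mean "not set on the client".

```java
/** Hypothetical sketch, not SolrJ code: merge per-client timeout overrides with configured defaults. */
final class TimeoutDefaultsSketch {
    private TimeoutDefaultsSketch() {}

    /** Pair of timeouts in milliseconds (illustrative type). */
    static final class Timeouts {
        final int connectMillis;
        final int socketMillis;

        Timeouts(int connectMillis, int socketMillis) {
            this.connectMillis = connectMillis;
            this.socketMillis = socketMillis;
        }
    }

    /**
     * Resolves each timeout independently: an override wins when present,
     * otherwise the default for that particular property is used.
     */
    static Timeouts resolve(Integer connectOverride, Integer socketOverride, Timeouts defaults) {
        int connect = connectOverride != null ? connectOverride : defaults.connectMillis;
        int socket = socketOverride != null ? socketOverride : defaults.socketMillis;
        return new Timeouts(connect, socket);
    }
}
```

Resolving the two properties separately avoids the trap where setting only the connect timeout silently discards the configured socket timeout.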
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464557#comment-16464557 ]

Varun Thacker commented on SOLR-11881:

{quote}
I think if timeouts are null, we need to try and pull them from the httpclient?
{quote}
I followed the code: if we don't set the timeout on the HttpPost request through a request config, the client falls back to its default request config. In our case we set the default request config while creating the httpclient in HttpClientUtil#setupBuilder, so it will use the values defined in the solr.xml file. I'll file a separate Jira right now.
{code:java}
RequestConfig requestConfig = requestConfigBuilder.build();
HttpClientBuilder retBuilder = builder.setDefaultRequestConfig(requestConfig);
{code}
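The fallback discussed in this comment (use a per-request timeout only when one was explicitly set, otherwise inherit the client defaults) can be sketched in plain Java. This is a minimal illustration, not the HttpClient or Solr implementation: {{Timeouts}}, {{TimeoutResolver}} and the numeric values are hypothetical names and numbers chosen for the example.

```java
/** Hypothetical stand-in for a pair of HTTP timeouts; not a Solr or HttpClient class. */
final class Timeouts {
    final int connectMillis;
    final int socketMillis;

    Timeouts(int connectMillis, int socketMillis) {
        this.connectMillis = connectMillis;
        this.socketMillis = socketMillis;
    }
}

/** Hypothetical helper showing "null override means inherit the client default". */
final class TimeoutResolver {
    private TimeoutResolver() {}

    static Timeouts resolve(Timeouts clientDefaults, Integer connectOverride, Integer socketOverride) {
        // A non-null override wins; null falls through to the value the client was built with.
        int connect = connectOverride != null ? connectOverride : clientDefaults.connectMillis;
        int socket = socketOverride != null ? socketOverride : clientDefaults.socketMillis;
        return new Timeouts(connect, socket);
    }

    public static void main(String[] args) {
        // Example values only; assume they came from solr.xml.
        Timeouts configured = new Timeouts(15_000, 100);
        Timeouts effective = resolve(configured, null, null);
        System.out.println(effective.connectMillis + " ms connect, " + effective.socketMillis + " ms socket");
    }
}
```

The point of the sketch is that a hard-coded default must never be substituted for a missing override, otherwise the configured value is silently ignored.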
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464539#comment-16464539 ]

Mark Miller commented on SOLR-11881:

[~varunthacker], when we updated the httpclient we could no longer set these things after construction, but our client API let you change these settings on the fly (I see those methods are deprecated now - good). So this was to not break that. I think if timeouts are null, we need to try and pull them from the httpclient?
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464533#comment-16464533 ]

Tomás Fernández Löbbe commented on SOLR-11881:

I believe my current patch breaks the "minRf" behavior. I'll take a look and add a test for that.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464502#comment-16464502 ]

Varun Thacker commented on SOLR-11881:

Recently I've been seeing this problem in the following form:
* The replica gets a ReadPendingException from Jetty
{code:java}
date time WARN [qtp768306356-580185] ? (:) - java.nio.channels.ReadPendingException: null
at org.eclipse.jetty.io.FillInterest.register(FillInterest.java:58) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.AbstractEndPoint.fillInterested(AbstractEndPoint.java:353) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
{code}
* The leader keeps waiting until the socket timeout, then gets a socket timeout exception and puts the replica into recovery

So I took Tomás's latest patch and added SocketTimeoutException to the {{isRetriableException}} check.

Q: Which exceptions should we retry on? Currently the patch has SocketException / NoHttpResponseException.

Once I had added SocketTimeoutException as a retriable exception, I set the socket timeout to 100ms and sent updates to manually test whether Solr retries correctly. To my surprise I was never able to hit a socket timeout exception. After some debugging, here's why.

In ConcurrentUpdateSolrClient we do this:
{code:java}
org.apache.http.client.config.RequestConfig.Builder requestConfigBuilder = HttpClientUtil.createDefaultRequestConfigBuilder();
if (soTimeout != null) {
  requestConfigBuilder.setSocketTimeout(soTimeout);
}
if (connectionTimeout != null) {
  requestConfigBuilder.setConnectTimeout(connectionTimeout);
}
method.setConfig(requestConfigBuilder.build());
{code}
So createDefaultRequestConfigBuilder doesn't respect the timeout set in solr.xml and uses a default of 60 seconds.

I debugged the code, and if we simply remove these lines then the http-client will use the default requestConfig which Solr creates with the settings specified in the solr.xml file.

Mark: Do you remember the motivation for overriding the defaults from the update shard handler's httpclient and explicitly specifying a RequestConfig in CUSC? Happy to track this in a separate Jira.
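On the question of which exceptions to retry on, one JDK detail matters: {{java.net.SocketTimeoutException}} extends {{java.io.InterruptedIOException}}, not {{SocketException}}, so a check that only tests for {{SocketException}} will not match it. The sketch below is an assumption about how such a check could look, not the code from the patch. {{RetriableCheck}} is a hypothetical class, and {{NoHttpResponseException}} is left out only because it lives in the HttpClient library rather than the JDK.

```java
import java.net.ConnectException;
import java.net.SocketException;
import java.net.SocketTimeoutException;

/** Hypothetical helper; the method name echoes the thread, the body is an assumption. */
final class RetriableCheck {
    private RetriableCheck() {}

    static boolean isRetriable(Throwable t) {
        // SocketException covers "Connection reset", "Broken pipe" and its subclass ConnectException.
        if (t instanceof SocketException) {
            return true;
        }
        // SocketTimeoutException is not a SocketException, so it needs its own branch.
        return t instanceof SocketTimeoutException;
    }

    public static void main(String[] args) {
        System.out.println(isRetriable(new SocketException("Connection reset")));
        System.out.println(isRetriable(new ConnectException("Connection refused")));
        System.out.println(isRetriable(new SocketTimeoutException("Read timed out")));
        System.out.println(isRetriable(new IllegalStateException("not an I/O failure")));
    }
}
```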
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464325#comment-16464325 ]

Tomás Fernández Löbbe commented on SOLR-11881:

Uploaded a new patch to https://reviews.apache.org/r/66967/
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461686#comment-16461686 ]

Mark Miller commented on SOLR-11881:

Yeah, now I remember why forward has a retry and why it was so high: same issue, to survive chaos monkey tests, even when you run them longer and run them over and over. So at least for the forwarding, I wouldn't lower it much without good confidence from beasting chaos monkey tests for a good amount of time (default test run times are somewhat low).

Basically, update forwarding to the leader allows the cloud client to fall back to sending to non-leaders and get held up, rather than having those updates fail and forcing the user to resolve it. Perhaps the client should just block updates itself for a while, waiting to see a leader, but then it needs special logic. Right now even a PHP client could take advantage of this by just falling back to sending updates to non-leaders while failover happens.

I have no problem with updates from leader to replica retrying less.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461629#comment-16461629 ]

Mark Miller commented on SOLR-11881:

Yeah, good catch - we are not versioned yet on forward.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461499#comment-16461499 ]

Tomás Fernández Löbbe commented on SOLR-11881:

bq. Yes it was, because it can happen mid request

Ah! Good point. So we probably still don't want to retry on those for the forwards, but we are OK with retrying on the FROMLEADER requests...
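The distinction drawn here (a mid-request socket failure is safe to retry on versioned leader-to-replica updates, but not on forwards, which are not versioned yet) could be expressed as a per-request-type decision. The following is a sketch under stated assumptions, not the patch: {{RetryDecision}} and {{RequestKind}} are hypothetical names, and retrying forwards only on {{ConnectException}} (where the connection was never established, so the request cannot have been processed) is an illustrative choice.

```java
import java.net.ConnectException;
import java.net.SocketException;

/** Hypothetical sketch of choosing a retry rule by request type. */
final class RetryDecision {
    private RetryDecision() {}

    enum RequestKind { FORWARD_TO_LEADER, FROM_LEADER }

    static boolean shouldRetry(RequestKind kind, Throwable error) {
        switch (kind) {
            case FROM_LEADER:
                // Versioned update: resending is safe even if the first attempt was applied.
                return error instanceof SocketException;
            case FORWARD_TO_LEADER:
                // Assumption: not versioned, so only retry when no connection was ever made.
                return error instanceof ConnectException;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        Throwable reset = new SocketException("Connection reset");
        System.out.println("FROM_LEADER retry on reset: " + shouldRetry(RequestKind.FROM_LEADER, reset));
        System.out.println("FORWARD retry on reset: " + shouldRetry(RequestKind.FORWARD_TO_LEADER, reset));
    }
}
```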
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461487#comment-16461487 ]

Mark Miller commented on SOLR-11881:

bq. But then I was looking at the ChaosMonkey logs and the amount of success after retries increased a lot in retries 5 to 10.

Yeah, okay, it's probably waiting for failover. I guess that is fine. That is probably how it went so high to begin with: allowing the forward-to-leader requests to wait for a new leader.

bq. I'm not sure if this was done with SocketException intentionally

Yes it was, because it can happen mid-request and we don't know if the request failed or succeeded. Given we are counting on versions for retry though, this actually shouldn't matter, so that should be fine.
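The point above about counting on versions can be illustrated with a small sketch. It assumes a replica that drops any update whose version is not newer than the one it already holds; `VersionedStore` and its methods are hypothetical names for illustration, not Solr's actual update code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a replica's index; not Solr's actual update path. */
public class VersionedStore {
    private final Map<String, Long> versions = new HashMap<>();
    private final Map<String, String> docs = new HashMap<>();

    /**
     * Applies an update only if its version is newer than the stored one.
     * Returns true if the store changed, false if the update was dropped.
     */
    public boolean apply(String id, long version, String body) {
        Long current = versions.get(id);
        if (current != null && current >= version) {
            return false; // duplicate or stale delivery: safe to ignore
        }
        versions.put(id, version);
        docs.put(id, body);
        return true;
    }

    public String get(String id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        VersionedStore replica = new VersionedStore();
        System.out.println(replica.apply("doc1", 100L, "v100")); // true: first delivery
        System.out.println(replica.apply("doc1", 100L, "v100")); // false: retry of the same update
        System.out.println(replica.get("doc1"));                 // v100
    }
}
```

Under that assumption, it does not matter whether the first attempt reached the replica before the connection broke: resending the same versioned update leaves the replica in the same state.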
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461455#comment-16461455 ]

Tomás Fernández Löbbe commented on SOLR-11881:

bq. What's the logic for removing the retry on that?

Not removing: {{ConnectException}} is a {{SocketException}}, so it should still be retried. Things like "broken pipe" are SocketExceptions and I think it should be fine to retry those too. One thing though: I noticed that in {{SolrCmdDistributorTest}} there is a test case that explicitly validates that we don't retry on {{SocketException}}. I'm not sure if this was done with SocketException intentionally (because there is something I'm missing about this error case) or if it is just an example of an exception that was not retried on.

bq. I think something like 3 is good

That was my original plan too. But then I was looking at the ChaosMonkey logs and the amount of success after retries increased a lot in retries 5 to 10. I know this is just a synthetic situation but it's the best I have now. I'm also thinking in terms of time spent in retries: we wait 500 ms between retries, and 2.5 secs doesn't sound too bad if the consequence is saving Solr from a recovery. The impact, on the other hand, is slower updates in cases where a single replica is slow/faulty. Maybe this should be made configurable?
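Two claims in the comment above are easy to check in isolation: that a {{ConnectException}} is matched by a {{SocketException}} check, and that 5 retries with a 500 ms pause cost at most 2.5 seconds of waiting. The sketch below uses only JDK classes; `RetryBudget` and its method names are made up for illustration.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketException;

public class RetryBudget {
    /** True for socket-level failures, including subclasses such as ConnectException. */
    public static boolean isSocketFailure(IOException e) {
        return e instanceof SocketException;
    }

    /** Worst-case time spent pausing between attempts, in milliseconds. */
    public static long worstCaseWaitMillis(int retries, long pauseMillis) {
        return retries * pauseMillis;
    }

    public static void main(String[] args) {
        System.out.println(isSocketFailure(new ConnectException("Connection refused"))); // true
        System.out.println(isSocketFailure(new SocketException("Broken pipe")));         // true
        System.out.println(worstCaseWaitMillis(5, 500));  // 2500
        System.out.println(worstCaseWaitMillis(10, 500)); // 5000
    }
}
```

The figures cover only the pauses; the time each failed attempt itself takes (for example a connect timeout) comes on top of that.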
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461353#comment-16461353 ]

Mark Miller commented on SOLR-11881:

Cool, looks like the right approach.

bq. ConnectException

What's the logic for removing the retry on that?

bq. I plan to reduce the number of retries

I think something like 3 is good, I think that is what we use at the HttpClient level.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460289#comment-16460289 ]

Tomás Fernández Löbbe commented on SOLR-11881:

Attached a rough patch that handles the retry in SolrCmdDistributor:
* I only added retry to requests from the leader to its replicas.
* I didn't add any tests yet; I've been running the ChaosMonkey to see how the retries behave.
* I changed the retry exception from only {{ConnectException}} to {{SocketException}} or {{NoHttpResponseException}}.
* I plan to reduce the number of retries for this case (25 sounds like a lot; I was thinking of 5 or 10 max, but I'm open to suggestions).

[~varunthacker], [~markrmil...@gmail.com] let me know what you think
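The retry behaviour described in the list above can be sketched as a bounded loop that resends only when the failure matches a caller-supplied predicate. This is not the attached patch: `BoundedRetry`, its parameters, and the predicate are assumptions for illustration, and {{NoHttpResponseException}} is left out so the example needs nothing beyond the JDK.

```java
import java.net.SocketException;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

public class BoundedRetry {
    /**
     * Runs the request, retrying up to maxRetries times when the failure is
     * retryable, and pausing pauseMillis between attempts. Any other failure,
     * or a retryable one after the budget is spent, is rethrown to the caller.
     */
    public static <T> T call(Callable<T> request, Predicate<Exception> retryable,
                             int maxRetries, long pauseMillis) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return request.call();
            } catch (Exception e) {
                if (retries >= maxRetries || !retryable.test(e)) {
                    throw e; // give up: the caller decides what happens next
                }
                retries++;
                Thread.sleep(pauseMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        String result = call(() -> {
            // Simulated replica that resets the first two connections.
            if (++attempts[0] < 3) throw new SocketException("Connection reset");
            return "ok";
        }, e -> e instanceof SocketException, 5, 10);
        System.out.println(result + " after " + attempts[0] + " attempts"); // ok after 3 attempts
    }
}
```

Passing the predicate in keeps the question of which exceptions to retry separate from the question of how many times, which are the two knobs being debated in this thread.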
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458341#comment-16458341 ]

Varun Thacker commented on SOLR-11881:

> For the second exception here is something similar:
> [https://github.com/eclipse/jetty.project/issues/1047]

Yeah, that issue seems to have been fixed, but here's the mailing list thread where the Jetty folks pointed me to a bug with SSL/async: [https://dev.eclipse.org/mhonarc/lists/jetty-dev/msg03165.html]
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458338#comment-16458338 ]

Mark Miller commented on SOLR-11881:

Another possibility (beyond some different bug) for the second one is that it's reusing connections and there is a case where we don't fully clear the connection streams. Normally, Jetty will just not reuse that connection if that happens; perhaps SSL with this new async stuff can hit this if that happens though. Just a guess, and a reminder that I should review our code that ensures everyone is fully reading streams.
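For readers unfamiliar with what "fully reading streams" means in practice, here is a minimal sketch of draining a response body before closing it, so that no unread bytes are left on a connection that might be reused. `StreamDrain` is a hypothetical helper using only JDK classes; it is not the Solr code being referred to, and whether unread streams are involved in this issue is, as the comment says, only a guess.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamDrain {
    /** Reads and discards everything left in the stream; returns the number of bytes dropped. */
    public static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for an HTTP response body.
        try (InputStream body = new ByteArrayInputStream(new byte[20000])) {
            body.read(new byte[100]);        // caller consumed only part of the response
            System.out.println(drain(body)); // 19900: the rest is discarded before close
        }
    }
}
```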
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458334#comment-16458334 ]

Mark Miller commented on SOLR-11881:

For the second exception here is something similar: https://github.com/eclipse/jetty.project/issues/1047

It looks like that's fixed in jetty-9.3.17.v20170317 though.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458322#comment-16458322 ]

Mark Miller commented on SOLR-11881:

[~varunthacker], look at the exception in the summary, you can almost be sure this is SOLR-12290. It's trying to just start the connection and on the first read finds out it's closed. This is the normal signature for when a server connection gets improperly closed.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458319#comment-16458319 ]
Mark Miller commented on SOLR-11881:
bq. I’m not sure we can know that updates before the failure were consumed (even if we call flush for each one)
Yeah, so we are sending individual requests on a single connection. Each request is good if you get a good response, and requests are fully serial per connection. So the information is there; the question is just how hard it is to use the way we need, given the current code.
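The point about serial requests can be illustrated with a minimal sketch. It assumes a hypothetical `sendAndAwaitOk` callback that sends one update and returns true only when a successful response comes back; neither the class nor the callback is part of Solr. Because requests are serial per connection, the first update without a good response marks where any retry would have to start.

```java
import java.util.List;
import java.util.function.Predicate;

/** Illustrative only: not Solr code. All names here are hypothetical. */
final class SerialAckSketch {

    /**
     * Sends updates one at a time and stops at the first one that does not
     * get a successful response.
     *
     * @return the index of the first unacknowledged update, or
     *         updates.size() if every update was acknowledged
     */
    static <T> int firstUnacknowledged(List<T> updates, Predicate<T> sendAndAwaitOk) {
        for (int i = 0; i < updates.size(); i++) {
            // Requests are serial, so everything before index i is known good.
            if (!sendAndAwaitOk.test(updates.get(i))) {
                return i;
            }
        }
        return updates.size();
    }
}
```

Everything before the returned index is known to have been accepted. How to recover that index from the existing streaming code is the open question raised in the comment.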
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458316#comment-16458316 ]
Mark Miller commented on SOLR-11881:
SOLR-12290 may help with the cause of the connection reset. Hard to say though; SSL has different connection semantics, I believe, and we have not looked into how fully reusable SSL connections are.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457345#comment-16457345 ]
Mark Miller commented on SOLR-11881:
I think we may be able to use the RetryNode stuff that update forwarding to the leader does, or something similar. I’d have to refresh my memory, but I believe if we send an update, even streaming, we get a response for that update. If it’s a success, we are good. I don’t think we want to retry within the ConcurrentUpdateSolrClient, but instead the way forwards do with SolrCmdDistributor.
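The idea of retrying at the distributor level rather than inside the HTTP client can be sketched as a bounded retry loop. This is not the RetryNode or SolrCmdDistributor code; `Send` and `sendWithRetries` are hypothetical names, and the sketch assumes the update is safe to resend (for example because it is versioned).

```java
import java.io.IOException;

/** Illustrative only: not the SolrCmdDistributor implementation. */
final class BoundedRetrySketch {

    /** Hypothetical stand-in for "send this one update and wait for its response". */
    interface Send {
        void run() throws IOException;
    }

    /**
     * Tries the send once, then retries up to maxRetries more times on IOException.
     *
     * @return the attempt number (1-based) that succeeded
     * @throws IOException the last failure, once all attempts are exhausted
     */
    static int sendWithRetries(Send send, int maxRetries) throws IOException {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxRetries + 1; attempt++) {
            try {
                send.run();
                return attempt;
            } catch (IOException e) {
                last = e; // e.g. "Connection reset"; remember it and try again
            }
        }
        throw last;
    }
}
```

Only after the loop gives up would the caller fall back to the existing failure path, which in the reported case is what starts LIR.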
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457273#comment-16457273 ]
Tomás Fernández Löbbe commented on SOLR-11881:
If I’m reading the code correctly, this RetryHandler is called in two cases: after an exception while trying to establish the connection, and after an exception while executing the request. Retrying in the first case should be fine; in the second, it is not so easy. The way we do streaming is by keeping the connection open and having an {{EntityTemplate}} that reads updates from a queue, writes each one, and flushes. If I understand correctly, if we want to retry the request instead of throwing an error, we need to retry the update too, by putting it back in the queue. Even then, I’m not sure we can know that updates before the failure were consumed (even if we call flush for each one).
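The two cases described above can be written down as a small decision function. This is a sketch under stated assumptions, not the retry handler from SOLR-6931: `Phase`, `shouldRetry`, and the `bodyRepeatable` flag are hypothetical names used only to make the distinction concrete.

```java
/** Illustrative only: models the connect-phase vs. execute-phase distinction. */
final class RetryDecisionSketch {

    /** Hypothetical marker for where the IOException occurred. */
    enum Phase {
        CONNECT, // failed while establishing the connection; nothing was sent
        EXECUTE  // failed after the request had started executing
    }

    /**
     * @param phase          where the failure happened
     * @param bodyRepeatable whether the request body can be written again from the start
     * @param attempt        1-based count of attempts made so far
     * @param maxRetries     upper bound on attempts that may be followed by a retry
     */
    static boolean shouldRetry(Phase phase, boolean bodyRepeatable, int attempt, int maxRetries) {
        if (attempt > maxRetries) {
            return false;
        }
        if (phase == Phase.CONNECT) {
            // No bytes reached the replica, so resending cannot duplicate anything.
            return true;
        }
        // Mid-request failure: only safe if the whole body can be replayed.
        return bodyRepeatable;
    }
}
```

A streamed body fed from a queue is the `bodyRepeatable == false` case, which is why the second case is the hard one.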
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457124#comment-16457124 ]
Mark Miller commented on SOLR-11881:
bq. ConcurrentUpdateSolrClient#sendUpdateStream
Sounds right. It should be using JavaBin, and we only stream. It's the only way to do efficient high-volume indexing. The only case where you can really get away with not doing it is if you know each request carries a single document. That's how things used to work (even if you batched or streamed to the leader, it was split up into one document per request), but it only works well under low load.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457076#comment-16457076 ]
Varun Thacker commented on SOLR-11881:
Hi Mark, ConcurrentUpdateSolrClient#sendUpdateStream is the relevant code sending the update from leader->replica, right? I don't know this piece of code very closely, but do we only stream for XML?
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456848#comment-16456848 ]
Mark Miller commented on SOLR-11881:
Maybe we need to implement our own retry in the distrib update handler.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456842#comment-16456842 ]
Mark Miller commented on SOLR-11881:
It's been a while since I've thought about this; my mind is beginning to churn again. Does this even help? We stream updates from leader to the client, and streaming cannot be retried; you'd have to buffer the stream or something. It gets a can't-retry exception.
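The remark about buffering the stream can be made concrete with a small sketch. It assumes the request body arrives as a one-shot `InputStream`; `BufferedBodySketch` is a hypothetical name, not a Solr or HttpClient class, and `readAllBytes` requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative only: turns a one-shot body into one that can be replayed. */
final class BufferedBodySketch {

    private final byte[] bytes;

    /** Drains the one-shot stream into memory once, up front. */
    BufferedBodySketch(InputStream oneShot) throws IOException {
        this.bytes = oneShot.readAllBytes();
    }

    /** Each call returns a fresh stream over the same bytes, so a retry can resend them. */
    InputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }
}
```

The cost is that the whole body sits in memory before anything is sent, which gives up the benefit of streaming for high-volume indexing. That trade-off is presumably why buffering is mentioned only tentatively.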
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456743#comment-16456743 ]
Mark Miller commented on SOLR-11881:
bq. is there anything that uses the update client where a retry would be a problem?
Hmm, perhaps when a replica forwards an update to the leader.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456742#comment-16456742 ]

Mark Miller commented on SOLR-11881:

We should also consider turning retry on for the read side. It's only done on IOException, and you might not have another replica to retry to.

For the update side, I'm wondering if we should not just turn it on in general. We explicitly exclude admin requests from retry; is there anything that uses the update client where a retry would be a problem?
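The retry rules discussed in this thread can be sketched as a small decision function. This is only an illustration, not Solr or HttpClient code: the class `UpdateRetryPolicySketch`, the `RequestKind` enum, and the `shouldRetry` method are hypothetical names. It assumes that retries happen only on `IOException`, that admin requests stay excluded, and that a replica forwarding an update to the leader is treated as not safe to retry, which is the concern raised elsewhere in the thread. The cap of 3 used in `main` mirrors the retry count mentioned in the thread.

```java
import java.io.IOException;
import java.net.SocketException;

/** Hypothetical sketch of a retry decision; not the actual Solr implementation. */
final class UpdateRetryPolicySketch {

    /** Assumed request categories, named for illustration only. */
    enum RequestKind { QUERY, LEADER_TO_REPLICA_UPDATE, FORWARD_TO_LEADER, ADMIN }

    private final int maxRetries;

    UpdateRetryPolicySketch(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /**
     * Decides whether a failed request may be sent again.
     *
     * @param failure       the exception thrown by the failed attempt
     * @param retriesSoFar  number of retries already performed
     * @param kind          what sort of request failed
     */
    boolean shouldRetry(Exception failure, int retriesSoFar, RequestKind kind) {
        if (retriesSoFar >= maxRetries) {
            return false; // retry budget exhausted
        }
        if (!(failure instanceof IOException)) {
            return false; // only I/O failures such as "Connection reset" qualify
        }
        switch (kind) {
            case QUERY:                    // reads have no side effects
            case LEADER_TO_REPLICA_UPDATE: // versioned, so resending is assumed safe
                return true;
            case FORWARD_TO_LEADER:        // assumed not yet versioned: do not resend
            case ADMIN:                    // admin requests are excluded from retry
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        UpdateRetryPolicySketch policy = new UpdateRetryPolicySketch(3);
        IOException reset = new SocketException("Connection reset");
        for (RequestKind kind : RequestKind.values()) {
            System.out.println(kind + " -> " + policy.shouldRetry(reset, 0, kind));
        }
    }
}
```

Keeping the decision in one place makes it easy to see which request kinds are considered safe to resend, and to tighten or loosen that list without touching the transport code.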
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456702#comment-16456702 ]

Mark Miller commented on SOLR-11881:

bq. Update requests between the leader and replica should be retry-able since they have been versioned.

Yes, nice, this was a big miss. It's the user client that can't easily auto retry.
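Why versioning makes a leader-to-replica update safe to resend can be shown with a toy model. The sketch below is not Solr code; `VersionedReplicaSketch` and its methods are hypothetical. It assumes the leader stamps each update with a version and that the receiver ignores any update whose version is not newer than what it already holds, so a duplicate delivery caused by a retry changes nothing.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a replica; not the actual Solr update path. */
final class VersionedReplicaSketch {

    // Highest version applied so far, per document id.
    private final Map<String, Long> versions = new HashMap<>();
    // Current document bodies, per document id.
    private final Map<String, String> docs = new HashMap<>();

    /**
     * Applies an update stamped with a leader-assigned version.
     *
     * @return true if the update changed state, false if it was a duplicate
     *         or older than the version the replica already holds
     */
    boolean apply(String id, long version, String body) {
        Long current = versions.get(id);
        if (current != null && current >= version) {
            return false; // already applied (for example, a retried request): ignore
        }
        versions.put(id, version);
        docs.put(id, body);
        return true;
    }

    /** Returns the stored body for the id, or null if the id is unknown. */
    String get(String id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        VersionedReplicaSketch replica = new VersionedReplicaSketch();
        // First delivery is applied; the retried copy of the same update is ignored.
        System.out.println("first delivery applied: " + replica.apply("doc1", 100L, "hello"));
        System.out.println("retried delivery applied: " + replica.apply("doc1", 100L, "hello"));
        System.out.println("stored body: " + replica.get("doc1"));
    }
}
```

A client request that has not yet been assigned a version has no such guard, which is consistent with the remark above that the user client cannot easily retry automatically.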
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456691#comment-16456691 ]

Tomás Fernández Löbbe commented on SOLR-11881:

+1
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452907#comment-16452907 ]

Varun Thacker commented on SOLR-11881:

We ran into another such scenario where a SocketTimeoutException caused LIR. The replica had this in its logs:

{code:java}
date time WARN [qtp768306356-580185] ? (:) - java.nio.channels.ReadPendingException: null
    at org.eclipse.jetty.io.FillInterest.register(FillInterest.java:58) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.io.AbstractEndPoint.fillInterested(AbstractEndPoint.java:353) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.io.AbstractConnection.fillInterested(AbstractConnection.java:134) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) ~[jetty-server-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:289) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.io.ssl.SslConnection$3.succeeded(SslConnection.java:149) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124) ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247) ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140) ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382) ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708) ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626) ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0-zing_17.11.0.0]
date time WARN [qtp768306356-580185] ? (:) - Read pending for org.eclipse.jetty.server.HttpConnection$BlockingReadCallback@2e98df28 prevented AC.ReadCB@424271f8{HttpConnection@424271f8[p=HttpParser{s=START,0 of -1},g=HttpGenerator@424273ae{s=START}]=>HttpChannelOverHttp@4242713d{r=141,c=false,a=IDLE,uri=null}<-DecryptedEndPoint@4242708d{/host:52824<->/host:port,OPEN,fill=FI,flush=-,to=1/86400}->HttpConnection@424271f8[p=HttpParser{s=START,0 of -1},g=HttpGenerator@
{code}

The leader waited exactly the socket timeout period after this error and then threw a SocketTimeoutException. At that point the leader put the replica into recovery.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340481#comment-16340481 ]

Varun Thacker commented on SOLR-11881:

Hi Ere,

No, these are different issues. We should fix both, but this one is separate.
[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
[ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334040#comment-16334040 ]

Ere Maijala commented on SOLR-11881:

Does this also fix SOLR-9826?