[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-08-04 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569247#comment-16569247
 ] 

Tomás Fernández Löbbe commented on SOLR-11881:
--

Uploaded a new patch. In the latest patch I'm calling {{blockAndDoRetries}} 
before distributing the DBQ. I Also reduced the number of retries on standard 
requests from 5 to 3 (I did some experimentation I saw that the majority of 
requests either succeed or fail after the first couple requests). I'll do 
another check at the ChaosMonkey tests and commit this if I see no errors.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-11 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472501#comment-16472501
 ] 

Mark Miller commented on SOLR-11881:


Yeah, it's versioned, that is being counted on in SOLR-12305 as well. My bigger 
concern with retires would be stuff we didn't think - I know it's much tricker 
than the other distributed commands in terms of what ramifications changes have.

 bq. and then call the {{cmdDistrib.blockAndDoRetries();}}

Why don't we call that first? Isn't this just the same case as a commit? Commit 
has to blockAndDoRetries first, to make sure it applies to all previous updates.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472495#comment-16472495
 ] 

Tomás Fernández Löbbe commented on SOLR-11881:
--

I was looking into DBQ a bit more. It is actually versioned (which I didn’t 
know) so in theory we could retry too.  But there is another issue with them. 
In the {{doDeleteByQuery}} method we have this comment:
{noformat}
// NONE: we are the first to receive this deleteByQuery
//   - it must be forwarded to the leader of every shard
// TO:   we are a leader receiving a forwarded deleteByQuery... we must:
//   - block all updates (use VersionInfo)
//   - flush *all* updates going to our replicas
//   - forward the DBQ to our replicas and wait for the response
//   - log + execute the local DBQ
// FROM: we are a replica receiving a DBQ from our leader
//   - log + execute the local DBQ
{noformat}
However, that’s not what the code is doing now, not in the leader at least. We 
block, run locally (like we do with other operations), unblock, then we send 
the DBQ to followers by calling {{cmdDistrib.distribDelete(cmd, replicas, 
params, *false*, rollupReplicationTracker, leaderReplicationTracker);}}, and 
then call the {{cmdDistrib.blockAndDoRetries();}}.  The problem with that is 
that inside the cmdDistrib things can be reordered (and even more now since we 
are adding retries to updates), the DBQ needs to be the last request to be 
executed otherwise it can miss docs.

I think that call to {{cmdDistrib.distribDelete}} needs to be 
{{synchronous=true}}, that way we’ll flush (and retry) all updates before 
sending the DBQ, then send the DBQ and flush, and then continue. I’ll try to 
work on a test for that, but some feedback would be great. [~ysee...@gmail.com]

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> 

[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466856#comment-16466856
 ] 

Mark Miller commented on SOLR-11881:


Patch looks nice [~tomasflobbe] - very clean.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466841#comment-16466841
 ] 

Mark Miller commented on SOLR-11881:


Yeah, I'd be hesitant to add retries to DBQ without input from 
[~ysee...@gmail.com].

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466833#comment-16466833
 ] 

Tomás Fernández Löbbe commented on SOLR-11881:
--

I updated the CR with a new patch. Added a test for minRf, but this is more 
deeply tested in ReplicationFactorTest (that test now takes longer because of 
the retries. I'm thinking in either making the wait time configurable or modify 
it for test purposes only). ReplicationFactorTest is marked as {{@BadApple}} 
pointing to SOLR-6944, this retry logic will probably fix that one. I haven't 
seen failures of that test so far.
There is one nocommit in the code, I'm wondering if we want to keep the retries 
for DBQs. I'm thinking in setting the retry count for DBQs to 0, since those 
are not versioned AFAIK.
Another thing I noticed is that we sleep after each error retried (so if we 
need to retry two requests to two hosts, we sleep before the first request, and 
sleep before the second one). This seems odd, I think we want to sleep before 
retrying a batch of errors. I won't be changing this here though, I'll create a 
new Jira for that.
I'll be running some tests with the current patch, feel free to review and let 
me know if you have any thoughts

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> 

[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-04 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464563#comment-16464563
 ] 

Mark Miller commented on SOLR-11881:


bq. if we don't set the timeout on the HttpPost request by setting a request 
config

Cool - the tricky part is, if only one of the properties of the two is 
overridden on the client itself on the fly, we still want to pick up the 
default for the other one.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-04 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464557#comment-16464557
 ] 

Varun Thacker commented on SOLR-11881:
--

{quote} I think if timeouts are null, we need to try and pull them from the 
httpclient?
{quote}
I followed the code and if we don't set the timeout on the HttpPost request by 
setting a request config , it will use the default request config. In our case 
we set the default request config while creating the httpclient in 
HttpClientUtil#setupBuilder so it will use the values defined in the solr.xml 
file . I'll file a separate Jira right now
{code:java}
RequestConfig requestConfig = requestConfigBuilder.build();

HttpClientBuilder retBuilder = 
builder.setDefaultRequestConfig(requestConfig);{code}

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-04 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464539#comment-16464539
 ] 

Mark Miller commented on SOLR-11881:


[~varunthacker], when we update the httpclient we could no longer set things we 
did after construction - but our client api let you change these settings on 
the fly (I see those methods are deprecated now - good). So this was to not 
break that. I think if timeouts are null, we need to try and pull them from the 
httpclient?

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464533#comment-16464533
 ] 

Tomás Fernández Löbbe commented on SOLR-11881:
--

I believe my current patch breaks the "minRf" behavior. I'll take a look and 
add a test for that

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-04 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464502#comment-16464502
 ] 

Varun Thacker commented on SOLR-11881:
--

So recently I've been seeing this problem in this form:

- The replica get's a ReadPendingException from Jetty 
{code:java}
date time WARN [qtp768306356-580185] ? (:) - 
java.nio.channels.ReadPendingException: null
at org.eclipse.jetty.io.FillInterest.register(FillInterest.java:58) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.io.AbstractEndPoint.fillInterested(AbstractEndPoint.java:353) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]{code}
 * The leader keeps waiting till socket timeout and then get's a socket timeout 
exception and put's the replica into recovery

So I took Tomás latest patch and added SocketTimeoutException to the 
{{isRetriableException}} check. 

Q: What all exceptions should we retry on? Currently in the patch we have 
SocketException / NoHttpResponseException

 

Once I added SocketTimeoutException as a retriable exception , I then set the 
socket timeout to 100ms and sent updates to manually test if Solr's retrying 
correctly . To my surprise I was never able to hit a socket timeout exception . 
After some debugging here's why

In ConcurrentUpdateSolrClient we do this
{code:java}
org.apache.http.client.config.RequestConfig.Builder requestConfigBuilder = 
HttpClientUtil.createDefaultRequestConfigBuilder();
if (soTimeout != null) {
  requestConfigBuilder.setSocketTimeout(soTimeout);
}
if (connectionTimeout != null) {
  requestConfigBuilder.setConnectTimeout(connectionTimeout);
}

method.setConfig(requestConfigBuilder.build());{code}
So createDefaultRequestConfigBuilder doesn't respect the timeout set in 
solr.xml and uses a default of 60 seconds.

I debugged the code and if we simply remove these lines then the http-client 
will use the default requestConfig which Solr creates with the settings 
specified from the solr.xml file. 

Mark : Do you remember the motivation for overriding the defaults from update 
shard handlers httpclient and explicitly specifying a RequestConfig in CUSC? 
Happy to track this in a separate Jira 

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> 

[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464325#comment-16464325
 ] 

Tomás Fernández Löbbe commented on SOLR-11881:
--

Uploaded a new patch to https://reviews.apache.org/r/66967/

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461686#comment-16461686
 ] 

Mark Miller commented on SOLR-11881:


Yeah, now I remember why forward has a retry and why it was so high - same 
issue, to survive chaos monkey tests, even when you run them longer and if you 
run them over and over. So at least for the forwarding, I wouldn't lower it 
much without good confidence with beasting chaos monkey tests running a good 
amount of time (default test run times are somewhat low).

Basically, update forwarding to the leader allows the cloud client to fall back 
to sending to non leaders and get held up rather than having those updates fail 
and forcing the user to resolve it. Perhaps the client should just block 
updates itself for a while waiting to see a leader - but then it has to have 
kind of special logic - right now even a php client could take advantage of 
this by just falling back to sending updates to non leaders while failover 
happens.

I have no problem with updates from leader to replica retrying less.



> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To 

[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461629#comment-16461629
 ] 

Mark Miller commented on SOLR-11881:


Yeah, good catch - we are not versioned yet on forward.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461499#comment-16461499
 ] 

Tomás Fernández Löbbe commented on SOLR-11881:
--

bq. Yes it was, because it can happen mid request 
Ah! Good point. So we probably still don't to retry on those for the forwards, 
but we are OK with retrying on the FROMLEADER requests...

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461487#comment-16461487
 ] 

Mark Miller commented on SOLR-11881:


bq. But then I was looking at the ChaosMonkey logs and the amount of success 
after retries increased a lot in retries 5 to 10.

Yeah, okay, it's probably waiting for failover. I guess that is fine. That is 
probably how it went so high to begin with - allowing the forward to leader 
requests to wait for a new leader.

bq. I'm not sure if this was done with SocketException intentionally

Yes it was, because it can happen mid request and we don't know if the request 
failed or succeeded. Given we are counting on versions for retry though, this 
actually shouldnt matter, so that should be fine.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461455#comment-16461455
 ] 

Tomás Fernández Löbbe commented on SOLR-11881:
--

bq. What's the logic for removing the retry on that?
Not removing, {{ConnectException}} is a {{SocketException}} so it should be 
retried. Things like "broken pipe" are SocketExceptions and I think it should 
be fine to retry too. One thing though, I noticed that in 
{{SolrCmdDistributorTest}} there is a test case to explicitly validate that we 
don't retry on {{SocketException}}. I'm not sure if this was done with 
SocketException intentionally (because there is something I'm missing about 
this error case) or if this is just an example of exception was was not retried 
on. 
bq. I think something like 3 is good
That was my original plan too. But then I was looking at the ChaosMonkey logs 
and the amount of success after retries increased a lot in retries 5 to 10. I 
know this is just a synthetic situation but it's the best I have now. I'm 
thinking also in terms of time spent in retries, we wait 500 ms between 
retries, and 2.5 secs doesn't sound too bad if the consequence is saving Solr 
from a recovery. The impact on the other hand is slower updates in cases of 
single replicas being slow/faulty. Maybe this should be made configurable?

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> 

[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461353#comment-16461353
 ] 

Mark Miller commented on SOLR-11881:


Cool, looks like the right approach.

bq. ConnectException

What's the logic for removing the retry on that?

bq. I plan to reduce the number of retries

I think something like 3 is good, I think that is what we use at the HttpClient 
level.


> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX 
> r:core_node56 collection_shardX_replicaY] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-05-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460289#comment-16460289
 ] 

Tomás Fernández Löbbe commented on SOLR-11881:
--

Attached a rough patch that handles the retry in SolrCmdDistributor:
* I only added retry to requests from leader to it's replicas.
* Didn't add any tests yet, I've been running the ChaosMonkey to see how the 
retries behave
* I change the retry exception from only {{ConnectException}} to 
{{SocketException}} or {{NoHttpResponseException}}
* I plan to reduce the number of retries for this case (25 sounds like a lot, I 
was thinking of 5 or 10 max, but I'm open to suggestions)

[~varunthacker], [~markrmil...@gmail.com] let me know what you think

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, 
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-30 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458341#comment-16458341
 ] 

Varun Thacker commented on SOLR-11881:
--

> For the second exception here is something similiar: 
>[https://github.com/eclipse/jetty.project/issues/1047]

Yeah that issue seems to have been fixed but here's the mailing list thread 
that the jetty folks pointed me to a bug with ssl/async : 
[https://dev.eclipse.org/mhonarc/lists/jetty-dev/msg03165.html]
 

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-30 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458338#comment-16458338
 ] 

Mark Miller commented on SOLR-11881:


Another possibility (beyond some different bug) for the second one is that it's 
reusing connections and there is a case we don't fully clear the connection 
streams. Normally, Jetty will just not reuse that connection if that happens, 
perhaps SSL with this new async stuff can hit this if that happens though. Just 
a guess and reminder that I should review our code that ensures everyone is 
fully reading streams.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-29 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458334#comment-16458334
 ] 

Mark Miller commented on SOLR-11881:


For the second exception here is something similiar: 
https://github.com/eclipse/jetty.project/issues/1047

Thats fixed in jetty-9.3.17.v20170317 it looks though.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-29 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458322#comment-16458322
 ] 

Mark Miller commented on SOLR-11881:


[~varunthacker], look at the exception in the summary, you can almost be sure 
this is SOLR-12290. It's trying to just start the connection and on the first 
read finds out it's closed. This is the normal signature for when a server 
connection gets improperly closed.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-29 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458319#comment-16458319
 ] 

Mark Miller commented on SOLR-11881:


bq. I’m not sure we can know that updates before the failure were consumed 
(even if we call flush for each one)

Yeah, so we are sending individual requests on a single connection - each 
request is good if you get a good response and requests are fully serial per 
connection. So the information is there, it's just how hard is it to use as we 
need given the current code.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-29 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458316#comment-16458316
 ] 

Mark Miller commented on SOLR-11881:


SOLR-12290 may help with the cause of the connection reset. Hard to say though, 
SSL has different connection semantics I believe and we have not looked into 
how fully reusable they are.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-27 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457345#comment-16457345
 ] 

Mark Miller commented on SOLR-11881:


I think we may be able to use the RetryNode stuff that update forwarding to the 
leader does or something similar. I’d have to refresh, but I believe if we send 
an update, even streaming, we get a response for that update. If it’s success 
we are good. I don’t think we want to retry with the ConcurrentUpdateClient, 
but instead like forwards do with SolrCmdDistributor.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457273#comment-16457273
 ] 

Tomás Fernández Löbbe commented on SOLR-11881:
--

If I’m reading the code correctly, this RetryHandler is called in two cases, 
after an exception trying to establish the connection, and after an exception 
executing the request. Retrying in the first case should be fine, in the 
second, not so easy.
The way we do streaming is by keeping the connection and having an 
{{EntityTemplate}} that reads updates from a queue and writes each one and 
flushes. If I understand correctly, if we want to retry the request instead of 
throwing an error we need to retry the update too, by putting it back in the 
queue. Even then, I’m not sure we can know that updates before the failure were 
consumed (even if we call flush for each one)

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-27 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457124#comment-16457124
 ] 

Mark Miller commented on SOLR-11881:


bq. ConcurrentUpdateSolrClient#sendUpdateStream

Sounds right.

It should be using JavaBin and we only stream. It's the only way to do 
efficient high volume indexing. The only case you can really get away with no 
doing it is if you know the request is single document per request. That's how 
things used to work (even if you batch or streamed to the leader, it was split 
up into document per request), but it only works well with low load.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-27 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457076#comment-16457076
 ] 

Varun Thacker commented on SOLR-11881:
--

Hi Mark,

ConcurrentUpdateSolrClient#sendUpdateStream is the relevant code sending the 
update from leader->replica right? 

I don't know this piece of code very closely but do we only stream for xml ?  

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-27 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456848#comment-16456848
 ] 

Mark Miller commented on SOLR-11881:


Maybe we need to implement our own retry in distrib update handler.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-27 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456842#comment-16456842
 ] 

Mark Miller commented on SOLR-11881:


It's been a while since I've thought about this, mind is begin to churn again.

Does this even help? We stream updates from leader to the client, and streaming 
cannot be retried, you'd have to buffer the stream or something. It gets a 
can't retry exception.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-27 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456743#comment-16456743
 ] 

Mark Miller commented on SOLR-11881:


bq. is there anything that uses the update client where a retry would be a 
problem?

Hmm, perhaps when a replica forwards an update to the leader.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-27 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456742#comment-16456742
 ] 

Mark Miller commented on SOLR-11881:


We should also consider turning retry on for the read side as well. It's only 
done on IOException and you might not have another replica to retry to.

For the updates side I'm wondering if we should not just turn it on in general. 
We explicitly disable admin request from retry, is there anything that uses the 
update client where a retry would be a problem?

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-27 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456702#comment-16456702
 ] 

Mark Miller commented on SOLR-11881:


bq. Update requests between the leader and replica should be retry-able since 
they have been versioned.

Yes, nice, this was a big miss. It's the user client that can't easily auto 
retry.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456691#comment-16456691
 ] 

Tomás Fernández Löbbe commented on SOLR-11881:
--

+1

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-04-25 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452907#comment-16452907
 ] 

Varun Thacker commented on SOLR-11881:
--

We ran into another such scenario where a SocketTimeoutException caused LIR.

The replica had this in it's logs
{code:java}
date time WARN [qtp768306356-580185] ? (:) - 
java.nio.channels.ReadPendingException: null
at org.eclipse.jetty.io.FillInterest.register(FillInterest.java:58) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.io.AbstractEndPoint.fillInterested(AbstractEndPoint.java:353) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.io.AbstractConnection.fillInterested(AbstractConnection.java:134)
 ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) 
~[jetty-server-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
 ~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:289) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.ssl.SslConnection$3.succeeded(SslConnection.java:149) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124) 
~[jetty-io-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)
 ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
 ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
 ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
 ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
 ~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626) 
~[jetty-util-9.4.8.v20171121.jar:9.4.8.v20171121]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0-zing_17.11.0.0]

date time WARN [qtp768306356-580185] ? (:) - Read pending for 
org.eclipse.jetty.server.HttpConnection$BlockingReadCallback@2e98df28 prevented 
AC.ReadCB@424271f8{HttpConnection@424271f8[p=HttpParser{s=START,0 of 
-1},g=HttpGenerator@424273ae{s=START}]=>HttpChannelOverHttp@4242713d{r=141,c=false,a=IDLE,uri=null}<-DecryptedEndPoint@4242708d{/host:52824<->/host:port,OPEN,fill=FI,flush=-,to=1/86400}->HttpConnection@424271f8[p=HttpParser{s=START,0
 of -1},g=HttpGenerator@{code}
And the leader waited exactly the socket timeout period after this error and 
threw a socket-timeout-exception . At that point the leader put the replica 
into recovery 

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> 

[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-01-25 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340481#comment-16340481
 ] 

Varun Thacker commented on SOLR-11881:
--

Hi Ere,

 

No these are different issues. We should fix both but this one's separate.

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11881) Connection Reset Causing LIR

2018-01-22 Thread Ere Maijala (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334040#comment-16334040
 ] 

Ere Maijala commented on SOLR-11881:


Does this also fix SOLR-9826?

> Connection Reset Causing LIR
> 
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Priority: Major
> Attachments: SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will 
> initiate LIR
> {code}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:person s:shard7_1 
> r:core_node56 x:person_shard7_1_replica1] 
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on 
> replica https://host08.domain:8985/solr/person_shard7_1_replica2/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy 
> working SolrCloud cluster, even a rare response like this from a replica can 
> cause a recovery and heavy cluster disruption" . 
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET 
> requests. Updates are POST requests 
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since 
> they have been versioned. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org