[
https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472495#comment-16472495
]
Tomás Fernández Löbbe commented on SOLR-11881:
----------------------------------------------
I was looking into DBQ a bit more. It is actually versioned (which I didn’t
know) so in theory we could retry too. But there is another issue with them.
In the {{doDeleteByQuery}} method we have this comment:
{noformat}
// NONE: we are the first to receive this deleteByQuery
// - it must be forwarded to the leader of every shard
// TO: we are a leader receiving a forwarded deleteByQuery... we must:
// - block all updates (use VersionInfo)
// - flush *all* updates going to our replicas
// - forward the DBQ to our replicas and wait for the response
// - log + execute the local DBQ
// FROM: we are a replica receiving a DBQ from our leader
// - log + execute the local DBQ
{noformat}
However, that’s not what the code is doing now, not in the leader at least. We
block, run locally (like we do with other operations), unblock, then we send
the DBQ to followers by calling {{cmdDistrib.distribDelete(cmd, replicas,
params, *false*, rollupReplicationTracker, leaderReplicationTracker);}}, and
then call the {{cmdDistrib.blockAndDoRetries();}}. The problem with that is
that inside the cmdDistrib things can be reordered (and even more now since we
are adding retries to updates), the DBQ needs to be the last request to be
executed otherwise it can miss docs.
I think that call to {{cmdDistrib.distribDelete}} needs to be
{{synchronous=true}}, that way we’ll flush (and retry) all updates before
sending the DBQ, then send the DBQ and flush, and then continue. I’ll try to
work on a test for that, but some feedback would be great. [[email protected]]
> Connection Reset Causing LIR
> ----------------------------
>
> Key: SOLR-11881
> URL: https://issues.apache.org/jira/browse/SOLR-11881
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Varun Thacker
> Assignee: Varun Thacker
> Priority: Major
> Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch,
> SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update get's a connection like this the leader will
> initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX
> r:core_node56 collection_shardX_replicaY]
> o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on
> replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
> at
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
> at
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
> at
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
> at
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
> at
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
> at
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
> at
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
> at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy
> working SolrCloud cluster, even a rare response like this from a replica can
> cause a recovery and heavy cluster disruption" .
> Looking at SOLR-6931 we added a http retry handler but we only retry on GET
> requests. Updates are POST requests
> {{ConcurrentUpdateSolrClient#sendUpdateStream}}
> Update requests between the leader and replica should be retry-able since
> they have been versioned.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]