[ 
https://issues.apache.org/jira/browse/SOLR-6402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106433#comment-14106433
 ] 

Jessica Cheng Mallet commented on SOLR-6402:
--------------------------------------------

{quote}
All ZK manipulation should be through SolrZkClient, which should use 
ZkCmdExecutor to retry on connection loss past expiration unless explicitly 
asked not to.
{quote}
Ah, I missed that.

So I took a look at ZkCmdExecutor.retryOperation(). With the default 15s 
timeout (and therefore retryCount=5), it sleeps as follows after each failed 
attempt:
i    sleep
0    0s
1    1.5s
2    3s
3    4.5s
4    6s

These add up to 15s, the timeout. However, suppose the operation throws 
connection loss again on loop i=4. Because the sleep is at the end of the 
catch block, the executor sleeps one last time for 6s and then gives up without 
trying the operation again, even if the client reconnected during that sleep 
and the session never expired. Maybe the intent is to call retryDelay(i+1), so 
that it sleeps 1.5s when i=0, ..., and 6s when i=3, and the i=4 retry happens 
at the end of the 15s?
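
To make the timing concrete, here is a rough sketch that computes when each 
attempt would start under both variants, without actually sleeping. It is not 
the ZkCmdExecutor code: the class and method names are made up for 
illustration, the 1.5s base delay and retryCount=5 are taken from the table 
above, and it assumes every attempt fails immediately with connection loss.

```java
import java.util.ArrayList;
import java.util.List;

public class RetrySchedule {
    // Assumptions taken from the table above, not read from ZkCmdExecutor itself.
    static final long BASE_DELAY_MS = 1500;
    static final int RETRY_COUNT = 5;

    /**
     * Returns the time (ms since the first attempt) at which each attempt starts,
     * assuming every attempt fails at once and the sleep happens at the end of
     * the catch block. The sleep after attempt i is (i + offset) * BASE_DELAY_MS,
     * so offset=0 models retryDelay(i) and offset=1 models retryDelay(i+1).
     */
    static List<Long> attemptTimes(int offset) {
        List<Long> times = new ArrayList<>();
        long elapsed = 0;
        for (int i = 0; i < RETRY_COUNT; i++) {
            times.add(elapsed);                      // attempt i starts here
            elapsed += (i + offset) * BASE_DELAY_MS; // sleep after the failure
        }
        return times;
    }

    public static void main(String[] args) {
        System.out.println("retryDelay(i):   " + attemptTimes(0));
        System.out.println("retryDelay(i+1): " + attemptTimes(1));
    }
}
```

Under these assumptions, retryDelay(i) makes its last attempt at 9s and then 
sleeps 6s for nothing, while retryDelay(i+1) makes its last attempt at 15s.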

Disclaimer: I don't actually know whether this is what happened. Like I said, 
all I have is that one log message and the fact that 
OverseerCollectionProcessor died while the ClusterStateUpdater didn't.

> OverseerCollectionProcessor should not exit for ZK ConnectionLoss
> -----------------------------------------------------------------
>
>                 Key: SOLR-6402
>                 URL: https://issues.apache.org/jira/browse/SOLR-6402
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.8, 5.0
>            Reporter: Jessica Cheng Mallet
>            Assignee: Mark Miller
>             Fix For: 5.0, 4.10
>
>
> We saw an occurrence where we had a ZK connection blip and the 
> OverseerCollectionProcessor thread stopped, while the ClusterStateUpdater 
> logged an error but kept running, and the node didn't lose its leadership. 
> This caused our collection work queue to back up.
> Right now, on trunk, OverseerCollectionProcessor's run method has:
> {quote}
>           if (e.code() == KeeperException.Code.SESSIONEXPIRED
>               || e.code() == KeeperException.Code.CONNECTIONLOSS) {
>             log.warn("Overseer cannot talk to ZK");
>             return;
>           }
> {quote}
> I think this if statement should only be for SESSIONEXPIRED. If it just 
> experiences a connection loss but then reconnects before the session expires, 
> it'll keep being the leader.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
