[jira] [Comment Edited] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.

Jessica Cheng Mallet (JIRA) Fri, 22 Aug 2014 12:02:30 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107289#comment-14107289
 ]


Jessica Cheng Mallet edited comment on SOLR-6405 at 8/22/14 7:01 PM:
---------------------------------------------------------------------

Right, most likely the first time it hits the ConnectionLoss is not time=0 of 
the connection loss, so by loop i=4, it would've slept for 15s since i=0 
and therefore hit a SessionExpired.
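
To make the 15s figure concrete, here is a minimal sketch of the arithmetic. 
It assumes a linear back-off in which attempt i sleeps i * 1.5s before 
retrying; that step size is an assumption chosen because it reproduces the 
15s total above, and it is not stated in this comment. The class and method 
names are hypothetical.

```java
// Hypothetical helper: cumulative sleep under an ASSUMED linear back-off,
// where attempt i sleeps i * stepMs before the next try.
final class BackoffMath {
    static long cumulativeSleepMs(int lastAttempt, long stepMs) {
        long total = 0;
        for (int i = 0; i <= lastAttempt; i++) {
            total += i * stepMs; // i=0 sleeps nothing, i=1 sleeps one step, ...
        }
        return total;
    }

    public static void main(String[] args) {
        // With an assumed 1.5s step: 0, 1500, 4500, 9000, 15000 ms.
        for (int i = 0; i <= 4; i++) {
            System.out.println("i=" + i + " slept so far: "
                    + cumulativeSleepMs(i, 1500L) + " ms");
        }
    }
}
```

If the connection was already lost for some time before i=0, the real 
elapsed time at i=4 is more than 15s, which is why a session with a timeout 
of about that length would already have expired.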

But then, thinking about it again, why be clever at all about the padding or 
back-off?

Not to propose that we change this now, but let's pretend we don't do back-off 
and just sleep 1s between each loop. If we were to get ConnectionLoss back in 
the next attempt, there's no harm in trying at all, because if we're 
disconnected, the attempt wouldn't be hitting ZooKeeper anyway. If we were to 
get SessionExpired back, great, we can break out now and throw the exception. 
If we've reconnected, then yay, we succeeded. Because with each call we're 
expecting to get either success, failure (SessionExpired), or "in progress" 
(ConnectionLoss), we can really just retry "forever" without limiting the loop 
count (unless we're worried that somehow we'll keep getting ConnectionLoss even 
though the session has expired, but that'd be a pretty serious ZooKeeper client 
bug; and if we're really worried about that, we can always do, say, 10 more 
loops after we have already slept for a total of the session timeout). The 
advantage of this approach is that we never sleep for too long before finding 
out the definitive answer of success or SessionExpired, while if the answer is 
ConnectionLoss, it's not really incurring any extra load on ZooKeeper anyway.
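
The loop described above could be sketched roughly as follows. This is only 
an illustration under stated assumptions, not the code in the attached 
patch: the exception classes are local stand-ins for ZooKeeper's 
ConnectionLoss and SessionExpired exceptions, and FixedDelayRetry, Sleeper, 
and all parameter names are hypothetical.

```java
import java.util.concurrent.Callable;

// Sketch of "sleep a fixed delay, retry until success or SessionExpired".
final class FixedDelayRetry {
    // Local stand-ins (assumptions) for the ZooKeeper client exceptions.
    static class ConnectionLoss extends Exception {}
    static class SessionExpired extends Exception {}

    // Injected so the delay can be faked in tests; Thread::sleep fits it.
    interface Sleeper { void sleep(long ms) throws InterruptedException; }

    static <T> T retry(Callable<T> op, long delayMs, long sessionTimeoutMs,
                       int extraLoops, Sleeper sleeper) throws Exception {
        long sleptMs = 0;
        int failuresPastTimeout = 0;
        while (true) {
            try {
                // Success returns here; SessionExpired (or any other
                // exception) propagates to the caller immediately.
                return op.call();
            } catch (ConnectionLoss e) {
                // Safety valve: once we have slept for the whole session
                // timeout, give up after extraLoops further failed attempts.
                if (sleptMs >= sessionTimeoutMs
                        && ++failuresPastTimeout >= extraLoops) {
                    throw e;
                }
                sleeper.sleep(delayMs); // fixed delay, no back-off
                sleptMs += delayMs;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fake operation: two ConnectionLoss results, then success.
        String result = retry(() -> {
            if (calls[0]++ < 2) throw new ConnectionLoss();
            return "ok";
        }, 1000L, 15000L, 10, Thread::sleep);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With the safety valve removed, the only ways out of the loop are success or 
SessionExpired, which is the behavior the last paragraph of this comment 
argues the method should have.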

In the end, it's really weird that this method should ever semantically allow 
throwing a ConnectionLoss exception, if we got the math wrong, because the 
intent is to retry until we get a SessionExpired, isn't it?


was (Author: mewmewball):
Right, most likely the first time it hits the ConnectionLoss is not time=0 of 
the connection loss, so by loop i=4, it would've slept for 15s since i=0 
and therefore hit a SessionExpired.

But then, thinking about it again, why be clever at all about the padding or 
back-off?

Not to propose that we change this now, but let's pretend we don't do back-off 
and just sleep 1s between each loop. If we were to get ConnectionLoss back in 
the next attempt, there's no harm in trying at all, because if we're 
disconnected, the attempt wouldn't be hitting ZooKeeper anyway. If we were to 
get SessionExpired back, great, we can break out now and throw the exception. 
If we've reconnected, then yay, we succeeded. Because with each call we're 
expecting to get either success, failure (SessionExpired), or "in progress" 
(ConnectionLoss), we can really just retry "forever" without limiting the loop 
count (unless we're worried that somehow we'll keep getting ConnectionLoss even 
though the session has expired, but that'd be a pretty serious ZooKeeper client 
bug; and if we're really worried about that, we can always do, say, 10 more 
loops after we have already slept for a total of the session timeout).

In the end, it's really weird that this method should ever semantically allow 
throwing a ConnectionLoss exception, if we got the math wrong, because the 
intent is to retry until we get a SessionExpired, isn't it?

> ZooKeeper calls can easily not be retried enough on ConnectionLoss.
> -------------------------------------------------------------------
>
>                 Key: SOLR-6405
>                 URL: https://issues.apache.org/jira/browse/SOLR-6405
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.10
>
>         Attachments: SOLR-6405.patch
>
>
> The current design requires that we are sure we retry on connection loss 
> until session expiration.



