[
https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107289#comment-14107289
]
Jessica Cheng Mallet edited comment on SOLR-6405 at 8/22/14 7:01 PM:
---------------------------------------------------------------------
Right, most likely the first time it hits the ConnectionLoss is not time=0 of
the connection loss, so by loop i=4 it would've slept for 15s since i=0
and therefore hit a SessionExpired.
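To make the 15s figure concrete, here is a minimal sketch of the arithmetic. It
assumes a linear back-off where attempt i sleeps i times a base delay, and it
assumes a base delay of 1.5s; neither value is stated in this comment, and the
class and method names are hypothetical.

```java
// Sketch only: BackoffSum and cumulativeSleepMs are hypothetical names, and the
// linear "i * baseDelayMs" back-off is an assumption made for illustration.
final class BackoffSum {
    /** Total time slept once attempts 1..lastAttempt have each slept i * baseDelayMs. */
    static long cumulativeSleepMs(int lastAttempt, long baseDelayMs) {
        long total = 0;
        for (int i = 1; i <= lastAttempt; i++) {
            total += i * baseDelayMs; // attempt i waits i times the base delay
        }
        return total;
    }
}
```

Under those assumptions, cumulativeSleepMs(4, 1500) is 1500 * (1 + 2 + 3 + 4) =
15000 ms. Any time that had already passed since the real connection loss
before the first attempt comes on top of that.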
But then, thinking about it again, why be clever at all about the padding or
back-off?
Not to propose that we change this now, but let's pretend we don't do back-off
and just sleep 1s between each loop. If we were to get ConnectionLoss back in
the next attempt, there's no harm in trying at all because if we're
disconnected, the attempt wouldn't be hitting ZooKeeper anyway. If we were to
get SessionExpired back, great, we can break out now and throw the exception.
If we've reconnected, then yay, we succeeded. Because with each call we're
expecting to get either success, failure (SessionExpired), or "in progress"
(ConnectionLoss), we can really just retry "forever" without limiting the loop
count. The exception is if we're worried that somehow we'll keep getting
ConnectionLoss even though the session has expired, but that'd be a pretty
serious ZooKeeper client bug, and if we're really worried about that, we can
always do, say, 10 more loops after we have slept a total of timeout already.
The advantage of this approach is that we never sleep for too long before
finding out the definitive answer of success or SessionExpired, while if the
answer is ConnectionLoss, it's not really incurring any extra load on
ZooKeeper anyway.
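A rough sketch of the loop described above, not the actual Solr code. The
exception classes are hypothetical stand-ins for ZooKeeper's ConnectionLoss and
SessionExpired exceptions so that the example compiles on its own. The method
name, its parameters, and the choice to throw IllegalStateException when the
safety valve trips are all assumptions.

```java
import java.util.concurrent.Callable;

// Sketch only: every name here is hypothetical, not Solr's or ZooKeeper's API.
final class RetryUntilExpired {
    /** Stand-in for a ConnectionLoss result: outcome still "in progress". */
    static class ConnectionLossException extends Exception {}

    /** Stand-in for a SessionExpired result: definitive failure. */
    static class SessionExpiredException extends Exception {}

    /**
     * Calls op until it succeeds or fails with something other than
     * ConnectionLoss, sleeping a fixed pauseMs between attempts (no back-off).
     * Once the total sleep has reached timeoutMs, at most extraLoops further
     * sleep-and-retry rounds are allowed before giving up.
     */
    static <T> T retry(Callable<T> op, long pauseMs, long timeoutMs, int extraLoops)
            throws Exception {
        long sleptMs = 0;
        int extraUsed = 0;
        while (true) {
            try {
                return op.call(); // success; SessionExpired simply propagates
            } catch (ConnectionLossException e) {
                if (sleptMs >= timeoutMs) {
                    if (extraUsed == extraLoops) {
                        // Assumed handling: still "disconnected" long after the
                        // session should have expired, so treat it as a client bug.
                        throw new IllegalStateException(
                                "ConnectionLoss persisted past session timeout", e);
                    }
                    extraUsed++;
                }
                Thread.sleep(pauseMs);
                sleptMs += pauseMs;
            }
        }
    }
}
```

A call such as retry(op, 1000, sessionTimeoutMs, 10) would correspond to the
"sleep 1s, then 10 more loops after the timeout" variant. With this shape the
caller never sees ConnectionLoss itself, only success, SessionExpired, or the
safety-valve failure.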
In the end, it's really weird that this method can ever throw a ConnectionLoss
exception just because we got the math wrong, since the intent is to retry
until we get a SessionExpired, isn't it?
> ZooKeeper calls can easily not be retried enough on ConnectionLoss.
> -------------------------------------------------------------------
>
> Key: SOLR-6405
> URL: https://issues.apache.org/jira/browse/SOLR-6405
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Mark Miller
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 5.0, 4.10
>
> Attachments: SOLR-6405.patch
>
>
> The current design requires that we are sure we retry on connection loss
> until session expiration.