John Gallagher created SOLR-11932:
-------------------------------------

             Summary: ZkCmdExecutor: Retry ZkOperation on SessionExpired
                 Key: SOLR-11932
                 URL: https://issues.apache.org/jira/browse/SOLR-11932
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: 7.2
            Reporter: John Gallagher
         Attachments: SessionExpiredLog.txt, zk_retry.patch

We are seeing situations where an operation, such as changing a replica's state 
to active after a recovery, fails because the ZooKeeper session has expired.

However, these operations appear to be retryable, because the 
ZookeeperConnect receives an event that the session expired and tries to 
reconnect.

That makes the SessionExpired handling scenario very similar to the 
ConnectionLoss handling scenario, so it seems the ZkCmdExecutor could 
handle both in the same way.

Here's an example stack trace with some slight redactions: 
[^SessionExpiredLog.txt]. In this case, a zk operation (a read) failed with a 
SessionExpired event, which seems retryable. The exception kicked off a 
reconnection, but it seems the subsequent operation (publishing as active) 
failed, perhaps because it was using a stale connection handle at that point.

Regardless, the watch mechanism that reestablishes the connection on 
SessionExpired seems sufficient to allow the ZkCmdExecutor to retry that 
operation at a later time with a reasonable chance of succeeding.

I have included a simple patch we are trying that catches both exceptions 
instead of just ConnectionLossException: [^zk_retry.patch]
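
For readers without the attachment, the idea can be sketched roughly as 
follows. This is not the contents of zk_retry.patch and not Solr's actual 
ZkCmdExecutor code: the exception classes are local stand-ins for ZooKeeper's 
KeeperException.ConnectionLossException and 
KeeperException.SessionExpiredException so the example compiles on its own, 
and the ZkOperation interface, the retryOperation signature, and the 
retryCount/retryDelayMs parameters are assumptions made for illustration.

```java
// Sketch only. All types here are hypothetical stand-ins, not ZooKeeper or Solr classes.
class RetrySketch {
    // Stand-in for org.apache.zookeeper.KeeperException.
    static class KeeperException extends Exception {
        KeeperException(String message) { super(message); }
    }
    // Stand-in for KeeperException.ConnectionLossException.
    static class ConnectionLossException extends KeeperException {
        ConnectionLossException() { super("ConnectionLoss"); }
    }
    // Stand-in for KeeperException.SessionExpiredException.
    static class SessionExpiredException extends KeeperException {
        SessionExpiredException() { super("SessionExpired"); }
    }

    // Hypothetical operation interface, named after the ZkOperation in the summary.
    interface ZkOperation<T> {
        T execute() throws KeeperException, InterruptedException;
    }

    // Runs the operation up to retryCount times, retrying on either exception.
    static <T> T retryOperation(ZkOperation<T> operation, int retryCount, long retryDelayMs)
            throws KeeperException, InterruptedException {
        if (retryCount < 1) {
            throw new IllegalArgumentException("retryCount must be at least 1");
        }
        KeeperException first = null;
        for (int attempt = 0; attempt < retryCount; attempt++) {
            try {
                return operation.execute();
            } catch (ConnectionLossException | SessionExpiredException e) {
                // The proposed change: treat SessionExpired like ConnectionLoss.
                if (first == null) {
                    first = e; // remember the first failure to rethrow if all attempts fail
                }
                if (attempt != retryCount - 1) {
                    // Back off a little longer on each attempt so a reconnect can complete.
                    Thread.sleep(retryDelayMs * (attempt + 1));
                }
            }
        }
        throw first;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated operation: the first attempt hits an expired session, the second succeeds.
        String result = retryOperation(() -> {
            if (calls[0]++ == 0) {
                throw new SessionExpiredException();
            }
            return "published";
        }, 3, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

One caveat for any approach along these lines: an expired session invalidates 
the ZooKeeper client handle, so a retry can only succeed if the operation picks 
up the handle created by the reconnect rather than reusing the stale one.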



