John Gallagher created SOLR-11932:
-------------------------------------

             Summary: ZkCmdExecutor: Retry ZkOperation on SessionExpired
                 Key: SOLR-11932
                 URL: https://issues.apache.org/jira/browse/SOLR-11932
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: 7.2
            Reporter: John Gallagher
         Attachments: SessionExpiredLog.txt, zk_retry.patch

We are seeing situations where an operation, such as changing a replica's state to active after a recovery, fails because the ZooKeeper session has expired. These operations appear to be retryable, because the ZookeeperConnect receives an event that the session expired and tries to reconnect. That makes the SessionExpired case look very similar to the ConnectionLoss case, so the ZkCmdExecutor could plausibly handle both in the same way.

Here's an example stack trace with some slight redactions: [^SessionExpiredLog.txt]

In this case, a ZooKeeper operation (a read) failed with a SessionExpired event, which seems retryable. The exception kicked off a reconnection, but it seems the subsequent operation (publishing as active) failed, perhaps because it was using a stale connection handle at that point. Regardless, the watch mechanism that reestablishes the connection on SessionExpired seems sufficient to let the ZkCmdExecutor retry that operation later with a reasonable chance of succeeding.

I have included a simple patch we are trying that catches both exceptions instead of just ConnectionLossException: [^zk_retry.patch]
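
To illustrate the idea of treating both exceptions as retryable, here is a minimal, self-contained sketch. It is not the attached patch and not Solr's actual ZkCmdExecutor code. The exception classes are hypothetical stand-ins for ZooKeeper's KeeperException.ConnectionLossException and KeeperException.SessionExpiredException so the example compiles without ZooKeeper on the classpath, and the names ZkRetrySketch, retryOperation, maxRetries and baseDelayMs are assumptions made for illustration.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-ins for the two ZooKeeper KeeperException subclasses
// discussed above; real code would catch the ZooKeeper types instead.
class ConnectionLossException extends Exception {}
class SessionExpiredException extends Exception {}

final class ZkRetrySketch {

    /**
     * Runs op, retrying on ConnectionLoss or SessionExpired.
     * Makes at most maxRetries + 1 attempts, sleeping a little longer
     * before each retry so a reconnect has time to complete.
     */
    static <T> T retryOperation(Callable<T> op, int maxRetries, long baseDelayMs)
            throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (ConnectionLossException | SessionExpiredException e) {
                // Both cases are treated the same way: remember the failure
                // and try again unless the retry budget is exhausted.
                last = e;
                if (attempt < maxRetries) {
                    Thread.sleep(baseDelayMs * (attempt + 1));
                }
            }
        }
        throw last; // every attempt failed with a retryable exception
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated operation: the first call hits an expired session,
        // the second call succeeds after the (simulated) reconnect.
        String state = retryOperation(() -> {
            if (calls[0]++ == 0) {
                throw new SessionExpiredException();
            }
            return "active";
        }, 3, 10L);
        System.out.println(state + " after " + calls[0] + " attempts");
    }
}
```

The sketch only shows the control flow. Whether a retried operation can actually succeed after a session expiry depends on the client having obtained a fresh connection by then, which is the open question raised above about the stale connection handle.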