John Gallagher created SOLR-11932:
-------------------------------------
Summary: ZkCmdExecutor: Retry ZkOperation on SessionExpired
Key: SOLR-11932
URL: https://issues.apache.org/jira/browse/SOLR-11932
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: 7.2
Reporter: John Gallagher
Attachments: SessionExpiredLog.txt, zk_retry.patch
We are seeing situations where an operation, such as changing a replica's state
to active after a recovery, fails because the ZooKeeper session has expired.
However, these operations appear to be retryable, because the
ZookeeperConnect receives an event that the session expired and tries to
reconnect.
That makes the SessionExpired scenario look very similar to the ConnectionLoss
scenario, so it seems the ZkCmdExecutor could handle both in the same way.
Here's an example stack trace with some slight redactions:
[^SessionExpiredLog.txt]
In this case, a ZooKeeper operation (a read) failed with a SessionExpired
event, which seems retryable. The exception kicked off a reconnection, but it
seems the subsequent operation (publishing as active) failed. Perhaps it was
using a stale connection handle at that point?
Regardless, the watch mechanism that reestablishes the connection on
SessionExpired seems sufficient to let the ZkCmdExecutor retry that operation
later with a reasonable chance of succeeding.
I have included a simple patch we are trying that catches both exceptions
instead of just ConnectionLossException: [^zk_retry.patch]
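For readers without the attachment, the idea can be sketched roughly as
follows. This is not the contents of zk_retry.patch and not the actual
ZkCmdExecutor code. It is a minimal, self-contained illustration in which the
class ZkRetrySketch, its two nested exception classes, and the parameters
retryCount and retryDelayMs are all hypothetical stand-ins (the nested
exceptions stand in for ZooKeeper's KeeperException.ConnectionLossException
and KeeperException.SessionExpiredException). It also assumes the operation
picks up the current, reconnected client on every attempt, since a ZooKeeper
handle whose session has expired cannot be reused.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: retry an operation on either connection loss or
// session expiry. None of these names come from Solr or from the patch.
final class ZkRetrySketch {

    // Stand-in for KeeperException.ConnectionLossException (assumption).
    static class ConnectionLossException extends Exception {}

    // Stand-in for KeeperException.SessionExpiredException (assumption).
    static class SessionExpiredException extends Exception {}

    static <T> T retryOperation(Callable<T> operation, int retryCount, long retryDelayMs)
            throws Exception {
        if (retryCount < 1) {
            throw new IllegalArgumentException("retryCount must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < retryCount; attempt++) {
            try {
                return operation.call();
            } catch (ConnectionLossException | SessionExpiredException e) {
                // Both failures are treated as transient: remember the
                // exception, back off, and try again.
                last = e;
                if (Thread.currentThread().isInterrupted()) {
                    throw new InterruptedException();
                }
                if (attempt != retryCount - 1) {
                    // Linear backoff: wait longer after each failed attempt.
                    Thread.sleep((attempt + 1) * retryDelayMs);
                }
            }
        }
        // Every attempt failed; surface the most recent failure.
        throw last;
    }
}
```

Any other exception propagates immediately, so only the two failure modes
discussed above are retried. Whether a retry after session expiry is safe
still depends on the operation, because ephemeral nodes and watches tied to
the expired session are gone by the time the retry runs.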
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)