Varun Thacker created SOLR-11590:
------------------------------------
Summary: Synchronize ZK connect/disconnect handling
Key: SOLR-11590
URL: https://issues.apache.org/jira/browse/SOLR-11590
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Reporter: Varun Thacker
Priority: Major
Here are two disconnect/re-connect sequences from the logs:
{code}
1. 2017-10-31T08:34:23.106-0700 Watcher
org.apache.solr.common.cloud.ConnectionManager@1579ca20
name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
state:Disconnected type:None path:null path:null type:None
2. 2017-10-31T08:34:23.106-0700 zkClient has disconnected
3. 2017-10-31T08:34:23.107-0700 Watcher
org.apache.solr.common.cloud.ConnectionManager@1579ca20
name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
state:SyncConnected type:None path:null path:null type:None
{code}
{code}
1. 2017-10-31T08:36:46.541-0700 Watcher
org.apache.solr.common.cloud.ConnectionManager@1579ca20
name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
state:Disconnected type:None path:null path:null type:None
2. 2017-10-31T08:36:46.549-0700 Watcher
org.apache.solr.common.cloud.ConnectionManager@1579ca20
name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
state:SyncConnected type:None path:null path:null type:None
3. 2017-10-31T08:36:46.563-0700 zkClient has disconnected
{code}
In the first disconnect the sequence is: receive the disconnect watcher event,
execute the disconnect code, execute the connect code.
In the second disconnect the sequence is: receive the disconnect watcher event,
execute the connect code, execute the disconnect code.
In the second sequence of events, if the JVM hosts leader replicas, all
updates start failing with "Cannot talk to ZooKeeper - Updates are disabled."
This starts exactly 27 seconds later (the zk client timeout is 30s, and 90%
of 30 = 27, which is when the code thinks the session is likely expired). There
are no leadership changes, since there was no session expiry. Unless you restart
the node, all updates to the system continue to fail.
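The 27-second figure can be reproduced with a small sketch. This is an illustration only: the class and method names below are hypothetical and are not Solr's actual code; it just assumes the "likely expired" threshold is 90% of the zk client timeout, as described above.

```java
// Hypothetical illustration, not Solr's implementation.
final class LikelyExpiredSketch {

    // Assumed threshold: 90% of the zk client timeout, in milliseconds.
    static long thresholdMs(long zkClientTimeoutMs) {
        return zkClientTimeoutMs * 90 / 100;
    }

    // True once more than the threshold has elapsed since the disconnect was
    // recorded. If a stale disconnect is recorded after the reconnect and never
    // cleared, this eventually returns true even though the session is healthy.
    static boolean isLikelyExpired(long disconnectedAtMs, long nowMs, long zkClientTimeoutMs) {
        return nowMs - disconnectedAtMs > thresholdMs(zkClientTimeoutMs);
    }
}
```

With a 30s timeout, `thresholdMs(30000)` is 27000, so a check made just over 27 seconds after the recorded disconnect reports the session as likely expired.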
These log lines are from Solr 5.3, where the WatchedEvent was still being
logged at INFO.
Based on the log ordering, we process the connect code first and then the
disconnect code, i.e. out of order. The connection is active but the flag is
not set, and hence after 27 seconds {{zkCheck}} starts complaining that the
connection is likely expired.
A related Jira is SOLR-5721
ZK gives us ordered watch events
(https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees)
but from what I understand Solr can still process them out of order. We
could take a lock and synchronize {{ConnectionManager#connected}} and
{{ConnectionManager#disconnected}}.
Would that be the right approach to take?
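For discussion, here is a minimal sketch of what that locking could look like. It is not a patch against {{ConnectionManager}}: the class, the enum and the method names are hypothetical stand-ins, and it assumes events reach `process` in the order ZK delivers them. A lock only gives mutual exclusion, so it prevents the two handlers from interleaving but does not by itself reorder events that arrive late.

```java
// Hypothetical stand-in for the proposal, not Solr's ConnectionManager.
final class SynchronizedConnectionSketch {

    // Simplified stand-in for the ZooKeeper keeper states seen in the logs.
    enum State { DISCONNECTED, SYNC_CONNECTED }

    private final Object lock = new Object();
    private boolean connected = false;

    // Apply each state transition fully under one lock, so a connect and a
    // disconnect can never run concurrently or partially overlap.
    void process(State state) {
        synchronized (lock) {
            if (state == State.SYNC_CONNECTED) {
                connected();
            } else {
                disconnected();
            }
        }
    }

    // Both helpers are only ever called while holding the lock.
    private void connected() {
        connected = true;
    }

    private void disconnected() {
        connected = false;
    }

    boolean isConnected() {
        synchronized (lock) {
            return connected;
        }
    }
}
```

In this sketch, calling `process(State.DISCONNECTED)` followed by `process(State.SYNC_CONNECTED)` leaves `isConnected()` returning true, which is the outcome the second log sequence fails to reach.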