[
https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Varun Thacker reassigned SOLR-11590:
------------------------------------
Resolution: Fixed
Assignee: Varun Thacker (was: Noble Paul)
Fix Version/s: 7.2
Thanks Scott and Noble!
> Synchronize ZK connect/disconnect handling
> ------------------------------------------
>
> Key: SOLR-11590
> URL: https://issues.apache.org/jira/browse/SOLR-11590
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Varun Thacker
> Assignee: Varun Thacker
> Fix For: 7.2
>
> Attachments: SOLR-11590.patch, SOLR-11590.patch
>
>
> Here are the logs for two disconnect/re-connect sequences:
> {code}
> 1. 2017-10-31T08:34:23.106-0700 Watcher
> org.apache.solr.common.cloud.ConnectionManager@1579ca20
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
> state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:34:23.106-0700 zkClient has disconnected
> 3. 2017-10-31T08:34:23.107-0700 Watcher
> org.apache.solr.common.cloud.ConnectionManager@1579ca20
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
> state:SyncConnected type:None path:null path:null type:None
> {code}
> {code}
> 1. 2017-10-31T08:36:46.541-0700 Watcher
> org.apache.solr.common.cloud.ConnectionManager@1579ca20
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
> state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:36:46.549-0700 Watcher
> org.apache.solr.common.cloud.ConnectionManager@1579ca20
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
> state:SyncConnected type:None path:null path:null type:None
> 3. 2017-10-31T08:36:46.563-0700 zkClient has disconnected
> {code}
> In the first disconnect the sequence is: receive the disconnect watch event,
> execute the disconnect code, execute the connect code.
> In the second disconnect the sequence is: receive the disconnect watch event,
> execute the connect code, execute the disconnect code.
> In the second sequence of events, if the JVM hosts leader replicas then all
> updates start failing with "Cannot talk to ZooKeeper - Updates are disabled."
> This starts happening exactly 27 seconds later (the ZK client timeout is 30s,
> and 90% of 30 = 27, which is when the code thinks the session is likely
> expired). There are no leadership changes, since there was no session expiry.
> Unless you restart the node, all updates to the system continue to fail.
> These log lines are from Solr 5.3, where the WatchedEvent was still being
> logged at INFO.
> Based on the log ordering, we process the connect code and then the
> disconnect code, i.e. out of order. The connection is active but the flag is
> not set, and hence after 27 seconds {{zkCheck}} starts complaining that the
> connection is likely expired.
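To make the 27-second failure concrete, here is a minimal Java sketch of the state involved. It is not Solr's actual code: the class name, the fields, and the caller-supplied clock values are hypothetical, and it only assumes what the description states (a connected flag, a 30s client timeout, and a check that fires once the flag has been unset for more than 90% of that timeout).

```java
/** Hypothetical, simplified model of the connected flag; not Solr's ConnectionManager. */
class LikelyExpiredSketch {
    private final long timeoutMs;
    private boolean connected = true;
    private long lastDisconnectMs;

    LikelyExpiredSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Connect handler: marks the client as connected again.
    void onConnected() {
        connected = true;
    }

    // Disconnect handler: clears the flag and remembers when that happened.
    // The time is passed in by the caller so the example is deterministic.
    void onDisconnected(long nowMs) {
        connected = false;
        lastDisconnectMs = nowMs;
    }

    // True once the flag has been unset for more than 90% of the timeout.
    boolean isLikelyExpired(long nowMs) {
        return !connected && (nowMs - lastDisconnectMs) > (timeoutMs * 9 / 10);
    }

    public static void main(String[] args) {
        // First sequence from the logs: disconnect code, then connect code.
        LikelyExpiredSketch inOrder = new LikelyExpiredSketch(30_000L);
        inOrder.onDisconnected(0L);
        inOrder.onConnected();
        System.out.println("in order, likely expired: " + inOrder.isLikelyExpired(60_000L));

        // Second sequence: connect code runs first, disconnect code runs last,
        // so the flag stays unset although the connection is active.
        LikelyExpiredSketch outOfOrder = new LikelyExpiredSketch(30_000L);
        outOfOrder.onConnected();
        outOfOrder.onDisconnected(0L);
        System.out.println("out of order, likely expired: " + outOfOrder.isLikelyExpired(27_001L));
    }
}
```

In this model nothing ever sets the flag back after the out-of-order sequence, which matches the observation that updates keep failing until the node is restarted.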
> A related Jira is SOLR-5721
> ZK gives us ordered watch events
> (https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees),
> but from what I understand Solr can still process them out of order. We
> could take a lock and synchronize {{ConnectionManager#connected}} and
> {{ConnectionManager#disconnected}}.
> Would that be the right approach to take?
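The locking idea proposed above could look roughly like the following Java sketch. It is an illustration under assumptions, not the attached patch and not Solr's ConnectionManager: the class, its fields, and the isConsistent helper are hypothetical, and only the method names connected and disconnected come from the description.

```java
/** Hypothetical sketch of lock-protected connect/disconnect handlers. */
class SynchronizedConnectionState {
    private boolean connected = true;
    private long lastDisconnectTick = -1L; // -1 means "not currently disconnected"

    // Both handlers take the same monitor, so each state transition
    // (flag plus timestamp) is applied as one atomic step.
    synchronized void connected() {
        connected = true;
        lastDisconnectTick = -1L;
    }

    synchronized void disconnected(long tick) {
        connected = false;
        lastDisconnectTick = tick;
    }

    synchronized boolean isConnected() {
        return connected;
    }

    // Reads both fields under the lock and checks that they agree.
    synchronized boolean isConsistent() {
        return connected == (lastDisconnectTick == -1L);
    }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedConnectionState state = new SynchronizedConnectionState();
        // Two threads hammer the handlers concurrently.
        Thread connects = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) state.connected();
        });
        Thread disconnects = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) state.disconnected(i + 1L);
        });
        connects.start();
        disconnects.start();
        connects.join();
        disconnects.join();
        System.out.println("consistent: " + state.isConsistent());
    }
}
```

A lock like this keeps the two handlers from interleaving. Whether it is sufficient on its own depends on how the watch events reach the handlers: if they are handed to different threads, the lock does not by itself decide which handler runs first.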
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)