[ https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Varun Thacker updated SOLR-11590:
---------------------------------
    Fix Version/s: master (8.0)

> Synchronize ZK connect/disconnect handling
> ------------------------------------------
>
>                 Key: SOLR-11590
>                 URL: https://issues.apache.org/jira/browse/SOLR-11590
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>             Fix For: 7.2, master (8.0)
>
>         Attachments: SOLR-11590.patch, SOLR-11590.patch
>
>
> Here is a sequence of 2 disconnects and re-connects:
> {code}
> 1. 2017-10-31T08:34:23.106-0700 Watcher
> org.apache.solr.common.cloud.ConnectionManager@1579ca20
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
> state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:34:23.106-0700 zkClient has disconnected
> 3. 2017-10-31T08:34:23.107-0700 Watcher
> org.apache.solr.common.cloud.ConnectionManager@1579ca20
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
> state:SyncConnected type:None path:null path:null type:None
> {code}
> {code}
> 1. 2017-10-31T08:36:46.541-0700 Watcher
> org.apache.solr.common.cloud.ConnectionManager@1579ca20
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
> state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:36:46.549-0700 Watcher
> org.apache.solr.common.cloud.ConnectionManager@1579ca20
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent
> state:SyncConnected type:None path:null path:null type:None
> 3. 2017-10-31T08:36:46.563-0700 zkClient has disconnected
> {code}
> In the first disconnect the sequence is - get disconnect watcher, execute
> disconnect code, execute connect code.
> In the second disconnect the sequence is - get disconnect watcher, execute
> connect code, execute disconnect code.
> In the second sequence of events, if the JVM has leader replicas then all
> updates start failing with "Cannot talk to ZooKeeper - Updates are disabled."
> This starts happening exactly after 27 seconds (zk client timeout is 30s,
> 90% of 30 = 27 - the point at which the code thinks the session is likely
> expired). There are no leadership changes, since there was no session expiry.
> Unless you restart the node, all updates to the system continue to fail.
> These log lines are from Solr 5.3, where the WatchedEvent was still being
> logged at INFO.
> Based on the log ordering, we process the connect code first and the
> disconnect code second, i.e. out of order. The connection is active but the
> flag is not set, and hence after 27 seconds {{zkCheck}} starts complaining
> that the connection is likely expired.
> A related Jira is SOLR-5721.
> ZK gives us ordered watch events (
> https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees
> ) but from what I understand Solr can still process them out of order. We
> could take a lock and synchronize {{ConnectionManager#connected}} and
> {{ConnectionManager#disconnected}}.
> Would that be the right approach to take?
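
To make the failure mode and the proposed locking concrete, here is a minimal Java sketch. It is not the code from the attached patches and not Solr's actual {{ConnectionManager}}: the class {{ConnectionStateSketch}}, its {{State}} enum, and the millisecond clock passed in as a parameter are all hypothetical names chosen for illustration. It assumes that each event is applied under a single lock by the thread that received it, so the flag always reflects the most recently delivered event, and that the "likely expired" threshold is 90% of the client timeout as described in the issue.

```java
/** Hypothetical sketch; names do not correspond to Solr's actual ConnectionManager code. */
class ConnectionStateSketch {
    /** Simplified stand-in for the two ZooKeeper states seen in the logs. */
    enum State { DISCONNECTED, SYNC_CONNECTED }

    private final long likelyExpiredAfterMs;
    private boolean connected;      // starts out false until the first connect event
    private long lastDisconnectMs;

    ConnectionStateSketch(long zkClientTimeoutMs) {
        // 90% of the client timeout: 30000 ms gives 27000 ms.
        this.likelyExpiredAfterMs = zkClientTimeoutMs * 9 / 10;
    }

    /**
     * Applies one event while holding the lock. Assumption: the caller is the
     * thread that received the event, so events are applied in delivery order.
     */
    synchronized void process(State state, long nowMs) {
        if (state == State.SYNC_CONNECTED) {
            connected = true;
        } else {
            connected = false;
            lastDisconnectMs = nowMs;
        }
    }

    /** True once the client has been flagged as disconnected for longer than the threshold. */
    synchronized boolean isLikelyExpired(long nowMs) {
        return !connected && nowMs - lastDisconnectMs > likelyExpiredAfterMs;
    }

    long likelyExpiredAfterMs() {
        return likelyExpiredAfterMs;
    }
}
```

Replaying Disconnected followed by SyncConnected through {{process}} leaves the flag set, so the expiry check stays quiet. Applying the same two events in the opposite order, as in the second log excerpt, leaves the flag cleared even though the connection is live, and the check starts reporting a likely expiry once 27 seconds have passed. The lock only guarantees that the two handlers do not interleave; preserving the delivery order still depends on where the handlers are invoked.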