[ 
https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Thacker updated SOLR-11590:
---------------------------------
    Fix Version/s: master (8.0)

> Synchronize ZK connect/disconnect handling
> ------------------------------------------
>
>                 Key: SOLR-11590
>                 URL: https://issues.apache.org/jira/browse/SOLR-11590
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>             Fix For: 7.2, master (8.0)
>
>         Attachments: SOLR-11590.patch, SOLR-11590.patch
>
>
> Here is a sequence of 2 disconnects and re-connects
> {code}
> 1. 2017-10-31T08:34:23.106-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:34:23.106-0700 zkClient has disconnected
> 3. 2017-10-31T08:34:23.107-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:SyncConnected type:None path:null path:null type:None
> {code}
> {code}
> 1. 2017-10-31T08:36:46.541-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:36:46.549-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:SyncConnected type:None path:null path:null type:None
> 2. 2017-10-31T08:36:46.563-0700 zkClient has disconnected
> {code}
> In the first disconnect the sequence is -  get disconnect watcher, execute 
> disconnect code, execute connect code
> In the second disconnect the sequence is - get disconnect watcher, execute 
> connect code, execute disconnect code
> In the second sequence of events, if the JVM has leader replicas then all 
> updates start failing with "Cannot talk to ZooKeeper - Updates are disabled." 
> . This starts happening exactly after 27 seconds ( zk client timeout is 30s , 
> 90% of 30 = 27 - when the code thinks the session is likely expired). No 
> leadership changes since there was no session expiry. Unless you restart the 
> node all updates to the system continue to fail.
> These log lines correspond are from Solr 5.3 hence where the WatchedEvent was 
> still being logged as INFO
> We process the connect code and then process the disconnect code out of order 
> based on the log ordering. The connection is active but the flag is not set 
> and hence after 27 seconds {{zkCheck}} starts complaining that the connection 
> is likely expired
> A related Jira is SOLR-5721
> ZK gives us ordered watch events ( 
> https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees
>  ) but from what I understand Solr can still process them out of order. We 
> could take a lock and synchronize {{ConnectionManager#connected}} and 
> {{ConnectionManager#disconnected}} . 
> Would that be the right approach to take?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to