[jira] [Commented] (SOLR-11590) Synchronize ZK connect/disconnect handling
    [ https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279288#comment-16279288 ]

ASF subversion and git services commented on SOLR-11590:
---------------------------------------------------------

Commit 5c10ec49af582d83422266b7357f0b50023b939b in lucene-solr's branch refs/heads/branch_7x from [~varunthacker]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5c10ec4 ]

SOLR-11590: Synchronize ZK connect/disconnect handling so that they are processed in linear order

(cherry picked from commit 2c14b91)

> Synchronize ZK connect/disconnect handling
> ------------------------------------------
>
>                 Key: SOLR-11590
>                 URL: https://issues.apache.org/jira/browse/SOLR-11590
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Varun Thacker
>            Assignee: Noble Paul
>         Attachments: SOLR-11590.patch, SOLR-11590.patch
>
>
> Here is a sequence of 2 disconnects and re-connects:
> {code}
> 1. 2017-10-31T08:34:23.106-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:34:23.106-0700 zkClient has disconnected
> 3. 2017-10-31T08:34:23.107-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
> {code}
> {code}
> 1. 2017-10-31T08:36:46.541-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:36:46.549-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
> 3. 2017-10-31T08:36:46.563-0700 zkClient has disconnected
> {code}
> In the first disconnect the sequence is: get disconnect watcher, execute disconnect code, execute connect code.
> In the second disconnect the sequence is: get disconnect watcher, execute connect code, execute disconnect code.
> In the second sequence of events, if the JVM has leader replicas then all updates start failing with "Cannot talk to ZooKeeper - Updates are disabled."
> This starts happening exactly after 27 seconds (the zk client timeout is 30s, and 90% of 30 = 27, which is when the code thinks the session is likely expired).
> There are no leadership changes, since there was no session expiry. Unless you restart the node, all updates to the system continue to fail.
> These log lines are from Solr 5.3, where the WatchedEvent was still being logged at INFO.
> Based on the log ordering, we process the connect code and then the disconnect code, i.e. out of order.
> The connection is active but the flag is not set, and hence after 27 seconds {{zkCheck}} starts complaining that the connection is likely expired.
> A related Jira is SOLR-5721.
> ZK gives us ordered watch events ( https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees ), but from what I understand Solr can still process them out of order.
> We could take a lock and synchronize {{ConnectionManager#connected}} and {{ConnectionManager#disconnected}}.
> Would that be the right approach to take?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
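To make the failure mode in the description concrete, here is a minimal sketch of how a stale "connected" flag combined with a 90%-of-timeout check could produce the 27-second symptom. The class and method names (ConnectionFlagSketch, likelyExpiredThresholdMs, isLikelyExpired) are hypothetical and deliberately simplified; this is not Solr's ConnectionManager code, and the clock is passed in explicitly so the example is deterministic.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified stand-in for the connection flag; not Solr's ConnectionManager. */
class ConnectionFlagSketch {
    private final long clientTimeoutMs;
    private boolean connected = true;
    private long lastDisconnectNanos;

    ConnectionFlagSketch(long clientTimeoutMs) {
        this.clientTimeoutMs = clientTimeoutMs;
    }

    /** Connect handler: marks the connection as active. */
    void connected() {
        connected = true;
    }

    /** Disconnect handler: clears the flag and remembers when the disconnect was handled. */
    void disconnected(long nowNanos) {
        connected = false;
        lastDisconnectNanos = nowNanos;
    }

    /** 90% of the client timeout, e.g. 27000 ms for a 30000 ms timeout. */
    long likelyExpiredThresholdMs() {
        return clientTimeoutMs * 9 / 10;
    }

    /** True once the flag has said "disconnected" for longer than the threshold. */
    boolean isLikelyExpired(long nowNanos) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - lastDisconnectNanos);
        return !connected && elapsedMs > likelyExpiredThresholdMs();
    }

    public static void main(String[] args) {
        ConnectionFlagSketch flag = new ConnectionFlagSketch(30_000);
        // Handlers run in the order of the second log excerpt:
        // connect code first, disconnect code last.
        flag.connected();
        flag.disconnected(0L);
        System.out.println(flag.likelyExpiredThresholdMs());                    // 27000
        System.out.println(flag.isLikelyExpired(TimeUnit.SECONDS.toNanos(26))); // false
        System.out.println(flag.isLikelyExpired(TimeUnit.SECONDS.toNanos(28))); // true
    }
}
```

Because the disconnect handler runs last, nothing ever sets the flag back, so in this sketch the check stays true indefinitely once the threshold has passed, which matches the "updates fail until restart" behaviour reported in the description.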
[jira] [Commented] (SOLR-11590) Synchronize ZK connect/disconnect handling
    [ https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279286#comment-16279286 ]

ASF subversion and git services commented on SOLR-11590:
---------------------------------------------------------

Commit 2c14b91418b45c42aba98ea2e612e9c0a53a0948 in lucene-solr's branch refs/heads/master from [~varunthacker]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2c14b91 ]

SOLR-11590: Synchronize ZK connect/disconnect handling so that they are processed in linear order
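One way to get the "linear order" that the commit message asks for is to hand connection-state events to a single worker thread, so handlers run one at a time in submission order. The sketch below uses only java.util.concurrent; LinearOrderSketch and handleInOrder are hypothetical names, the events are plain strings rather than ZooKeeper WatchedEvent objects, and this is not claimed to be what the commit itself does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: serialize connection-state handlers on one worker thread. */
class LinearOrderSketch {

    /** Runs one handler per event on a single thread and returns the order in which they ran. */
    static List<String> handleInOrder(List<String> events) {
        List<String> handled = Collections.synchronizedList(new ArrayList<>());
        // A single-thread executor runs tasks sequentially, in the order they were submitted.
        ExecutorService single = Executors.newSingleThreadExecutor();
        try {
            for (String event : events) {
                single.execute(() -> handled.add(event));
            }
            single.shutdown();
            if (!single.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("handlers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            single.shutdownNow();
        }
        return new ArrayList<>(handled);
    }

    public static void main(String[] args) {
        // Prints [Disconnected, SyncConnected]
        System.out.println(handleInOrder(Arrays.asList("Disconnected", "SyncConnected")));
    }
}
```

The design point is that ordering comes from the queue in front of the single worker. A lock on the handlers by itself only prevents them from overlapping; it does not decide which of two waiting threads enters first.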
[jira] [Commented] (SOLR-11590) Synchronize ZK connect/disconnect handling
    [ https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238276#comment-16238276 ]

Scott Blum commented on SOLR-11590:
-----------------------------------

LGTM
[jira] [Commented] (SOLR-11590) Synchronize ZK connect/disconnect handling
    [ https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236419#comment-16236419 ]

Varun Thacker commented on SOLR-11590:
--------------------------------------

SOLR-6261 is another Jira that's relevant here. We added a thread pool to execute the watch event callbacks.
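As an illustration of why a thread pool for watch callbacks matters here, the sketch below shows that a pool with more than one worker does not promise completion in submission order. PoolReorderSketch and handleOnTwoThreads are hypothetical names, and the latch is only there to force, deterministically, the unlucky scheduling that would otherwise happen at random; it is not a model of Solr's actual executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: callbacks submitted in order to a 2-thread pool can finish out of order. */
class PoolReorderSketch {

    /** Submits "Disconnected" then "SyncConnected" and returns the order in which they were handled. */
    static List<String> handleOnTwoThreads() {
        List<String> handled = new CopyOnWriteArrayList<>();
        CountDownLatch connectHandled = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Submitted first, but stalls until the later callback has run
            // (stands in for a worker thread that gets scheduled late).
            pool.execute(() -> {
                try {
                    connectHandled.await();
                    handled.add("Disconnected");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Submitted second, runs immediately on the other worker thread.
            pool.execute(() -> {
                handled.add("SyncConnected");
                connectHandled.countDown();
            });
            pool.shutdown();
            if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("callbacks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return new ArrayList<>(handled);
    }

    public static void main(String[] args) {
        // Prints [SyncConnected, Disconnected]
        System.out.println(handleOnTwoThreads());
    }
}
```

The events arrive in the right order but are handled in the wrong one, which is the same shape as the second log excerpt in the issue description.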