[jira] [Commented] (SOLR-11590) Synchronize ZK connect/disconnect handling

2017-12-05 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279288#comment-16279288
 ] 

ASF subversion and git services commented on SOLR-11590:


Commit 5c10ec49af582d83422266b7357f0b50023b939b in lucene-solr's branch 
refs/heads/branch_7x from [~varunthacker]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5c10ec4 ]

SOLR-11590: Synchronize ZK connect/disconnect handling so that they are 
processed in linear order

(cherry picked from commit 2c14b91)


> Synchronize ZK connect/disconnect handling
> --
>
> Key: SOLR-11590
> URL: https://issues.apache.org/jira/browse/SOLR-11590
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Noble Paul
> Attachments: SOLR-11590.patch, SOLR-11590.patch
>
>
> Here is a sequence of 2 disconnects and re-connects
> {code}
> 1. 2017-10-31T08:34:23.106-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:34:23.106-0700 zkClient has disconnected
> 3. 2017-10-31T08:34:23.107-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:SyncConnected type:None path:null path:null type:None
> {code}
> {code}
> 1. 2017-10-31T08:36:46.541-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:36:46.549-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:SyncConnected type:None path:null path:null type:None
> 3. 2017-10-31T08:36:46.563-0700 zkClient has disconnected
> {code}
> In the first disconnect the sequence is: receive the disconnect event,
> execute the disconnect code, execute the connect code.
> In the second disconnect the sequence is: receive the disconnect event,
> execute the connect code, execute the disconnect code.
> In the second sequence of events, if the JVM has leader replicas then all
> updates start failing with "Cannot talk to ZooKeeper - Updates are disabled."
> This starts happening exactly 27 seconds later (the zk client timeout is 30s,
> and 90% of 30 = 27, which is when the code thinks the session is likely
> expired). There are no leadership changes since there was no session expiry.
> Unless you restart the node, all updates to the system continue to fail.
> These log lines are from Solr 5.3, where the WatchedEvent was still being
> logged at INFO.
> Based on the log ordering, we process the connect code and then the
> disconnect code, which is out of order. The connection is active but the
> flag is not set, and hence after 27 seconds {{zkCheck}} starts complaining
> that the connection is likely expired.
> A related Jira is SOLR-5721.
> ZK gives us ordered watch events
> (https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees)
> but from what I understand Solr can still process them out of order. We
> could take a lock and synchronize {{ConnectionManager#connected}} and
> {{ConnectionManager#disconnected}}.
> Would that be the right approach to take?
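To make the failure mode in the description concrete, here is a minimal sketch in Java. It is a hypothetical model, not Solr's actual ConnectionManager: the class name ConnectionFlagModel, its fields, and the explicit nowMs parameter are all assumptions made for illustration. It shows the flag ending up unset when the handlers run in the swapped order, and the "likely expired" check tripping once 90% of a 30s timeout (27s) has passed. Note that the shared lock only keeps the two handlers from interleaving; they still have to be invoked in the order ZK delivered the events.

```java
/** Hypothetical model of the connected flag; not Solr's actual ConnectionManager. */
class ConnectionFlagModel {
    private final long clientTimeoutMs;
    private boolean connected = true;
    private long disconnectedAtMs = 0L;

    ConnectionFlagModel(long clientTimeoutMs) {
        this.clientTimeoutMs = clientTimeoutMs;
    }

    // Both handlers take the instance lock, so they cannot interleave.
    // The lock alone does not decide which handler runs first.
    synchronized void connected() {
        connected = true;
    }

    // nowMs is passed in explicitly (an assumption) so the model is easy to test.
    synchronized void disconnected(long nowMs) {
        connected = false;
        disconnectedAtMs = nowMs;
    }

    // "Likely expired" once the flag has been down for more than 90% of the timeout.
    synchronized boolean isLikelyExpired(long nowMs) {
        return !connected && (nowMs - disconnectedAtMs) > clientTimeoutMs * 9 / 10;
    }
}
```

With a 30000 ms timeout, calling disconnected(0) and then connected() leaves the model healthy indefinitely, while calling connected() and then disconnected(0) makes isLikelyExpired return true for any time past 27000 ms, even though the real connection would be up.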



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11590) Synchronize ZK connect/disconnect handling

2017-12-05 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279286#comment-16279286
 ] 

ASF subversion and git services commented on SOLR-11590:


Commit 2c14b91418b45c42aba98ea2e612e9c0a53a0948 in lucene-solr's branch 
refs/heads/master from [~varunthacker]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2c14b91 ]

SOLR-11590: Synchronize ZK connect/disconnect handling so that they are 
processed in linear order





[jira] [Commented] (SOLR-11590) Synchronize ZK connect/disconnect handling

2017-11-03 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238276#comment-16238276
 ] 

Scott Blum commented on SOLR-11590:
---

LGTM

> Synchronize ZK connect/disconnect handling
> --
>
> Key: SOLR-11590
> URL: https://issues.apache.org/jira/browse/SOLR-11590
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Noble Paul
>Priority: Major
> Attachments: SOLR-11590.patch
>
>



[jira] [Commented] (SOLR-11590) Synchronize ZK connect/disconnect handling

2017-11-02 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236419#comment-16236419
 ] 

Varun Thacker commented on SOLR-11590:
--

SOLR-6261 is another Jira that's relevant here. We added a thread pool to 
execute the watch event callbacks
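As a rough illustration of why handing callbacks to an executor matters for ordering, here is a small Java sketch. It is not Solr code: the class OrderedEventDispatch and its dispatch method are hypothetical, and how Solr's pool is actually configured is not stated in this thread. The point is only that a pool with several workers may run two submitted callbacks concurrently or in either order, while a single-threaded executor runs them in submission order.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not part of Solr. */
class OrderedEventDispatch {
    // Submits one callback per event and returns the order in which they ran.
    static List<String> dispatch(ExecutorService pool, List<String> events) {
        List<String> handled = new CopyOnWriteArrayList<>();
        for (String event : events) {
            pool.submit(() -> handled.add(event));
        }
        pool.shutdown(); // already submitted callbacks still run
        try {
            if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("callbacks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return handled;
    }
}
```

Passing Executors.newSingleThreadExecutor() yields the events in the order they were submitted. Passing a multi-threaded pool gives no such guarantee, which would be consistent with the connect code running before the disconnect code in the second log excerpt.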

> Synchronize ZK connect/disconnect handling
> --
>
> Key: SOLR-11590
> URL: https://issues.apache.org/jira/browse/SOLR-11590
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Priority: Major
>