[jira] [Comment Edited] (ZOOKEEPER-2383) Startup race in ZooKeeperServer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768373#comment-15768373 ] Flavio Junqueira edited comment on ZOOKEEPER-2383 at 12/21/16 10:36 PM: [~rakesh_r] pull request #101 merged fine onto master, but not branch 3.5. I haven't tried branch 3.4. was (Author: fpj): [~rakesh_r] pull request #101 merge fine onto master, but not branch 3.5. I haven't tried branch 3.4. > Startup race in ZooKeeperServer > --- > > Key: ZOOKEEPER-2383 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2383 > Project: ZooKeeper > Issue Type: Bug > Components: jmx, server >Affects Versions: 3.4.8 >Reporter: Steve Rowe >Assignee: Rakesh R >Priority: Blocker > Fix For: 3.4.10, 3.5.3, 3.6.0 > > Attachments: TestZkStandaloneJMXRegistrationRaceConcurrent.java, > ZOOKEEPER-2383-br-3-4.patch, ZOOKEEPER-2383.patch, ZOOKEEPER-2383.patch, > ZOOKEEPER-2383.patch, ZOOKEEPER-2383.patch, > release-3.4.8-extra-logging.patch, zk-3.4.8-MBeanRegistry.log, > zk-3.4.8-NPE.log > > > In attempting to upgrade Solr's ZooKeeper dependency from 3.4.6 to 3.4.8 > (SOLR-8724) I ran into test failures where attempts to create a node in a > newly started standalone ZooKeeperServer were failing because of an assertion > in MBeanRegistry. > ZooKeeperServer.startup() first sets up its request processor chain then > registers itself in JMX, but if a connection comes in before the server's JMX > registration happens, registration of the connection will fail because it > trips the assertion that (effectively) its parent (the server) has already > registered itself. > {code:java|title=ZooKeeperServer.java} > public synchronized void startup() { > if (sessionTracker == null) { > createSessionTracker(); > } > startSessionTracker(); > setupRequestProcessors(); > registerJMX(); > state = State.RUNNING; > notifyAll(); > } > {code} > {code:java|title=MBeanRegistry.java} > public void register(ZKMBeanInfo bean, ZKMBeanInfo parent) > throws JMException > { > assert bean != null; > String path = null; > if (parent != null) { > path = mapBean2Path.get(parent); > assert path != null; > } > {code} > This problem appears to be new with ZK 3.4.8 - AFAIK Solr never had this > issue with ZK 3.4.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (ZOOKEEPER-2383) Startup race in ZooKeeperServer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15499350#comment-15499350 ] Rakesh R edited comment on ZOOKEEPER-2383 at 9/17/16 5:29 PM: -- Thanks [~fpj] for the comments. bq. In NettyServerCnxn, it looks like we are only updating the code for 4lws. Don't we have to also update it here: I think, I have handled this case and the following condition present in [ZOOKEEPER-2383-br-3-4.patch|https://issues.apache.org/jira/secure/attachment/12821043/ZOOKEEPER-2383-br-3-4.patch]. Am I missing any other case apart from the below line of code. {code} @@ -735,7 +735,7 @@ bb.flip(); ZooKeeperServer zks = this.zkServer; -if (zks == null) { +if (zks == null || !zks.isRunning()) { throw new IOException("ZK down"); } if (initialized) { {code} bq. but it never exists as the client keeps trying to connect. It sounds like some thread is hanging and not letting the test framework exit. Without patch it fails at {{simplezks.waitForStartupInvocation(10)}}. On the other side {{SimpleZooKeeperServer#startup}} thread is waiting at startupDelayLatch.await() and this is causing an indefinite wait. How about adding the countdown logic in finally so that it will proceed: {code} try { Assert.assertFalse( "Should fail to create zk client session as server is not fully started", simplezks.waitForSessionCreation(10)); } finally { LOG.info( "Decrements the count of the latch, so that server will proceed with startup"); startupDelayLatch.countDown(); } {code} bq. This sentence doesn't make much sense to me: Since zk server is not started createsession method to be invoked I will modify this to "Should fail to create zk client session as server is not fully started" bq. Please reduce the test case timeout to no longer than 30s. Agreed. FYI, apart from the above, I had fixed [Michael's comment|https://issues.apache.org/jira/browse/ZOOKEEPER-2383?focusedCommentId=15409673&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15409673] and [Raul's comment|https://issues.apache.org/jira/browse/ZOOKEEPER-2383?focusedCommentId=15408826&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15408826] in trunk patch, need to rebase branch-3-4 patch to incorporate these comments. was (Author: rakeshr): Thanks [~fpj] for the comments. bq. In NettyServerCnxn, it looks like we are only updating the code for 4lws. Don't we have to also update it here: I think, I have handled this case and the following condition present in [ZOOKEEPER-2383-br-3-4.patch|https://issues.apache.org/jira/secure/attachment/12821043/ZOOKEEPER-2383-br-3-4.patch]. Am I missing any other case apart from the below line of code. {code} @@ -735,7 +735,7 @@ bb.flip(); ZooKeeperServer zks = this.zkServer; -if (zks == null) { +if (zks == null || !zks.isRunning()) { throw new IOException("ZK down"); } if (initialized) { {code} bq. but it never exists as the client keeps trying to connect. It sounds like some thread is hanging and not letting the test framework exit. Without patch it fails at {{simplezks.waitForStartupInvocation(10)}}. On the other side {{SimpleZooKeeperServer#startup}} thread is waiting at startupDelayLatch.await() and this is causing an indefinite wait. How about adding a timed out like below: {code} startupDelayLatch.await(15, TimeUnit.SECONDS); {code} bq. This sentence doesn't make much sense to me: Since zk server is not started createsession method to be invoked I will modify this to "Should fail to create zk client session as server is not fully started" bq. Please reduce the test case timeout to no longer than 30s. Agreed. FYI, apart from the above, I had fixed [Michael's comment|https://issues.apache.org/jira/browse/ZOOKEEPER-2383?focusedCommentId=15409673&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15409673] and [Raul's comment|https://issues.apache.org/jira/browse/ZOOKEEPER-2383?focusedCommentId=15408826&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15408826] in trunk patch, need to rebase branch-3-4 patch to incorporate these comments. > Startup race in ZooKeeperServer > --- > > Key: ZOOKEEPER-2383 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2383 > Project: ZooKeeper >
[jira] [Comment Edited] (ZOOKEEPER-2383) Startup race in ZooKeeperServer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15499350#comment-15499350 ] Rakesh R edited comment on ZOOKEEPER-2383 at 9/17/16 5:21 PM: -- Thanks [~fpj] for the comments. bq. In NettyServerCnxn, it looks like we are only updating the code for 4lws. Don't we have to also update it here: I think, I have handled this case and the following condition present in [ZOOKEEPER-2383-br-3-4.patch|https://issues.apache.org/jira/secure/attachment/12821043/ZOOKEEPER-2383-br-3-4.patch]. Am I missing any other case apart from the below line of code. {code} @@ -735,7 +735,7 @@ bb.flip(); ZooKeeperServer zks = this.zkServer; -if (zks == null) { +if (zks == null || !zks.isRunning()) { throw new IOException("ZK down"); } if (initialized) { {code} bq. but it never exists as the client keeps trying to connect. It sounds like some thread is hanging and not letting the test framework exit. Without patch it fails at {{simplezks.waitForStartupInvocation(10)}}. On the other side {{SimpleZooKeeperServer#startup}} thread is waiting at startupDelayLatch.await() and this is causing an indefinite wait. How about adding a timed out like below: {code} startupDelayLatch.await(15, TimeUnit.SECONDS); {code} bq. This sentence doesn't make much sense to me: Since zk server is not started createsession method to be invoked I will modify this to "Should fail to create zk client session as server is not fully started" bq. Please reduce the test case timeout to no longer than 30s. Agreed. FYI, apart from the above, I had fixed [Michael's comment|https://issues.apache.org/jira/browse/ZOOKEEPER-2383?focusedCommentId=15409673&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15409673] and [Raul's comment|https://issues.apache.org/jira/browse/ZOOKEEPER-2383?focusedCommentId=15408826&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15408826] in trunk patch, need to rebase branch-3-4 patch to incorporate these comments. was (Author: rakeshr): Thanks [~fpj] for the comments. bq. In NettyServerCnxn, it looks like we are only updating the code for 4lws. Don't we have to also update it here: I think, I have handled this case and the following condition present in [ZOOKEEPER-2383-br-3-4.patch|https://issues.apache.org/jira/secure/attachment/12821043/ZOOKEEPER-2383-br-3-4.patch]. Am I missing any other case apart from the below line of code. {code} @@ -735,7 +735,7 @@ bb.flip(); ZooKeeperServer zks = this.zkServer; -if (zks == null) { +if (zks == null || !zks.isRunning()) { throw new IOException("ZK down"); } if (initialized) { {code} bq. but it never exists as the client keeps trying to connect. It sounds like some thread is hanging and not letting the test framework exit. Without patch it fails at {{simplezks.waitForStartupInvocation(10)}}. On the other side {{SimpleZooKeeperServer#startup}} thread is waiting at startupDelayLatch.await() and this is causing an indefinite wait. How about adding a timed out like below: {code} startupDelayLatch.await(15, TimeUnit.SECONDS); {code} bq. This sentence doesn't make much sense to me: Since zk server is not started createsession method to be invoked I will modify this to "Failed to establish ZooKeeper client connection" bq. Please reduce the test case timeout to no longer than 30s. Agreed. FYI, apart from the above, I had fixed [Michael's comment|https://issues.apache.org/jira/browse/ZOOKEEPER-2383?focusedCommentId=15409673&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15409673] and [Raul's comment|https://issues.apache.org/jira/browse/ZOOKEEPER-2383?focusedCommentId=15408826&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15408826] in trunk patch, need to rebase branch-3-4 patch to incorporate these comments. > Startup race in ZooKeeperServer > --- > > Key: ZOOKEEPER-2383 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2383 > Project: ZooKeeper > Issue Type: Bug > Components: jmx, server >Affects Versions: 3.4.8 >Reporter: Steve Rowe >Assignee: Rakesh R >Priority: Blocker > Fix For: 3.4.10, 3.5.3, 3.6.0 > > Attachments: TestZkStandaloneJMXRegistrationRaceConcurrent.java, > ZOOKEEPER-2383-br-3-4.patch, ZOOKEEPER-2383.patch, ZOOKEEPER-2383.patch, > ZOOKEEPER-2383.patch, relea
[jira] [Comment Edited] (ZOOKEEPER-2383) Startup race in ZooKeeperServer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409828#comment-15409828 ] Rakesh R edited comment on ZOOKEEPER-2383 at 8/5/16 6:10 PM: - Thanks [~hanm] for the reviews. It looks like {{ZKUtil.java}} has ZooKeeper client related operations. Attached new patch, where I've tried keeping {{isZKServerRunning()}} in {{AbstractFourLetterCommand.java}}. Does this sound good to you? was (Author: rakeshr): Thanks [~hanm] for the reviews. Attached new patch, where I've tried keeping {{isZKServerRunning()}} in {{AbstractFourLetterCommand.java}}. Does this sound good to you? > Startup race in ZooKeeperServer > --- > > Key: ZOOKEEPER-2383 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2383 > Project: ZooKeeper > Issue Type: Bug > Components: jmx, server >Affects Versions: 3.4.8 >Reporter: Steve Rowe >Assignee: Rakesh R >Priority: Blocker > Fix For: 3.4.9, 3.5.3, 3.6.0 > > Attachments: TestZkStandaloneJMXRegistrationRaceConcurrent.java, > ZOOKEEPER-2383-br-3-4.patch, ZOOKEEPER-2383.patch, ZOOKEEPER-2383.patch, > ZOOKEEPER-2383.patch, release-3.4.8-extra-logging.patch, > zk-3.4.8-MBeanRegistry.log, zk-3.4.8-NPE.log > > > In attempting to upgrade Solr's ZooKeeper dependency from 3.4.6 to 3.4.8 > (SOLR-8724) I ran into test failures where attempts to create a node in a > newly started standalone ZooKeeperServer were failing because of an assertion > in MBeanRegistry. > ZooKeeperServer.startup() first sets up its request processor chain then > registers itself in JMX, but if a connection comes in before the server's JMX > registration happens, registration of the connection will fail because it > trips the assertion that (effectively) its parent (the server) has already > registered itself. > {code:java|title=ZooKeeperServer.java} > public synchronized void startup() { > if (sessionTracker == null) { > createSessionTracker(); > } > startSessionTracker(); > setupRequestProcessors(); > registerJMX(); > state = State.RUNNING; > notifyAll(); > } > {code} > {code:java|title=MBeanRegistry.java} > public void register(ZKMBeanInfo bean, ZKMBeanInfo parent) > throws JMException > { > assert bean != null; > String path = null; > if (parent != null) { > path = mapBean2Path.get(parent); > assert path != null; > } > {code} > This problem appears to be new with ZK 3.4.8 - AFAIK Solr never had this > issue with ZK 3.4.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (ZOOKEEPER-2383) Startup race in ZooKeeperServer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185323#comment-15185323 ] Flavio Junqueira edited comment on ZOOKEEPER-2383 at 3/8/16 5:39 PM: - [~steve_rowe] Thanks for reporting this issue. According to git blame, the latest changes around the startup method in ZooKeeperServer are due to ZOOKEEPER-1907, which actually turned out to be quite problematic, so this could be another issue due to that patch, I'm not sure. {noformat} 91f579e4 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Hongchao Deng 2015-08-17 20:52:07 + 411) public synchronized void startup() { 55b03fce src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Mahadev Konar 2012-01-31 06:50:06 + 412) if (sessionTracker == null) { 55b03fce src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Mahadev Konar 2012-01-31 06:50:06 + 413) createSessionTracker(); 55b03fce src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Mahadev Konar 2012-01-31 06:50:06 + 414) } 55b03fce src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Mahadev Konar 2012-01-31 06:50:06 + 415) startSessionTracker(); 097b7979 zookeeper/java/src/com/yahoo/zookeeper/server/ZooKeeperServer.java (Benjamin Reed 2008-05-12 23:01:25 + 416) setupRequestProcessors(); 87e1e030 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Patrick D. Hunt2009-01-15 22:57:14 + 417) 87e1e030 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Patrick D. Hunt2009-01-15 22:57:14 + 418) registerJMX(); 87e1e030 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Patrick D. Hunt2009-01-15 22:57:14 + 419) 91f579e4 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Hongchao Deng 2015-08-17 20:52:07 + 420) state = State.RUNNING; 91f579e4 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Hongchao Deng 2015-08-17 20:52:07 + 421) notifyAll(); 097b7979 zookeeper/java/src/com/yahoo/zookeeper/server/ZooKeeperServer.java (Benjamin Reed 2008-05-12 23:01:25 + 422) } {noformat} {noformat} commit 91f579e40755de870ed9123c8fd55925517d9aa6 Author: Hongchao Deng Date: Mon Aug 17 20:52:07 2015 + ZOOKEEPER-1907 Improve Thread handling (Rakesh R via hdeng) git-svn-id: https://svn.apache.org/repos/asf/zookeeper/branches/branch-3.4@1696337 13f79535-47bb-0310-9956-ffa450edef68 {noformat} [~rakesh_r] could you have a look, please? CC [~rgs] [~phunt] was (Author: fpj): [~steve_rowe] Thanks for reporting this issue. According to git blame, the latest changes around the startup method in ZooKeeperServer is due to ZOOKEEPER-1907, which actually turned out to be quite problematic, so this could be another issue due to that patch, I'm not sure. {noformat} 91f579e4 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Hongchao Deng 2015-08-17 20:52:07 + 411) public synchronized void startup() { 55b03fce src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Mahadev Konar 2012-01-31 06:50:06 + 412) if (sessionTracker == null) { 55b03fce src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Mahadev Konar 2012-01-31 06:50:06 + 413) createSessionTracker(); 55b03fce src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Mahadev Konar 2012-01-31 06:50:06 + 414) } 55b03fce src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Mahadev Konar 2012-01-31 06:50:06 + 415) startSessionTracker(); 097b7979 zookeeper/java/src/com/yahoo/zookeeper/server/ZooKeeperServer.java (Benjamin Reed 2008-05-12 23:01:25 + 416) setupRequestProcessors(); 87e1e030 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Patrick D. Hunt2009-01-15 22:57:14 + 417) 87e1e030 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Patrick D. Hunt2009-01-15 22:57:14 + 418) registerJMX(); 87e1e030 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Patrick D. Hunt2009-01-15 22:57:14 + 419) 91f579e4 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Hongchao Deng 2015-08-17 20:52:07 + 420) state = State.RUNNING; 91f579e4 src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java (Hongchao Deng 2015-08-17 20:52:07 + 421) notifyAll(); 097b7979 zookeeper/java/src/com/yahoo/zookeeper/server/ZooKe