[jira] [Commented] (HBASE-23919) [Flakey Test] Standalone Zookeeper won't start (minizookeepercluster won't come up)

Michael Stack (Jira) Thu, 12 Mar 2020 14:26:12 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058251#comment-17058251
 ]


Michael Stack commented on HBASE-23919:
---------------------------------------

Look at thread dumps from failed startup, I can see that zk is up and waiting 
for traffic to come in. I can see zk client trying to connect to the server.

Here is the server polling (NIOServerCnxnFactory is the host for zk server):

{code}
 Thread 612 (NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50450):
   State: RUNNABLE
   Blocked count: 0
   Waited count: 0
   Stack:
     sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
     sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
     sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
     
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
     java.lang.Thread.run(Thread.java:745)
{code}

... then here is from logs where server config is logged:
{code}
 2020-03-12 09:50:06,245 INFO  [Time-limited test] 
zookeeper.MiniZooKeeperCluster(256): Started connectionTimeout=30000, 
dir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/66bc0d64-9bbf-
 
21d0-aece-af8653f577f6/cluster_5c4e400a-67e0-b80a-7c38-f068d02c6967/zookeeper_0,
 clientPort=50450, 
dataDir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/66bc0d64-9bbf-21d0-aece-
        
af8653f577f6/cluster_5c4e400a-67e0-b80a-7c38-f068d02c6967/zookeeper_0/version-2,
 
dataLogDir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/66bc0d64-9bbf-21d0-aece-af8653f577f6/
          cluster_5c4e400a-67e0-b80a-7c38-f068d02c6967/zookeeper_0/version-2, 
tickTime=2000, maxClientCnxns=300, minSessionTimeout=4000, 
maxSessionTimeout=40000, serverId=0
{code}
... then clients fail trying to connect:
{code}
2020-03-12 09:50:07,251 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:08,260 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:09,268 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:10,276 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:11,286 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:12,295 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:13,300 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:14,308 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:15,314 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:16,317 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:17,322 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read
 2020-03-12 09:50:18,327 DEBUG [Time-limited test] 
zookeeper.MiniZooKeeperCluster(446): Read...
{code}

etc.

This particular fail in TestServerLoadDurability is interesting because it 
parameterized and zk worked fine the first time through. It only failed on 
second time through.

Let me try using loopback....HBASE-23796 ... to see if it helps.

> [Flakey Test] Standalone Zookeeper won't start (minizookeepercluster won't 
> come up)
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-23919
>                 URL: https://issues.apache.org/jira/browse/HBASE-23919
>             Project: HBase
>          Issue Type: Bug
>          Components: flakies
>            Reporter: Michael Stack
>            Priority: Major
>
> I've seen this on occasion across different hardwares; the standalone 
> zookeeper won't come up and then random unit test fails in its junit startup 
> phase as part of launching mini cluster.
> I've been trying to track this w/ a while now adding in logging, using more 
> of zk client instead of the copy/paste we've had a long while now, and adding 
> in logging (I've been running locally w/ zk logging set to INFO).
> It looks like this currently where the last thing out of the server launch is 
> this....
> {code:java}
>   2020-03-02 07:57:46,129 INFO  [Time-limited test] 
> server.ZooKeeperServer(854): maxSessionTimeout set to -1                      
>                                                                               
>                         2020-03-02 07:57:46,139 INFO  [Time-limited test] 
> server.NIOServerCnxnFactory(89): binding to port 0.0.0.0/0.0.0.0:49316        
>                                                                               
>                         2020-03-02 07:57:46,181 INFO  [Time-limited test] 
> zookeeper.MiniZooKeeperCluster(256): Started connectionTimeout=30000, 
> dir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/89d57393-fe97-9200-630f-7843ee406bd2/
>       cluster_6b4d6f67-7978-dc67-a1a3-7b1b0c0e4268/zookeeper_0, 
> clientPort=49316, 
> dataDir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/89d57393-fe97-9200-630f-7843ee406bd2/cluster_6b4d6f67-7978-dc67-a1a3-7b1b0c0e4268/
>  zookeeper_0/version-2, 
> dataLogDir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/89d57393-fe97-9200-630f-7843ee406bd2/cluster_6b4d6f67-7978-dc67-a1a3-7b1b0c0e4268/zookeeper_0/version-2,
>  tickTime=2000,              maxClientCnxns=300, minSessionTimeout=4000, 
> maxSessionTimeout=40000, serverId=0{code}
> ... then the client just does this over and over:
> {code:java}
>  2020-03-02 07:57:46,182 INFO  [Time-limited test] 
> client.FourLetterWordMain(65): connecting to localhost 49316                  
>                                                                               
>                         2020-03-02 07:57:46,213 INFO  [Time-limited test] 
> zookeeper.MiniZooKeeperCluster(453): localhost:49316 not up                   
>                                                                               
>                         java.net.SocketException: Connection reset            
>                                                                               
>                                                                               
>                       at 
> java.net.SocketInputStream.read(SocketInputStream.java:209)                   
>                                                                               
>                                                                        at 
> java.net.SocketInputStream.read(SocketInputStream.java:141)                   
>                                                                               
>                                                                        at 
> sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)                    
>                                                                               
>                                                                        at 
> sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)                     
>                                                                               
>                                                                        at 
> sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)                         
>                                                                               
>                                                                        at 
> java.io.InputStreamReader.read(InputStreamReader.java:184)                    
>                                                                               
>                                                                        at 
> java.io.BufferedReader.fill(BufferedReader.java:161)                          
>                                                                               
>                                                                        at 
> java.io.BufferedReader.readLine(BufferedReader.java:324)                      
>                                                                               
>                                                                        at 
> java.io.BufferedReader.readLine(BufferedReader.java:389)                      
>                                                                               
>                                                                        at 
> org.apache.zookeeper.client.FourLetterWordMain.send4LetterWord(FourLetterWordMain.java:84)
>                                                                               
>                                                            at 
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.waitForServerUp(MiniZooKeeperCluster.java:442)
>                                                                               
>                                                 at 
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.startup(MiniZooKeeperCluster.java:259)
>                                                                               
>                                                         at 
> org.apache.hadoop.hbase.HBaseZKTestingUtility.startMiniZKCluster(HBaseZKTestingUtility.java:130)
>                                                                               
>                                                      at 
> org.apache.hadoop.hbase.HBaseZKTestingUtility.startMiniZKCluster(HBaseZKTestingUtility.java:103)
>                                                                               
>                                                      at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1107)
>                                                                               
>                                                           at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1065)
>                                                                               
>                                                           at 
> org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable.before(TestRegionReplicasWithModifyTable.java:62)
>                                                                               
>                              at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)                   
>                                                                               
>                                                                        at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
>                                                                               
>                                                                        at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>                                                                               
>                                                                 at 
> java.lang.reflect.Method.invoke(Method.java:498)                              
>                                                                               
>                                                                        at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>                                                                               
>                                                                  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>                                                                               
>                                                                   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>                                                                               
>                                                                    at 
> org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
>                                                                               
>                                                                     at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) 
>                                                                               
>                                                                        at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)   
>                                                                               
>                                                                        at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
>                                                                               
>                                                    at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
>                                                                               
>                                                    at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)                      
>                                                                               
>                                                                        at 
> java.lang.Thread.run(Thread.java:745)
>  {code}
>  
> There is ZOOKEEPER-2714 but it is closed as not-reproducible and it looks a 
> little different going by the log emissions in that it registers the client 
> connections.
> Noting this here in case others have seen similar or there are ideas out 
> there for how to fix.
> Here for example is how the failure might look high-level:
> {code:java}
>  
> -------------------------------------------------------------------------------
>  Test set: 
> org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable
>  
> -------------------------------------------------------------------------------
>  Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.017 s <<< 
> FAILURE! - in 
> org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable
>  org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable  Time 
> elapsed: 0.01 s  <<< ERROR!
>  java.io.IOException: Waiting for startup of standalone server; server 
> isRunning=true
>    at 
> org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable.before(TestRegionReplicasWithModifyTable.java:62)
>  {code}
> ... and then when you dig in, the test never even made it past cluster launch 
> in setup. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HBASE-23919) [Flakey Test] Standalone Zookeeper won't start (minizookeepercluster won't come up)

Reply via email to