[
https://issues.apache.org/jira/browse/HBASE-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Stack updated HBASE-23919:
----------------------------------
Component/s: flakies
Description:
I've seen this on occasion across different hardwares; the standalone zookeeper
won't come up and then random unit test fails in its junit startup phase as
part of launching mini cluster.
I've been trying to track this w/ a while now adding in logging, using more of
zk client instead of the copy/paste we've had a long while now, and adding in
logging (I've been running locally w/ zk logging set to INFO).
It looks like this currently where the last thing out of the server launch is
this....
{code:java}
2020-03-02 07:57:46,129 INFO [Time-limited test]
server.ZooKeeperServer(854): maxSessionTimeout set to -1
2020-03-02 07:57:46,139 INFO [Time-limited test]
server.NIOServerCnxnFactory(89): binding to port 0.0.0.0/0.0.0.0:49316
2020-03-02 07:57:46,181 INFO [Time-limited test]
zookeeper.MiniZooKeeperCluster(256): Started connectionTimeout=30000,
dir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/89d57393-fe97-9200-630f-7843ee406bd2/
cluster_6b4d6f67-7978-dc67-a1a3-7b1b0c0e4268/zookeeper_0,
clientPort=49316,
dataDir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/89d57393-fe97-9200-630f-7843ee406bd2/cluster_6b4d6f67-7978-dc67-a1a3-7b1b0c0e4268/
zookeeper_0/version-2,
dataLogDir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/89d57393-fe97-9200-630f-7843ee406bd2/cluster_6b4d6f67-7978-dc67-a1a3-7b1b0c0e4268/zookeeper_0/version-2,
tickTime=2000, maxClientCnxns=300, minSessionTimeout=4000,
maxSessionTimeout=40000, serverId=0{code}
... then the client just does this over and over:
{code:java}
2020-03-02 07:57:46,182 INFO [Time-limited test]
client.FourLetterWordMain(65): connecting to localhost 49316
2020-03-02 07:57:46,213 INFO [Time-limited test]
zookeeper.MiniZooKeeperCluster(453): localhost:49316 not up
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:209)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at
sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at
sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at
sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at
java.io.InputStreamReader.read(InputStreamReader.java:184)
at
java.io.BufferedReader.fill(BufferedReader.java:161)
at
java.io.BufferedReader.readLine(BufferedReader.java:324)
at
java.io.BufferedReader.readLine(BufferedReader.java:389)
at
org.apache.zookeeper.client.FourLetterWordMain.send4LetterWord(FourLetterWordMain.java:84)
at
org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.waitForServerUp(MiniZooKeeperCluster.java:442)
at
org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.startup(MiniZooKeeperCluster.java:259)
at
org.apache.hadoop.hbase.HBaseZKTestingUtility.startMiniZKCluster(HBaseZKTestingUtility.java:130)
at
org.apache.hadoop.hbase.HBaseZKTestingUtility.startMiniZKCluster(HBaseZKTestingUtility.java:103)
at
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1107)
at
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1065)
at
org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable.before(TestRegionReplicasWithModifyTable.java:62)
at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at
java.lang.reflect.Method.invoke(Method.java:498)
at
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at
org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
at
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
at
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
at
java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.lang.Thread.run(Thread.java:745)
{code}
There is ZOOKEEPER-2714 but it is closed as not-reproducible and it looks a
little different going by the log emissions in that it registers the client
connections.
Noting this here in case others have seen similar or there are ideas out there
for how to fix.
Here for example is how the failure might look high-level:
{code:java}
-------------------------------------------------------------------------------
Test set:
org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable
-------------------------------------------------------------------------------
Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.017 s <<<
FAILURE! - in
org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable
org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable Time
elapsed: 0.01 s <<< ERROR!
java.io.IOException: Waiting for startup of standalone server; server
isRunning=true
at
org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable.before(TestRegionReplicasWithModifyTable.java:62)
{code}
... and then when you dig in, the test never even made it past cluster launch
in setup.
> [Flakey Test] Standalone Zookeeper won't start (minizookeepercluster won't
> come up)
> -----------------------------------------------------------------------------------
>
> Key: HBASE-23919
> URL: https://issues.apache.org/jira/browse/HBASE-23919
> Project: HBase
> Issue Type: Bug
> Components: flakies
> Reporter: Michael Stack
> Priority: Major
>
> I've seen this on occasion across different hardwares; the standalone
> zookeeper won't come up and then random unit test fails in its junit startup
> phase as part of launching mini cluster.
> I've been trying to track this w/ a while now adding in logging, using more
> of zk client instead of the copy/paste we've had a long while now, and adding
> in logging (I've been running locally w/ zk logging set to INFO).
> It looks like this currently where the last thing out of the server launch is
> this....
> {code:java}
> 2020-03-02 07:57:46,129 INFO [Time-limited test]
> server.ZooKeeperServer(854): maxSessionTimeout set to -1
>
> 2020-03-02 07:57:46,139 INFO [Time-limited test]
> server.NIOServerCnxnFactory(89): binding to port 0.0.0.0/0.0.0.0:49316
>
> 2020-03-02 07:57:46,181 INFO [Time-limited test]
> zookeeper.MiniZooKeeperCluster(256): Started connectionTimeout=30000,
> dir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/89d57393-fe97-9200-630f-7843ee406bd2/
> cluster_6b4d6f67-7978-dc67-a1a3-7b1b0c0e4268/zookeeper_0,
> clientPort=49316,
> dataDir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/89d57393-fe97-9200-630f-7843ee406bd2/cluster_6b4d6f67-7978-dc67-a1a3-7b1b0c0e4268/
> zookeeper_0/version-2,
> dataLogDir=/Users/stack/checkouts/hbase.git/hbase-server/target/test-data/89d57393-fe97-9200-630f-7843ee406bd2/cluster_6b4d6f67-7978-dc67-a1a3-7b1b0c0e4268/zookeeper_0/version-2,
> tickTime=2000, maxClientCnxns=300, minSessionTimeout=4000,
> maxSessionTimeout=40000, serverId=0{code}
> ... then the client just does this over and over:
> {code:java}
> 2020-03-02 07:57:46,182 INFO [Time-limited test]
> client.FourLetterWordMain(65): connecting to localhost 49316
>
> 2020-03-02 07:57:46,213 INFO [Time-limited test]
> zookeeper.MiniZooKeeperCluster(453): localhost:49316 not up
>
> java.net.SocketException: Connection reset
>
>
> at
> java.net.SocketInputStream.read(SocketInputStream.java:209)
>
> at
> java.net.SocketInputStream.read(SocketInputStream.java:141)
>
> at
> sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>
> at
> sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>
> at
> sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>
> at
> java.io.InputStreamReader.read(InputStreamReader.java:184)
>
> at
> java.io.BufferedReader.fill(BufferedReader.java:161)
>
> at
> java.io.BufferedReader.readLine(BufferedReader.java:324)
>
> at
> java.io.BufferedReader.readLine(BufferedReader.java:389)
>
> at
> org.apache.zookeeper.client.FourLetterWordMain.send4LetterWord(FourLetterWordMain.java:84)
>
> at
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.waitForServerUp(MiniZooKeeperCluster.java:442)
>
> at
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.startup(MiniZooKeeperCluster.java:259)
>
> at
> org.apache.hadoop.hbase.HBaseZKTestingUtility.startMiniZKCluster(HBaseZKTestingUtility.java:130)
>
> at
> org.apache.hadoop.hbase.HBaseZKTestingUtility.startMiniZKCluster(HBaseZKTestingUtility.java:103)
>
> at
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1107)
>
> at
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1065)
>
> at
> org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable.before(TestRegionReplicasWithModifyTable.java:62)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at
> java.lang.reflect.Method.invoke(Method.java:498)
>
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>
> at
> org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
>
> at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>
> at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
>
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
>
> at
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
>
> at
> java.lang.Thread.run(Thread.java:745)
> {code}
>
> There is ZOOKEEPER-2714 but it is closed as not-reproducible and it looks a
> little different going by the log emissions in that it registers the client
> connections.
> Noting this here in case others have seen similar or there are ideas out
> there for how to fix.
> Here for example is how the failure might look high-level:
> {code:java}
>
> -------------------------------------------------------------------------------
> Test set:
> org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable
>
> -------------------------------------------------------------------------------
> Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.017 s <<<
> FAILURE! - in
> org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable
> org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable Time
> elapsed: 0.01 s <<< ERROR!
> java.io.IOException: Waiting for startup of standalone server; server
> isRunning=true
> at
> org.apache.hadoop.hbase.regionserver.TestRegionReplicasWithModifyTable.before(TestRegionReplicasWithModifyTable.java:62)
> {code}
> ... and then when you dig in, the test never even made it past cluster launch
> in setup.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)