Hi,
I ran TestRollingRestart (0.90.3) and it fails occasionally. The failing log
shows that split log runs into an endless loop: recoverFileLease keeps
failing, the while() loop never ends, and the test times out and fails.
Here are some of the logs:
After restarting the primary master, not all the RSs had connected before the
master stopped waiting:
TRR: Restarting primary master
INFO [Master:0;linux1.site:35977] master.ServerManager(660): Waiting on regionserver(s) count to settle; currently=3
2011-07-06 09:12:56,331 INFO [Master:0;linux1.site:35977] master.ServerManager(660): Waiting on regionserver(s) count to settle; currently=3
2011-07-06 09:12:57,831 INFO [Master:0;linux1.site:35977] master.ServerManager(648): Finished waiting for regionserver count to settle; count=3, sleptFor=4500
2011-07-06 09:12:57,831 INFO [Master:0;linux1.site:35977] master.ServerManager(674): Exiting wait on regionserver(s) to checkin; count=3, stopped=false, count of regions out on cluster=22
2011-07-06 09:12:57,834 INFO [Master:0;linux1.site:35977] master.MasterFileSystem(180): Log folder hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108 doesn't belong to a known region server, splitting
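For reference, my understanding of what the master does here: after the wait
ends, it lists the folders under .logs and splits any log whose owner is not
among the checked-in RSs. A rough sketch with illustrative names (not the
real MasterFileSystem code):

import java.io.IOException;
import java.util.Set;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch only; class and method names are my own,
// not the exact 0.90.3 MasterFileSystem code.
class LogSplitSketch {
  void splitLogsOfUnknownServers(FileSystem fs, Path logsDir,
      Set<String> onlineServers) throws IOException {
    for (FileStatus status : fs.listStatus(logsDir)) {
      String serverName = status.getPath().getName();
      if (!onlineServers.contains(serverName)) {
        // The RS that owns this log had not checked in when the wait
        // ended, so the master treats it as dead and splits its log,
        // even though the RS may still be alive and holding the lease.
        System.out.println("Log folder " + status.getPath()
            + " doesn't belong to a known region server, splitting");
        splitLog(status.getPath());
      }
    }
  }

  void splitLog(Path logDir) {
    // placeholder for the HLogSplitter work
  }
}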
But after the master started splitting, another RS connected:
2011-07-06 09:13:54,243 INFO [RegionServer:3;linux1.site,54949,1309914772108] regionserver.HRegionServer(1456): Attempting connect to Master server at linux1.site:35977
2011-07-06 09:13:54,243 INFO [RegionServer:3;linux1.site,54949,1309914772108] regionserver.HRegionServer(1475): Connected to master at linux1.site:35977
Then the split-log lease recovery may hit an AlreadyBeingCreatedException and
log this:
2011-07-06 09:13:57,929 WARN [Master:0;linux1.site:35977] util.FSUtils(715): Waited 60087ms for lease recovery on hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175 for DFSClient_hb_m_linux1.site:35977_1309914773252 on client 127.0.0.1, because this file is already being created by DFSClient_hb_rs_linux1.site,54949,1309914772108_1309914772161 on 127.0.0.1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:1157)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.recoverLease(NameNode.java:404)
    at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
This log keeps repeating for about 14 minutes, and then the test fails.
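The while() I mentioned is the lease-recovery retry loop in FSUtils. A
simplified sketch of its shape (not the exact 0.90.3 source; the real code
unwraps a RemoteException before checking the cause):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException;

// Simplified sketch of the retry loop shape; not the exact FSUtils code.
class LeaseRecoverySketch {
  static void recoverFileLease(FileSystem fs, Path p, Configuration conf)
      throws IOException {
    if (!(fs instanceof DistributedFileSystem)) {
      return;
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    long start = System.currentTimeMillis();
    boolean recovered = false;
    while (!recovered) { // no upper bound: this is the endless loop
      try {
        dfs.recoverLease(p);
        recovered = true;
      } catch (AlreadyBeingCreatedException e) {
        // The RS that checked in late is still alive and still holds
        // the lease, so the NameNode refuses recovery every time and
        // the master only logs how long it has been waiting.
        System.err.println("Waited " + (System.currentTimeMillis() - start)
            + "ms for lease recovery on " + p + ": " + e);
      }
      try {
        Thread.sleep(1000); // retry interval; the value is an assumption
      } catch (InterruptedException ie) {
        throw new IOException(ie.getMessage());
      }
    }
  }
}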
The test fails only occasionally, probably because the master's default wait
time is 4500ms. Usually that is enough for all the RSs to check in, but
sometimes it is not, and an RS that checks in late may disturb the lease
recovery.
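To make this concrete, here is a rough sketch of what I think the settle wait
looks like, matching the sleptFor=4500 in the log above (the config key names
and the 1500ms interval are my assumptions, not verified against the 0.90.3
source):

import java.util.Set;
import org.apache.hadoop.conf.Configuration;

// Rough sketch of the "wait for regionserver count to settle" logic;
// the method name and config keys are assumptions based on the log output.
class SettleWaitSketch {
  static void waitForRegionServers(Configuration conf,
      Set<String> onlineServers) throws InterruptedException {
    long interval =
        conf.getLong("hbase.master.wait.on.regionservers.interval", 1500);
    long timeout =
        conf.getLong("hbase.master.wait.on.regionservers.timeout", 4500);
    long slept = 0;
    int oldCount = 0;
    // Stop as soon as the count has been stable for `timeout` ms. An RS
    // that needs longer than that to check in (here almost a minute) is
    // simply missed, and its log gets split while it still holds the
    // HDFS lease.
    while (slept < timeout) {
      int count = onlineServers.size();
      if (count != oldCount) {
        oldCount = count;
        slept = 0; // count changed, restart the settle window
        System.out.println("Waiting on regionserver(s) count to settle;"
            + " currently=" + count);
      }
      Thread.sleep(interval);
      slept += interval;
    }
    System.out.println("Finished waiting for regionserver count to settle;"
        + " count=" + oldCount + ", sleptFor=" + slept);
  }
}

With 3 RSs already in, this sleeps 1500ms three times and exits with
sleptFor=4500, exactly as in the log, while the fourth RS only connects
about a minute later.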
This may be a bug, and it may be related to HBASE-4177.
Zhou Shuaifeng(Frank)