I will give this higher priority if test fails on Jenkins. FYI
On Wed, Aug 10, 2011 at 7:01 PM, Zhoushuaifeng <[email protected]>wrote: > Thanks Ted, > It's occasionally, testing many times may occur once. > Can you analysis the code and log and check if it's a problem? > Whatever, I will do more test. > > Zhou Shuaifeng(Frank) > > > -----Original Message----- > From: Ted Yu [mailto:[email protected]] > Sent: Thursday, August 11, 2011 9:53 AM > To: [email protected] > Subject: Re: TestRollingRestart fail occasionally > > 0.90.4 was just released. Can you run TestRollingRestart using latest 0.90 > branch ? > > This test didn't fail on > https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/ > > Cheers > > On Wed, Aug 10, 2011 at 6:21 PM, Zhoushuaifeng <[email protected] > >wrote: > > > Is there anyone else do this test and encounter the failure? And what's > > your opinion? > > > > Zhou Shuaifeng(Frank) > > > > -----Original Message----- > > From: Zhoushuaifeng [mailto:[email protected]] > > Sent: Wednesday, August 10, 2011 10:09 AM > > To: [email protected] > > Subject: TestRollingRestart fail occasionally > > > > Hi, > > I run TestRollingRestart(0.90.3), it fails occasionally. The failing log > > shows that split log runs in to a circle, the recoverFileLease fail and > the > > while() never end and the test timeout and fail. > > > > Here are some of the logs: > > After restarting primary master, not all the RSs connected before the > > master stop waiting: > > TRR: Restarting primary master > > INFO [Master:0;linux1.site:35977] master.ServerManager(660): Waiting on > > regionserver(s) count to settle; currently=3 > > 2011-07-06 09:12:56,331 INFO [Master:0;linux1.site:35977] > > master.ServerManager(660): Waiting on regionserver(s) count to settle; > > currently=3 > > 2011-07-06 09:12:57,831 INFO [Master:0;linux1.site:35977] > > master.ServerManager(648): Finished waiting for regionserver count to > > settle; count=3, sleptFor=4500 > > 2011-07-06 09:12:57,831 INFO [Master:0;linux1.site:35977] > > master.ServerManager(674): Exiting wait on regionserver(s) to checkin; > > count=3, stopped=false, count of regions out on cluster=22 > > 2011-07-06 09:12:57,834 INFO [Master:0;linux1.site:35977] > > master.MasterFileSystem(180): Log folder > > hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108 > > doesn't belong to a known region server, splitting > > > > But after master starting split, another RS connected: > > 2011-07-06 09:13:54,243 INFO > > [RegionServer:3;linux1.site,54949,1309914772108] > > regionserver.HRegionServer(1456): Attempting connect to Master server at > > linux1.site:35977 > > 2011-07-06 09:13:54,243 INFO > > [RegionServer:3;linux1.site,54949,1309914772108] > > regionserver.HRegionServer(1475): Connected to master at > linux1.site:35977 > > > > Then, split log recover lease may encounter AlreadyBeingCreatedException > > and show this log: > > 2011-07-06 09:13:57,929 WARN [Master:0;linux1.site:35977] > > util.FSUtils(715): Waited 60087ms for lease recovery on > > > hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: > > failed to create file > > > /user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175 > > for DFSClient_hb_m_linux1.site:35977_1309914773252 on client 127.0.0.1, > > because this file is already being created by > > DFSClient_hb_rs_linux1.site,54949,1309914772108_1309914772161 on > 127.0.0.1 > > at > > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202) > > at > > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:1157) > > at > > > org.apache.hadoop.hdfs.server.namenode.NameNode.recoverLease(NameNode.java:404) > > at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown > > Source) > > at > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > > at java.lang.reflect.Method.invoke(Method.java:597) > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) > > at > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961) > > at > > org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957) > > at java.security.AccessController.doPrivileged(Native > > Method) > > at javax.security.auth.Subject.doAs(Subject.java:396) > > at > org.apache.hadoop.ipc.Server$Handler.run(Server.java:955) > > > > This log shows continuing about 14 minutes and test fail. > > > > This test fail occasionally may because the master default waiting time > is > > 4500ms, usually it's enouth for all the RS to check in, but some times > it's > > not, and the RS check in later may disturb the recover lease. > > This may be a bug, And may have some relation to HBASE-4177. > > > > > > Zhou Shuaifeng(Frank) > > > > > > >
