Re: TestRollingRestart fail occasionally

Ted Yu Thu, 11 Aug 2011 09:13:00 -0700

I will give this higher priority if the test fails on Jenkins.

FYI


On Wed, Aug 10, 2011 at 7:01 PM, Zhoushuaifeng <[email protected]> wrote:

> Thanks Ted,
> The failure is occasional; it may occur once in many test runs.
> Can you analyze the code and log and check whether it's a problem?
> In any case, I will do more testing.
>
> Zhou Shuaifeng(Frank)
>
>
> -----Original Message-----
> From: Ted Yu [mailto:[email protected]]
> Sent: Thursday, August 11, 2011 9:53 AM
> To: [email protected]
> Subject: Re: TestRollingRestart fail occasionally
>
> 0.90.4 was just released. Can you run TestRollingRestart using the latest 0.90
> branch?
>
> This test didn't fail on
> https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/
>
> Cheers
>
> On Wed, Aug 10, 2011 at 6:21 PM, Zhoushuaifeng <[email protected]> wrote:
>
> > Has anyone else run this test and encountered the failure? What's
> > your opinion?
> >
> > Zhou Shuaifeng(Frank)
> >
> > -----Original Message-----
> > From: Zhoushuaifeng [mailto:[email protected]]
> > Sent: Wednesday, August 10, 2011 10:09 AM
> > To: [email protected]
> > Subject: TestRollingRestart fail occasionally
> >
> > Hi,
> > I ran TestRollingRestart (0.90.3) and it fails occasionally. The failure log
> > shows that log splitting gets stuck in a loop: recoverFileLease fails, the
> > while() loop never ends, and the test times out and fails.
> >
> > Here are some of the logs:
> > After restarting the primary master, not all the RSs had connected before
> > the master stopped waiting:
> > TRR: Restarting primary master
> > INFO  [Master:0;linux1.site:35977] master.ServerManager(660): Waiting on
> > regionserver(s) count to settle; currently=3
> > 2011-07-06 09:12:56,331 INFO  [Master:0;linux1.site:35977]
> > master.ServerManager(660): Waiting on regionserver(s) count to settle;
> > currently=3
> > 2011-07-06 09:12:57,831 INFO  [Master:0;linux1.site:35977]
> > master.ServerManager(648): Finished waiting for regionserver count to
> > settle; count=3, sleptFor=4500
> > 2011-07-06 09:12:57,831 INFO  [Master:0;linux1.site:35977]
> > master.ServerManager(674): Exiting wait on regionserver(s) to checkin;
> > count=3, stopped=false, count of regions out on cluster=22
> > 2011-07-06 09:12:57,834 INFO  [Master:0;linux1.site:35977]
> > master.MasterFileSystem(180): Log folder
> > hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108
> > doesn't belong to a known region server, splitting
> >
> > But after the master started splitting, another RS connected:
> > 2011-07-06 09:13:54,243 INFO
> >  [RegionServer:3;linux1.site,54949,1309914772108]
> > regionserver.HRegionServer(1456): Attempting connect to Master server at
> > linux1.site:35977
> > 2011-07-06 09:13:54,243 INFO
> >  [RegionServer:3;linux1.site,54949,1309914772108]
> > regionserver.HRegionServer(1475): Connected to master at
> > linux1.site:35977
> >
> > Then lease recovery during log splitting may encounter
> > AlreadyBeingCreatedException and log this:
> > 2011-07-06 09:13:57,929 WARN  [Master:0;linux1.site:35977]
> > util.FSUtils(715): Waited 60087ms for lease recovery on
> > hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> > failed to create file
> > /user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175
> > for DFSClient_hb_m_linux1.site:35977_1309914773252 on client 127.0.0.1,
> > because this file is already being created by
> > DFSClient_hb_rs_linux1.site,54949,1309914772108_1309914772161 on 127.0.0.1
> >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
> >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:1157)
> >         at org.apache.hadoop.hdfs.server.namenode.NameNode.recoverLease(NameNode.java:404)
> >         at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> >         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
> >         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
> >
> > This log message keeps repeating for about 14 minutes, and then the test
> > fails.
> >
> > The occasional failure may be because the master's default wait time is
> > 4500ms. Usually that's enough for all the RSs to check in, but sometimes
> > it's not, and an RS that checks in later may disturb the lease recovery.
> > This may be a bug, and it may be related to HBASE-4177.
> >
> >
> > Zhou Shuaifeng(Frank)
> >
> >
> >
>
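
For reference, the never-ending while() described in the quoted report could in principle be bounded by a deadline. The following is only a minimal sketch of that idea, not the actual FSUtils code: the class BoundedLeaseRecovery, the method recoverWithDeadline, and its parameters are hypothetical names, and the BooleanSupplier merely stands in for a single lease-recovery attempt.

```java
import java.util.function.BooleanSupplier;

public class BoundedLeaseRecovery {

    /**
     * Hypothetical helper: retries one recovery attempt until it succeeds
     * or the deadline passes. Returns true on success, false on timeout
     * or if the calling thread is interrupted.
     */
    public static boolean recoverWithDeadline(BooleanSupplier attempt,
                                              long timeoutMs,
                                              long pauseMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            // One attempt; in the real code this would be the lease-recovery call.
            if (attempt.getAsBoolean()) {
                return true;
            }
            // Give up instead of looping forever once the deadline has passed.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pauseMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and stop retrying.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // An attempt that never succeeds, like a lease still held by a live RS.
        boolean recovered = recoverWithDeadline(() -> false, 50, 5);
        System.out.println("recovered=" + recovered);
    }
}
```

With a bound like this, the caller gets a definite failure to act on (for example, skipping the split for a server that turns out to be alive) rather than a test that hangs until its timeout.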
