If you update the hadoop that hbase ships with to the tip of
branch-0.20-append, does it fail then?  The tip has HDFS-1554,
whereas the hadoop that hbase ships with does not.

St.Ack

On Tue, Aug 9, 2011 at 7:09 PM, Zhoushuaifeng <[email protected]> wrote:
> Hi,
> I ran TestRollingRestart (0.90.3) and it fails occasionally.  The failure log
> shows that log splitting gets stuck in a loop: recoverFileLease keeps
> failing, the while() loop never exits, and the test times out and fails.
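>
> The loop in question is FSUtils.recoverFileLease.  A simplified paraphrase
> of the 0.90.x code, not verbatim -- the real code invokes recoverLease via
> reflection and unwraps the RemoteException before checking its type:
>
> import java.io.IOException;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
> import org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException;
>
> public class LeaseRecoverySketch {
>   static void recoverFileLease(DistributedFileSystem dfs, Path p)
>       throws IOException, InterruptedException {
>     boolean recovered = false;
>     while (!recovered) {       // no upper bound: spins until recovery succeeds
>       try {
>         dfs.recoverLease(p);   // only exists on the append branches
>         recovered = true;
>       } catch (AlreadyBeingCreatedException e) {
>         // Expected while the old holder's lease is inside its 60s soft
>         // limit; past that, FSUtils(715) logs the "Waited Nms for lease
>         // recovery" warning quoted below.
>         Thread.sleep(1000);    // retry forever; only the test's own timeout stops it
>       }
>     }
>   }
> }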
>
> Here are some of the logs:
> After restarting the primary master, not all of the RSs had connected before
> the master stopped waiting:
> TRR: Restarting primary master
> INFO  [Master:0;linux1.site:35977] master.ServerManager(660): Waiting on 
> regionserver(s) count to settle; currently=3
> 2011-07-06 09:12:56,331 INFO  [Master:0;linux1.site:35977] 
> master.ServerManager(660): Waiting on regionserver(s) count to settle; 
> currently=3
> 2011-07-06 09:12:57,831 INFO  [Master:0;linux1.site:35977] 
> master.ServerManager(648): Finished waiting for regionserver count to settle; 
> count=3, sleptFor=4500
> 2011-07-06 09:12:57,831 INFO  [Master:0;linux1.site:35977] 
> master.ServerManager(674): Exiting wait on regionserver(s) to checkin; 
> count=3, stopped=false, count of regions out on cluster=22
> 2011-07-06 09:12:57,834 INFO  [Master:0;linux1.site:35977] 
> master.MasterFileSystem(180): Log folder 
> hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108 
> doesn't belong to a known region server, splitting
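>
> (As far as I can tell, the settle wait above is ServerManager's
> waitForRegionServers.  A rough sketch, assuming the 0.90
> hbase.master.wait.on.regionservers.* keys and defaults; the real loop also
> watches whether the count is still changing, and countOfRegionServers() is
> a stand-in for the real accessor:)
>
> class SettleWaitSketch {
>   static void waitForRegionServersToSettle(org.apache.hadoop.conf.Configuration conf)
>       throws InterruptedException {
>     long interval = conf.getLong("hbase.master.wait.on.regionservers.interval", 1500);
>     long timeout = conf.getLong("hbase.master.wait.on.regionservers.timeout", 4500);
>     long slept = 0;
>     int count = countOfRegionServers();
>     while (slept < timeout) {
>       // "Waiting on regionserver(s) count to settle; currently=" + count
>       Thread.sleep(interval);   // with the defaults: three 1500ms naps -> sleptFor=4500
>       slept += interval;
>       count = countOfRegionServers();
>     }
>     // "Finished waiting for regionserver count to settle; count=..., sleptFor=" + slept
>     // Any RS that checks in after this point owns a .logs directory the master
>     // no longer recognizes, hence "doesn't belong to a known region server, splitting".
>   }
>
>   // stub for the sketch; the real master tracks this from RS check-ins
>   static int countOfRegionServers() { return 3; }
> }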
>
> But after the master had started splitting, another RS connected:
> 2011-07-06 09:13:54,243 INFO  
> [RegionServer:3;linux1.site,54949,1309914772108] 
> regionserver.HRegionServer(1456): Attempting connect to Master server at 
> linux1.site:35977
> 2011-07-06 09:13:54,243 INFO  
> [RegionServer:3;linux1.site,54949,1309914772108] 
> regionserver.HRegionServer(1475): Connected to master at linux1.site:35977
>
> Then the split-log lease recovery may hit an AlreadyBeingCreatedException
> and log this:
> 2011-07-06 09:13:57,929 WARN  [Master:0;linux1.site:35977] util.FSUtils(715): 
> Waited 60087ms for lease recovery on 
> hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
>  failed to create file 
> /user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175
>  for DFSClient_hb_m_linux1.site:35977_1309914773252 on client 127.0.0.1, 
> because this file is already being created by 
> DFSClient_hb_rs_linux1.site,54949,1309914772108_1309914772161 on 127.0.0.1
>                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
>                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:1157)
>                at org.apache.hadoop.hdfs.server.namenode.NameNode.recoverLease(NameNode.java:404)
>                at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>                at java.lang.reflect.Method.invoke(Method.java:597)
>                at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>                at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
>                at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
>                at java.security.AccessController.doPrivileged(Native Method)
>                at javax.security.auth.Subject.doAs(Subject.java:396)
>                at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
>
> This log repeats for about 14 minutes, and then the test fails.
>
> The test may fail occasionally because the master's default settle timeout
> is 4500ms.  Usually that is enough for all of the RSs to check in, but
> sometimes it is not, and an RS that checks in late can disturb the lease
> recovery.
> This may be a bug, and it may be related to HBASE-4177.
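>
> If it helps, a band-aid on the test side might be to raise that settle
> timeout so a slow RS has longer to check in (key name assuming 0.90's
> ServerManager; this only narrows the race, it does not fix it):
>
> // hypothetical test-setup tweak, not a fix for the underlying race
> conf.setLong("hbase.master.wait.on.regionservers.timeout", 30 * 1000);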
>
>
> Zhou Shuaifeng(Frank)
>
>
>
