[
https://issues.apache.org/jira/browse/HBASE-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Purtell resolved HBASE-6144.
-----------------------------------
Resolution: Cannot Reproduce
Release Note: (was: Underlying hadoop is 0.22)
Reopen if still an issue with current code
> Master mistakenly splits live server's HLog file
> ------------------------------------------------
>
> Key: HBASE-6144
> URL: https://issues.apache.org/jira/browse/HBASE-6144
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.92.0
> Reporter: Ted Yu
>
> RS abcdn0590 is live, but the Master does not have it in its online server
> list, so the Master put up its hlog for splitting, as shown in the Master
> log below:
> {code}
> 2012-05-17 21:43:57,692 INFO org.apache.hadoop.hbase.master.SplitLogManager:
> task
> /hbase/splitlog/hdfs%3A%2F%2Fnamenode.xyz.com%2Fhbase%2F.logs%2Fabcdn0590.xyz.com%2C60020%2C1337315957185-splitting%2Fabcdn0590.xyz.com%252C60020%252C1337315957185.1337315957711
> acquired by abcdn0770.xyz.com,60020,1337315956278.
> {code}
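The task name in the log line above looks like the percent-encoded HLog path. That is an inference from the log itself, not something confirmed against the HBase source. A minimal Python sketch that undoes one level of encoding (`decode_task_name` is a hypothetical helper name):

```python
from urllib.parse import unquote

# Task name copied from the Master log above (without the /hbase/splitlog/ prefix).
task = (
    "hdfs%3A%2F%2Fnamenode.xyz.com%2Fhbase%2F.logs%2F"
    "abcdn0590.xyz.com%2C60020%2C1337315957185-splitting%2F"
    "abcdn0590.xyz.com%252C60020%252C1337315957185.1337315957711"
)


def decode_task_name(name: str) -> str:
    """Undo one level of percent-encoding to recover the HLog path."""
    return unquote(name)


hlog_path = decode_task_name(task)
print(hlog_path)
```

Only one pass of decoding is applied, so the `%2C` sequences in the file name itself survive, which matches the file name shown in the namenode log.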
> After splitting succeeded, Master deleted the file:
> {code}
> 2012-05-17 21:43:58,721 DEBUG
> org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted
> /hbase/splitlog/hdfs%3A%2F%2Fnamenode.xyz.com%2Fhbase%2F.logs%2Fabcdn0590.xyz.com%2C60020%2C1337315957185-splitting%2Fabcdn0590.xyz.com%252C60020%252C1337315957185.1337315957711
> {code}
> RS abcdn0590 lost the lease to RS abcdn0770 and tried to roll its log, which
> closes the current hlog and creates a new one, as shown in the namenode log:
> {code}
> 2012-05-17 21:43:58,422 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> commitBlockSynchronization(newblock=blk_2867982016684075739_12741027,
> file=/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting/abcdn0590.xyz.com%2C60020%2C1337315957185.1337315957711,
> newgenerationstamp=12911920, newlength=134, newtargets=[10.115.13.24:50010,
> 10.115.15.46:50010, 10.115.15.23:50010]) successful
> 2012-05-17 21:43:59,883 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/abcdn0590.xyz.com%2C60020%2C1337315957185.1337316238882.
> blk_3811725326431482476_12913541{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[10.115.13.24:50010|RBW],
> ReplicaUnderConstruction[10.115.17.18:50010|RBW],
> ReplicaUnderConstruction[10.115.17.15:50010|RBW]]}
> {code}
>
> When RS abcdn0590 tried to close the old hlog 1337315957711, it received the
> fatal error below because the original hlog had already been deleted. The
> fatal error later caused RS abcdn0590 to shut itself down.
> {code}
> 2012-05-17 21:43:58,889 ERROR org.apache.hadoop.hbase.master.HMaster: Region
> server ^@^@abcdn0590.xyz.com,60020,1337315957185 reported a fatal error:
> ABORTING region server abcdn0590.xyz.com,60020,1337315957185: IOE in log
> roller
> Cause:
> java.io.FileNotFoundException: File does not exist:
> hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/abcdn0590.xyz.com%2C60020%2C1337315957185.1337315957711
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:742)
> at
> org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:583)
> at
> org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
> {code}
>
> RS abcdn0590 shut down at around 21:44, but it left two subfolders in the
> /hbase/.logs dir for RS abcdn0590 with the same startcode 1337315957185:
> · /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting/
> · /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/
>
> Later on, at around 21:46:30, the Master retried log splitting. It still
> considered RS abcdn0590 a dead RS and tried to put up its hlog for others
> to grab and split. It found the folder
> /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/, and its first step was
> to rename it by adding the suffix -splitting. However, a folder with that
> name already existed. The rename function does not handle the case where the
> destination folder already exists; instead, it puts the src folder under the
> dst folder, so the path structure looks like dst/src/file.
> In our case, it is
> /hbase/.logs.20120518.1204/abcdn0590.xyz.com,60020,1337315957185-splitting/abcdn0590.xyz.com,60020,1337315957185/abcdn0590.xyz.com%2C60020%2C1337315957185.1337316238882.
>
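The rename behavior described above can be sketched with a toy model. It assumes, as the report states, that renaming onto an existing directory nests src under dst. `ToyFs` and `mark_for_splitting` are hypothetical names for illustration only; this is an in-memory stand-in, not the Hadoop FileSystem API or the Master's actual code.

```python
import posixpath

SPLITTING_SUFFIX = "-splitting"


def _move(path, src, dst):
    """Rewrite `path` if it is `src` or lives underneath `src`."""
    if path == src:
        return dst
    if path.startswith(src + "/"):
        return dst + path[len(src):]
    return path


class ToyFs:
    """In-memory stand-in for a file system (assumption: NOT the Hadoop API)."""

    def __init__(self, dirs=(), files=()):
        self.dirs = set(dirs)
        self.files = set(files)

    def exists(self, path):
        return path in self.dirs or path in self.files

    def rename(self, src, dst):
        # Reported behavior: if dst is an existing directory,
        # src is moved underneath it, giving dst/src/file.
        if dst in self.dirs:
            dst = posixpath.join(dst, posixpath.basename(src))
        self.dirs = {_move(d, src, dst) for d in self.dirs}
        self.files = {_move(f, src, dst) for f in self.files}


def mark_for_splitting(fs, log_dir):
    """Hypothetical guard: refuse to rename if the target already exists."""
    target = log_dir + SPLITTING_SUFFIX
    if fs.exists(target):
        raise FileExistsError(target)
    fs.rename(log_dir, target)
    return target


base = "/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185"
hlog = base + "/abcdn0590.xyz.com%2C60020%2C1337315957185.1337316238882"

# Unguarded rename reproduces the nested dst/src/file layout.
fs = ToyFs(dirs={base, base + SPLITTING_SUFFIX}, files={hlog})
fs.rename(base, base + SPLITTING_SUFFIX)
nested = sorted(fs.files)[0]
print(nested)
```

Calling `mark_for_splitting` on the same starting state raises `FileExistsError` instead of nesting the folder. Whether the Master should fail, merge the folders, or pick another name in that case is a design choice this sketch does not settle.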
> This is from the Master log; we can see that two folders exist for the same
> RS abcdn0590 with the same startcode:
> {code}
> 2012-05-17 21:46:30,749 INFO org.apache.hadoop.hbase.master.MasterFileSystem:
> Log folder
> hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1329941607395-splitting
> doesn't belong to a known region server, splitting
> 2012-05-17 21:46:30,749 INFO org.apache.hadoop.hbase.master.MasterFileSystem:
> Log folder
> hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185
> doesn't belong to a known region server, splitting
> 2012-05-17 21:46:30,749 INFO org.apache.hadoop.hbase.master.MasterFileSystem:
> Log folder
> hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting
> doesn't belong to a known region server, splitting
>
> 2012-05-17 21:46:30,962 DEBUG
> org.apache.hadoop.hbase.master.MasterFileSystem: Renamed region directory:
> hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting
> {code}
>
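The duplicate-folder condition visible in the Master log above can be detected mechanically. A minimal Python sketch, assuming log folder names have the form `host,port,startcode` with an optional `-splitting` suffix, as in the paths quoted in this report (`parse_log_dir` and `servers_with_both_dirs` are hypothetical helper names):

```python
from collections import defaultdict

SPLITTING_SUFFIX = "-splitting"


def parse_log_dir(name):
    """Split 'host,port,startcode[-splitting]' into (server_name, is_splitting)."""
    is_splitting = name.endswith(SPLITTING_SUFFIX)
    server = name[:-len(SPLITTING_SUFFIX)] if is_splitting else name
    return server, is_splitting


def servers_with_both_dirs(dir_names):
    """Return server names that have both a plain and a -splitting folder."""
    seen = defaultdict(set)
    for name in dir_names:
        server, is_splitting = parse_log_dir(name)
        seen[server].add(is_splitting)
    return sorted(s for s, flags in seen.items() if flags == {True, False})


# Folder names taken from the Master log at 21:46:30.
dirs = [
    "abcdn0590.xyz.com,60020,1329941607395-splitting",
    "abcdn0590.xyz.com,60020,1337315957185",
    "abcdn0590.xyz.com,60020,1337315957185-splitting",
]
conflicts = servers_with_both_dirs(dirs)
print(conflicts)
```

Only the server with startcode 1337315957185 is flagged; the older 1329941607395 folder exists only in its -splitting form.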