[ https://issues.apache.org/jira/browse/HBASE-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289227#comment-13289227 ]
Devaraj Das commented on HBASE-6153:
------------------------------------
The problem seems to be a side effect of the issue reported in HBASE-6152.
Here is what I see in the master logs when I grep for the region
8974506aa04c5a04e5cc23c11de0039d:
{noformat}
2012-05-31 18:19:36,707 DEBUG
org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Deleting region
TestLoadAndVerify_1338488017181,\x15\xD9\x01\x00\x00\x00\x00\x00/000087_0,1338491364569.8974506aa04c5a04e5cc23c11de0039d.
from META and FS
2012-05-31 18:19:36,709 INFO org.apache.hadoop.hbase.catalog.MetaEditor:
Deleted region
TestLoadAndVerify_1338488017181,\x15\xD9\x01\x00\x00\x00\x00\x00/000087_0,1338491364569.8974506aa04c5a04e5cc23c11de0039d.
from META
org.apache.hadoop.hbase.DroppedSnapshotException: region:
TestLoadAndVerify_1338488017181,\x15\xD9\x01\x00\x00\x00\x00\x00/000087_0,1338491364569.8974506aa04c5a04e5cc23c11de0039d.
Caused by: java.io.FileNotFoundException: File does not exist:
/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/f1/294a7a31f04949b8bf07682a43157b35
2012-05-31 18:35:15,969 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: moved
hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/splitlog/ip-10-190-174-211.ec2.internal,60020,1338419750242_hdfs%3A%2F%2Fip-10-140-14-134.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-68-7-146.ec2.internal%2C60020%2C1338343120038-splitting%2Fip-10-68-7-146.ec2.internal%252C60020%252C1338343120038.1338500685987/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/recovered.edits/0000000000017630961
=>
hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/recovered.edits/0000000000017630961
2012-05-31 18:36:03,813 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: moved
hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/splitlog/ip-10-190-38-165.ec2.internal,60020,1338343120042_hdfs%3A%2F%2Fip-10-140-14-134.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-68-7-146.ec2.internal%2C60020%2C1338343120038-splitting%2Fip-10-68-7-146.ec2.internal%252C60020%252C1338343120038.1338500629081/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/recovered.edits/0000000000017616838
=>
hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/recovered.edits/0000000000017616838
2012-05-31 18:36:05,426 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: moved
hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/splitlog/ip-10-190-174-211.ec2.internal,60020,1338419750242_hdfs%3A%2F%2Fip-10-140-14-134.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-68-7-146.ec2.internal%2C60020%2C1338343120038-splitting%2Fip-10-68-7-146.ec2.internal%252C60020%252C1338343120038.1338500749335/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/recovered.edits/0000000000017645273
=>
hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/recovered.edits/0000000000017645273
2012-05-31 18:36:57,930 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: moved
hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/splitlog/ip-10-190-174-211.ec2.internal,60020,1338419750242_hdfs%3A%2F%2Fip-10-140-14-134.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-68-7-146.ec2.internal%2C60020%2C1338343120038-splitting%2Fip-10-68-7-146.ec2.internal%252C60020%252C1338343120038.1338500559421/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/recovered.edits/0000000000017604560
=>
hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/recovered.edits/0000000000017604560
{noformat}
When I looked at the logs of the regionserver that hosted this region, around
the time the region was created, I saw:
{noformat}
2012-05-31 15:09:24,569 INFO
org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region
TestLoadAndVerify_1338488017181,\x145\xE5\x0DyC^N,1338489404005.106faef9e54a59c1c2b931d1fc36bdbe.
2012-05-31 15:09:24,569 DEBUG
org.apache.hadoop.hbase.regionserver.SplitTransaction:
regionserver:60020-0x3379b7614800004-0x3379b7614800004 Creating ephemeral node
for 106faef9e54a59c1c2b931d1fc36bdbe in SPLITTING state
{noformat}
and these:
{noformat}
2012-05-31 15:09:32,036 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Setting up tabledescriptor config now ...
2012-05-31 15:09:32,037 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Instantiated
TestLoadAndVerify_1338488017181,\x145\xE5\x0DyC^N,1338491364569.51c11f4eafdb5b5f0ff11ca998a07b43.
2012-05-31 15:09:32,281 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Setting up tabledescriptor config now ...
2012-05-31 15:09:32,281 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Instantiated
TestLoadAndVerify_1338488017181,\x15\xD9\x01\x00\x00\x00\x00\x00/000087_0,1338491364569.8974506aa04c5a04e5cc23c11de0039d.
2012-05-31 15:09:32,302 DEBUG org.apache.hadoop.hbase.io.hfile.HFileWriterV2:
Initialized with CacheConfig:enabled [cacheDataOnRead=true]
[cacheDataOnWrite=false] [cacheIndexesOnWrite=false] [cacheBloomsOnWrite=false]
[cacheEvictOnClose=false] [cacheCompressed=false]
2012-05-31 15:09:32,423 INFO org.apache.hadoop.hbase.catalog.MetaEditor:
Offlined parent region
TestLoadAndVerify_1338488017181,\x145\xE5\x0DyC^N,1338489404005.106faef9e54a59c1c2b931d1fc36bdbe.
in META
{noformat}
Some time later, the regionserver logged more entries related to the region
8974506aa04c5a04e5cc23c11de0039d:
{noformat}
2012-05-31 15:09:35,423 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Onlined
TestLoadAndVerify_1338488017181,\x15\xD9\x01\x00\x00\x00\x00\x00/000087_0,1338491364569.8974506aa04c5a04e5cc23c11de0039d.;
next sequenceid=14606548
2012-05-31 15:09:35,423 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks for
region=TestLoadAndVerify_1338488017181,\x15\xD9\x01\x00\x00\x00\x00\x00/000087_0,1338491364569.8974506aa04c5a04e5cc23c11de0039d.,
daughter=true
2012-05-31 15:09:35,426 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Large Compaction
requested:
regionName=TestLoadAndVerify_1338488017181,\x15\xD9\x01\x00\x00\x00\x00\x00/000087_0,1338491364569.8974506aa04c5a04e5cc23c11de0039d.,
storeName=f1, fileCount=4, fileSize=1.1g (1.0g, 55.9m, 18.5m, 1.2m),
priority=3, time=150035421657389; Because: Opening Region;
compaction_queue=(13:0), split_queue=0
2012-05-31 15:09:35,555 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Added
daughter
TestLoadAndVerify_1338488017181,\x15\xD9\x01\x00\x00\x00\x00\x00/000087_0,1338491364569.8974506aa04c5a04e5cc23c11de0039d.,
serverName=ip-10-68-7-146.ec2.internal,60020,1338343120038
{noformat}
*However, a previous split attempt on the region
106faef9e54a59c1c2b931d1fc36bdbe had failed*, and due to the issue reported in
HBASE-6152, the master was unable to do anything about the split (the
following log lines repeated endlessly on the master):
{noformat}
2012-05-31 15:11:23,537 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Handling transition=RS_ZK_REGION_SPLIT,
server=ip-10-68-7-146.ec2.internal,60020,1338343120038,
region=106faef9e54a59c1c2b931d1fc36bdbe
2012-05-31 15:11:23,537 WARN org.apache.hadoop.hbase.master.AssignmentManager:
Region 106faef9e54a59c1c2b931d1fc36bdbe not found on server
ip-10-68-7-146.ec2.internal,60020,1338343120038; failed processing
2012-05-31 15:11:23,537 WARN org.apache.hadoop.hbase.master.AssignmentManager:
Received SPLIT for region 106faef9e54a59c1c2b931d1fc36bdbe from server
ip-10-68-7-146.ec2.internal,60020,1338343120038 but it doesn't exist anymore,
probably already processed its split
2012-05-31 15:11:23,649 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Handling transition=RS_ZK_REGION_SPLIT,
server=ip-10-68-7-146.ec2.internal,60020,1338343120038,
region=106faef9e54a59c1c2b931d1fc36bdbe
{noformat}
So I tend to think that this anomaly threw the master off, and it could not
properly disable the daughter region 8974506aa04c5a04e5cc23c11de0039d.
> RS aborted due to rename problem (maybe a race)
> -----------------------------------------------
>
> Key: HBASE-6153
> URL: https://issues.apache.org/jira/browse/HBASE-6153
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.92.0
> Reporter: Devaraj Das
> Assignee: Devaraj Das
>
> I had a RS crash with the following:
> 2012-05-31 18:34:42,534 DEBUG org.apache.hadoop.hbase.regionserver.Store:
> Renaming flushed file at
> hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/.tmp/294a7a31f04949b8bf07682a43157b35
> to
> hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/f1/294a7a31f04949b8bf07682a43157b35
> 2012-05-31 18:34:42,536 WARN org.apache.hadoop.hbase.regionserver.Store:
> Unable to rename
> hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/.tmp/294a7a31f04949b8bf07682a43157b35
> to
> hdfs://ip-10-140-14-134.ec2.internal:8020/apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/f1/294a7a31f04949b8bf07682a43157b35
> 2012-05-31 18:34:42,541 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> ip-10-68-7-146.ec2.internal,60020,1338343120038: Replay of HLog required.
> Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region:
> TestLoadAndVerify_1338488017181,\x15\xD9\x01\x00\x00\x00\x00\x00/000087_0,1338491364569.8974506aa04c5a04e5cc23c11de0039d.
> at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1288)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1172)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1114)
> at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:400)
> at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:374)
> at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:243)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.FileNotFoundException: File does not exist:
> /apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/f1/294a7a31f04949b8bf07682a43157b35
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1901)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1892)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:636)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
> at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:387)
> at
> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
> at
> org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:470)
> at
> org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
> at
> org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:595)
> On the NameNode logs:
> 2012-05-31 18:34:42,588 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> FSDirectory.unprotectedRenameTo: failed to rename
> /apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/.tmp/294a7a31f04949b8bf07682a43157b35
> to
> /apps/hbase/data/TestLoadAndVerify_1338488017181/8974506aa04c5a04e5cc23c11de0039d/f1/294a7a31f04949b8bf07682a43157b35
> because destination's parent does not exist
> I haven't looked deeply yet but I guess it is a race of some sort.