[ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932814#action_12932814 ]
Todd Lipcon commented on HBASE-3243:
------------------------------------
Here are some relevant logs. The region in question is
user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1
Here's the birth of the region on haus01:
2010-11-16 02:28:10,471 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1.
2010-11-16 02:28:11,432 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Region split, META updated, and report to master. Parent=usertable,user1057689679,1289901563182.24d8aec47045a86b97b3508ed16ced29., new regions: usertable,user1057689679,1289903283192.b313a3864e1eb1aaf2a72dcf2c2814b3., usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1.. Split took 8sec
Master gets split:
2010-11-16 02:28:11,450 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: usertable,user1057689679,1289901563182.24d8aec47045a86b97b3508ed16ced29.: Daughters; usertable,user1057689679,1289903283192.b313a3864e1eb1aaf2a72dcf2c2814b3., usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1. from haus01.sf.cloudera.com,60020,1289890926766
After that there are plenty of successful flushes/compactions on haus01, and no
mention of the region on any other region server. The benchmark finished around
3:20am, and there is no mention of this region in any logs from that point until
I disabled the table this evening:
Master says:
2010-11-16 18:16:17,223 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1. (offlining)
Region server *haus02* (NOTE: this is the wrong regionserver!!) says:
2010-11-16 18:16:17,238 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received close region: usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1.
2010-11-16 18:16:17,239 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for region we are not serving; 8a2c525797cad9fd9882cc6304362bd1
Master says:
2010-11-16 18:16:17,240 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to send CLOSE to serverName=haus02.sf.cloudera.com,60020,1289890927217, load=(requests=1029, regions=28, usedHeap=4772, maxHeap=8185) for region usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1. but failed, setting region as OFFLINE and reassigning
...
2010-11-16 18:16:17,248 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1. state=PENDING_CLOSE, ts=1289960177227
2010-11-16 18:16:17,248 INFO org.apache.hadoop.hbase.master.AssignmentManager: Table usertable disabling; skipping assign of usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1.
2010-11-16 18:16:17,248 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x12c537d84e10000 Deleting existing unassigned node for 8a2c525797cad9fd9882cc6304362bd1 that is in expected state RS_ZK_REGION_CLOSED
2010-11-16 18:16:17,248 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x12c537d84e10000 Unable to get data of znode /hbase/unassigned/8a2c525797cad9fd9882cc6304362bd1 because node does not exist (not necessarily an error)
So basically, the master sent the CLOSE for this region to the wrong regionserver
(haus02, when the region was opened on haus01), and the logs give little
indication why.
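
For anyone who wants to repeat this kind of trace on their own logs, here is a rough sketch of how the per-region timeline above could be pulled together. It is only an illustration: the function name `region_timeline`, the source labels ("haus01", "master", "haus02"), and the assumption that each log entry sits on a single line starting with a `yyyy-MM-dd HH:mm:ss,SSS` timestamp are mine, not anything HBase provides.

```python
from datetime import datetime

# Timestamp layout seen in the log lines quoted above.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LEN = len("2010-11-16 02:28:10,471")


def region_timeline(logs_by_source, encoded_name):
    """Return (timestamp, source, message) tuples mentioning the region, oldest first.

    logs_by_source maps a label chosen by the caller (hypothetical, e.g.
    "master" or "haus01") to an iterable of single-line log entries.
    """
    events = []
    for source, lines in logs_by_source.items():
        for line in lines:
            if encoded_name not in line:
                continue
            try:
                ts = datetime.strptime(line[:TS_LEN], TS_FORMAT)
            except ValueError:
                # No leading timestamp (e.g. a wrapped continuation line); skip it.
                continue
            events.append((ts, source, line[TS_LEN:].strip()))
    # Sort by time only; entries with equal timestamps keep their input order.
    events.sort(key=lambda event: event[0])
    return events


# Two of the lines quoted above, keyed by where they were logged.
sample = {
    "haus02": [
        "2010-11-16 18:16:17,239 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: "
        "Received close for region we are not serving; 8a2c525797cad9fd9882cc6304362bd1",
    ],
    "haus01": [
        "2010-11-16 02:28:10,471 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: "
        "Instantiated usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1.",
    ],
}

for ts, source, message in region_timeline(sample, "8a2c525797cad9fd9882cc6304362bd1"):
    print(ts, source, message)
```

Matching on the encoded region name rather than the full region name is deliberate, since some of the messages above (the "not serving" warning and the ZK lines) only print the encoded name.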
> Disable Table closed region on wrong host
> -----------------------------------------
>
> Key: HBASE-3243
> URL: https://issues.apache.org/jira/browse/HBASE-3243
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.0
> Reporter: Todd Lipcon
> Priority: Blocker
> Fix For: 0.90.0
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data
> overnight. Then I disabled the table, and the master for some reason closed
> one region on the wrong server. The server ignored this, but the region
> remained open on a different server, which later flipped out when it tried to
> flush due to hlog accumulation.