HRegionInfo reported empty on regions in meta, leading to them being deleted,
although the regions contain data and exist
-------------------------------------------------------------------------------------------------------------------------
Key: HBASE-2176
URL: https://issues.apache.org/jira/browse/HBASE-2176
Project: Hadoop HBase
Issue Type: Bug
Affects Versions: 0.21.0
Reporter: Andrei Dragomir
Priority: Critical
We ran some tests on our cluster and started getting reports of
WrongRegionException on some rows. After looking at the data, we see that we
have "gaps" between regions, like this:
{noformat}
demo__users,user_8949795897,1264089193398 l2:60030 736660864 user_8949795897 user_8950697145 <- end key
demo__users,user_8953502603,1263992844343 l5:60030 593335873 user_8953502603 <- should be start key here user_8956071605
{noformat}
Fact: we had 28 regions that were reported with empty HRegionInfo, and deleted
from .META..
Fact: we recovered our data entirely, without any issues, by running the script
that rebuilds .META. from the table contents on HDFS (bin/add_table.rb).
Fact: on our regionservers, we have three days with no logs. To the best of our
knowledge, the machines were not rebooted and the processes were running. During
these three days, the only entry in the master logs, repeated every second, is a
.META. scan:
{noformat}
2010-01-23 00:01:27,816 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.72.135.7:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-01-23 00:01:34,413 INFO org.apache.hadoop.hbase.master.ServerManager: 6
region servers, 0 dead, average load 1113.6666666666667
2010-01-23 00:02:23,645 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.72.135.10:60020,
regionname: .META.,,1, startKey: <>}
2010-01-23 00:02:26,002 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 6679 row(s) of meta region {server:
10.72.135.10:60020, regionname: .META.,,1, startKey: <>} complete
2010-01-23 00:02:26,002 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1
.META. region(s) scanned
2010-01-23 00:02:27,821 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.72.135.7:60020,
regionname: -ROOT-,,0, startKey: <>}
.......................................................
{noformat}
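For context on the first fact above: our understanding (an assumption on our
part, not a quote of the actual BaseScanner code) is that the master's meta
scanner treats a .META. row whose info:regioninfo cell is missing or zero-length
as garbage and deletes it, which is how a region whose data still exists on disk
can vanish from .META.. The toy sketch below, with made-up class and method
names, only illustrates that check.
{noformat}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetaScanSketch {

    /** Returns the info:regioninfo value of a row, or null if it is missing or empty. */
    static byte[] regionInfoOrNull(Map<String, byte[]> row) {
        byte[] value = row.get("info:regioninfo");
        return (value == null || value.length == 0) ? null : value;
    }

    /** Scans all rows and collects the ones with an empty HRegionInfo for deletion. */
    static List<String> findEmptyRows(Map<String, Map<String, byte[]>> metaRows) {
        List<String> emptyRows = new ArrayList<>();
        for (Map.Entry<String, Map<String, byte[]>> e : metaRows.entrySet()) {
            if (regionInfoOrNull(e.getValue()) == null) {
                emptyRows.add(e.getKey());   // these rows end up deleted from .META.
            }
        }
        return emptyRows;
    }

    public static void main(String[] args) {
        Map<String, Map<String, byte[]>> meta = new LinkedHashMap<>();
        meta.put("demo__users,user_8949795897,1264089193398",
                 Map.of("info:regioninfo", "serialized-HRegionInfo".getBytes()));
        meta.put("demo__users,user_8950697145,1264089193398",
                 Map.of("info:regioninfo", new byte[0]));   // the "empty" row
        System.out.println("rows slated for deletion: " + findEmptyRows(meta));
    }
}
{noformat}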
In the master logs, we see a pretty normal evolution: region r0 is split into
r1 and r2. Now, r1 exists and is fine, but r2 no longer exists in .META.,
because it was reported as having an empty HRegionInfo. The only thing in the
master logs that looks weird is that the message about updating the region in
.META. shows up twice:
{noformat}
2010-01-27 22:46:45,007 INFO
org.apache.hadoop.hbase.master.RegionServerOperation:
demo__users,user_8950697145,1264089193398 open on 10.72.135.7:60020
2010-01-27 22:46:45,010 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
demo__users,user_8950697145,1264089193398 in region .META.,,1 with
startcode=1264661019484, server=10.72.135.7:60020
2010-01-27 22:46:45,010 INFO
org.apache.hadoop.hbase.master.RegionServerOperation:
demo__users,user_8950697145,1264089193398 open on 10.72.135.7:60020
2010-01-27 22:46:45,012 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
demo__users,user_8950697145,1264089193398 in region .META.,,1 with
startcode=1264661019484, server=10.72.135.7:60020
{noformat}
Attached you will find the entire forensics work, with explanations, in a text
file.
Suppositions:
Our entire cluster was in a really weird state. All the regionservers are
missing logs for three days, and to the best of our knowledge they were running
the whole time; during those three days the master logged ONLY .META. scan
messages, every second, reporting 6 regionservers live out of 7 total.
Also, during this time, we got "filesystem closed" messages on one of the
regionservers that hosted a missing region. This is after the gap in the logs.
How we suppose the data in .META. was lost:
1. Race conditions in ServerManager / RegionManager. In our logs, we have about
3 or 4 ConcurrentModificationExceptions in these classes (see the attached file).
2. Data loss in HDFS. On a regionserver, we got "filesystem closed" messages.
3. Data could not be read from HDFS (highly unlikely; there are no suspicious
data-read error messages).
4. A race condition in which the HRegionInfo is lost from memory and then
persisted as empty (a toy sketch of this scenario follows the list).
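To make supposition 4 concrete, here is a minimal, self-contained sketch of the
failure mode we have in mind. It is not HBase code; the class, field, and method
names are made up, and a ConcurrentHashMap stands in for .META.. It only shows
how an unsynchronized update can persist an empty info:regioninfo value when
another thread clears the in-memory state at the wrong moment.
{noformat}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class EmptyRegionInfoRace {

    // Stand-in for the in-memory region state kept by the master (hypothetical).
    static class RegionState {
        volatile byte[] regionInfoBytes = "serialized-HRegionInfo".getBytes();
    }

    // Stand-in for .META.: row key -> value of info:regioninfo.
    static final Map<String, byte[]> meta = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        final String row = "demo__users,user_8950697145,1264089193398";

        for (int attempt = 0; attempt < 100_000; attempt++) {
            final RegionState state = new RegionState();

            // Thread A: "updates .META." from the in-memory state, without
            // coordinating with whoever else touches that state.
            Thread updater = new Thread(() -> {
                byte[] snapshot = state.regionInfoBytes;   // may already be cleared
                meta.put(row, snapshot == null ? new byte[0] : snapshot);
            });

            // Thread B: concurrently resets the in-memory state, e.g. while
            // reprocessing the duplicate "region open" message seen in the logs.
            Thread resetter = new Thread(() -> state.regionInfoBytes = null);

            updater.start();
            resetter.start();
            updater.join();
            resetter.join();

            if (meta.get(row).length == 0) {
                System.out.println("attempt " + attempt
                    + ": persisted an empty regioninfo value for " + row);
                return;
            }
        }
        System.out.println("race not hit in this run (it is timing-dependent)");
    }
}
{noformat}
Whether the empty value actually gets written depends entirely on thread
interleaving, which would fit the fact that only 28 of the roughly 6679 .META.
rows were affected.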