[
https://issues.apache.org/jira/browse/HBASE-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrei Dragomir updated HBASE-2176:
-----------------------------------
Attachment: 799255.txt
Log forensics on our cluster.
> HRegionInfo reported empty on regions in meta, leading to them being deleted,
> although the regions contain data and exist
> -------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-2176
> URL: https://issues.apache.org/jira/browse/HBASE-2176
> Project: Hadoop HBase
> Issue Type: Bug
> Affects Versions: 0.21.0
> Reporter: Andrei Dragomir
> Priority: Critical
> Attachments: 799255.txt
>
>
> We ran some tests on our cluster and started getting reports of
> WrongRegionException on some rows. After looking at the data, we see that we
> have "gaps" between regions, like this:
> {noformat}
> demo__users,user_8949795897,1264089193398 l2:60030 736660864
> user_8949795897 user_8950697145 <- end key
> demo__users,user_8953502603,1263992844343 l5:60030 593335873
> user_8953502603 <- should be start key here user_8956071605
> {noformat}
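> To make the failure mode concrete: in a healthy .META., each region's end
> key equals the next region's start key, so a missing row shows up as a
> "hole" in the key space, and rows falling into the hole belong to no
> region. A minimal sketch of that invariant check (plain Java with string
> keys instead of HBase byte[] keys; not HBase code):
> {noformat}
> import java.util.Arrays;
> import java.util.List;
>
> // Sketch only: given region entries sorted by start key (as .META. stores
> // them), flag any spot where one region's end key is not the next region's
> // start key. A row in such a hole belongs to no region, which is exactly
> // what a client sees as WrongRegionException.
> public class MetaGapCheck {
>   static class RegionEntry {
>     final String name, startKey, endKey;
>     RegionEntry(String name, String startKey, String endKey) {
>       this.name = name; this.startKey = startKey; this.endKey = endKey;
>     }
>   }
>
>   public static void main(String[] args) {
>     List<RegionEntry> regions = Arrays.asList(
>       new RegionEntry("demo__users,user_8949795897,1264089193398",
>                       "user_8949795897", "user_8950697145"),
>       new RegionEntry("demo__users,user_8953502603,1263992844343",
>                       "user_8953502603", "user_8956071605"));
>
>     for (int i = 1; i < regions.size(); i++) {
>       RegionEntry prev = regions.get(i - 1), cur = regions.get(i);
>       if (!prev.endKey.equals(cur.startKey)) {
>         System.out.println("Hole after " + prev.name + ": end key "
>             + prev.endKey + " != next start key " + cur.startKey);
>       }
>     }
>   }
> }
> {noformat}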
> Fact: we had 28 regions that were reported as having an empty HRegionInfo
> and were deleted from .META..
> Fact: we recovered our data entirely, without any issues, by running the
> script that rebuilds .META. from table contents (bin/add_table.rb)
> Fact: on our regionservers, we have three days with no logs. To the best of
> our knowledge, the machines were not rebooted and the processes were
> running. During these three days, the only entry in the master logs,
> repeated every second, is a .META. scan:
> {noformat}
> 2010-01-23 00:01:27,816 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scan of 1 row(s) of meta region {server:
> 10.72.135.7:60020, regionname: -ROOT-,,0, startKey: <>} complete
> 2010-01-23 00:01:34,413 INFO org.apache.hadoop.hbase.master.ServerManager: 6
> region servers, 0 dead, average load 1113.6666666666667
> 2010-01-23 00:02:23,645 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scanning meta region {server: 10.72.135.10:60020,
> regionname: .META.,,1, startKey: <>}
> 2010-01-23 00:02:26,002 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scan of 6679 row(s) of meta region {server:
> 10.72.135.10:60020, regionname: .META.,,1, startKey: <>} complete
> 2010-01-23 00:02:26,002 INFO org.apache.hadoop.hbase.master.BaseScanner: All
> 1 .META. region(s) scanned
> 2010-01-23 00:02:27,821 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server: 10.72.135.7:60020,
> regionname: -ROOT-,,0, startKey: <>}
> .......................................................
> {noformat}
> In the master logs, we see a pretty normal evolution: region r0 is split into
> r1 and r2. Now, r1 exists and is good, but r2 no longer exists in .META.,
> because it was reported as having an empty HRegionInfo. The only weird thing
> in the master logs is that the message about updating the region in .META.
> comes up twice:
> {noformat}
> 2010-01-27 22:46:45,007 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> demo__users,user_8950697145,1264089193398 open on 10.72.135.7:60020
> 2010-01-27 22:46:45,010 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
> demo__users,user_8950697145,1264089193398 in region .META.,,1 with
> startcode=1264661019484, server=10.72.135.7:60020
> 2010-01-27 22:46:45,010 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> demo__users,user_8950697145,1264089193398 open on 10.72.135.7:60020
> 2010-01-27 22:46:45,012 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
> demo__users,user_8950697145,1264089193398 in region .META.,,1 with
> startcode=1264661019484, server=10.72.135.7:60020
> {noformat}
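> The doubled "Updated row" message suggests the same open report was
> processed twice. A toy illustration of the kind of unsynchronized
> check-then-act that could cause this (illustrative only, NOT the actual
> master code; the class and map below are made up):
> {noformat}
> import java.util.HashMap;
> import java.util.Map;
>
> // Two threads can both observe the region as pending before either removes
> // it, so the .META. update runs (and is logged) twice. The same
> // unsynchronized HashMap, iterated elsewhere, would also throw the
> // ConcurrentModificationExceptions we see in ServerManager / RegionManager.
> public class DuplicateOpenRace {
>   private final Map<String, String> pendingOpens = new HashMap<>();
>
>   DuplicateOpenRace() {
>     pendingOpens.put("demo__users,user_8950697145,1264089193398", "pending");
>   }
>
>   void regionOpened(String regionName, String server) {
>     if (pendingOpens.containsKey(regionName)) {  // both threads can pass this
>       System.out.println("Updated row " + regionName
>           + " in region .META.,,1 with server=" + server);
>       pendingOpens.remove(regionName);           // second remove is a no-op
>     }
>   }
>
>   public static void main(String[] args) {
>     final DuplicateOpenRace m = new DuplicateOpenRace();
>     Runnable r = () -> m.regionOpened(
>         "demo__users,user_8950697145,1264089193398", "10.72.135.7:60020");
>     new Thread(r).start();
>     new Thread(r).start();  // with unlucky timing, the update prints twice
>   }
> }
> {noformat}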
> Attached you will find the entire forensics work, with explanations, in a
> text file.
> Suppositions:
> Our entire cluster was in a really weird state. All the regionservers are
> missing logs for three days and, to the best of our knowledge, they were
> running; during this time the master has ONLY .META. scan messages, every
> second, reporting 6 regionservers live out of 7 total.
> Also, during this time, we get filesystem closed messages on a regionserver
> holding one of the missing regions. This is after the gap in the logs.
> How we suppose the data in .META. was lost:
> 1. Race conditions in ServerManager / RegionManager. In our logs we have
> about 3 or 4 ConcurrentModificationExceptions (CMEs) in these classes (see
> the attached file)
> 2. Data loss in HDFS. On a regionserver, we get filesystem closed messages
> 3. Data could not be read from HDFS (highly unlikely; there are no weird
> data-read messages)
> 4. Race condition leading to the HRegionInfo being lost from memory and then
> persisted as empty (see the sketch after this list).
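> A hypothetical sketch of supposition 4 (made-up class and names, not the
> real scanner code): if a raced HRegionInfo gets persisted as an empty
> info:regioninfo cell, any sweep that treats "empty regioninfo" as garbage
> deletes the row from .META., while the region's files in HDFS stay intact,
> which is consistent with add_table.rb being able to rebuild everything:
> {noformat}
> import java.util.LinkedHashMap;
> import java.util.Map;
>
> public class EmptyRegionInfoSweep {
>   public static void main(String[] args) {
>     // .META. row -> serialized info:regioninfo cell (empty = corrupted case)
>     Map<String, byte[]> meta = new LinkedHashMap<>();
>     meta.put("demo__users,user_8949795897,1264089193398", new byte[]{1, 2, 3});
>     meta.put("demo__users,user_8950697145,1264089193398", new byte[0]); // raced
>
>     meta.entrySet().removeIf(e -> {
>       boolean empty = e.getValue().length == 0;
>       if (empty) {
>         // Destructive step: the row vanishes from .META., but the region's
>         // data files are untouched, so a .META. rebuild can recover it.
>         System.out.println("Deleting row with empty HRegionInfo: " + e.getKey());
>       }
>       return empty;
>     });
>   }
> }
> {noformat}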
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.