I suspect problems in region splitting are the root of the META holes, especially for the onesy-twosy hbck region-consistency problems.
Jon.

On Thu, Jan 5, 2012 at 8:40 PM, Vladimir Rodionov <[email protected]> wrote:

> Jon,
>
> My question was about "orphaned" data in hdfs in the first place. It
> looks like either region splits or table deletes (or both) are not
> executed correctly (with old data not being removed completely).
>
> Our original issue was related to .META. inconsistency (region holes)
> for one of our internal system tables. How it occurred is beyond my
> comprehension, therefore I can't say for sure what the reason was.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
>
> ________________________________________
> From: Jonathan Hsieh [[email protected]]
> Sent: Thursday, January 05, 2012 5:58 PM
> To: [email protected]
> Subject: Re: OfflineMetaRepair?
>
> Vlad,
>
> If it is a deleted table, you can just delete those dirs out of hdfs
> directly.
>
> The workflow for this first cut of the tool is cautious and requires
> the user to make decisions on what to do with orphaned data and to
> handle it manually. Basically, at the time, I had only encountered this
> kind of problem a few times; I didn't want the tool to delete any data,
> and I wanted to push that decision to the user.
>
> The problem that triggered me to write this tool was a situation where
> 2300 meta rows were bad and 3 hdfs regiondirs were missing .regioninfo
> files. Manually repairing meta was out of the question. The likely
> cause in that situation was that the hdfs nn died under hbase, and
> hbase likely got confused during recovery.
>
> Other cases where I've encountered similar problems generally have to
> do with region splits that failed to complete successfully and failed
> to roll back properly.
>
> Did you encounter any of these kinds of events that could have
> triggered your problems?
>
> FWIW, I'm in the process of debugging a new version (HBASE-5128) of the
> tool that tries to automatically restore data while online.
> Hopefully this can repair bad region splits in a relatively painless
> manner. Currently the test cases are good, and I'm testing against a
> real cluster that I'm intentionally corrupting. Hopefully I should have
> a patch for 0.90.5 ready in a few days (but there may be limitations).
>
> Jon.
>
> On Thu, Jan 5, 2012 at 5:37 PM, Vladimir Rodionov
> <[email protected]> wrote:
>
> > I cp'ed hdfs-site.xml into HBASE_CONF_DIR and was able to run the tool.
> >
> > The tool found a lot of abandoned regions, like this one:
> >
> > 12/01/06 01:18:15 ERROR util.HBaseFsck: Bailed out due to:
> > org.apache.hadoop.hbase.util.HBaseFsck$RegionInfoLoadException: Unable
> > to load region info for table TRIAL-DIMENSIONS-1324576713641! It may
> > be an invalid format or version file. You may want to remove
> > hdfs://us01-ciqps1-name01.carrieriq.com:9000/hbase/TRIAL-DIMENSIONS-1324576713641/ff6031e6472d10bac8517314179acb33
> > region from hdfs and retry.
> >         at org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:292)
> >         at org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402)
> >         at org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRepair.java:90)
> >
> > There are hundreds of such regions, literally.
> >
> > Region directories contain only a .tmp subdir, like this one:
> >
> > /hbase/M2M-INTEGRATION-MM_ERRORS-1324575562966/fd480b2c39f7d3333308bf1d9a304510/.tmp
> >
> > No .regioninfo.
> >
> > These dirs are left-overs of tables which have been deleted already,
> > and they confuse this tool. If we delete a table we should wipe out
> > the whole directory, shouldn't we?
> > Is there any scenario which can explain this?
> >
> > Best regards,
> > Vladimir Rodionov
> > Principal Platform Engineer
> > Carrier IQ, www.carrieriq.com
> > e-mail: [email protected]
> >
> > ________________________________________
> > From: Todd Lipcon [[email protected]]
> > Sent: Thursday, January 05, 2012 4:50 PM
> > To: [email protected]
> > Subject: Re: OfflineMetaRepair?
> >
> > Are you sure you have fs.default.name set properly to hdfs://yournn/
> > in your hbase-site.xml?
> >
> > You shouldn't *have* to do this, but I bet it will fix the issue.
> >
> > -Todd
> >
> > On Thu, Jan 5, 2012 at 4:26 PM, Jonathan Hsieh <[email protected]> wrote:
> > > Hey Vlad,
> > >
> > > I wrote the tool -- and I've used it to repair a fairly messed up
> > > META table. I used it on a local filesystem copy of META (just got
> > > all the .regioninfo files in their directory paths), and then
> > > shipped the repaired version of the .META. dir to the customer.
> > >
> > > This is definitely a bug. File the JIRA and I'll try to fix it in
> > > the next few days.
> > >
> > > Jon.
> > >
> > > On Thu, Jan 5, 2012 at 4:16 PM, Vladimir Rodionov
> > > <[email protected]> wrote:
> > >
> > >> Ted,
> > >>
> > >> "fs.default.name" is a standard config property name which is
> > >> described here:
> > >> http://hadoop.apache.org/common/docs/current/core-default.html
> > >>
> > >> It is not CDH-specific. If you are right then this tool has never
> > >> been tested.
> > >>
> > >> Best regards,
> > >> Vladimir Rodionov
> > >> Principal Platform Engineer
> > >> Carrier IQ, www.carrieriq.com
> > >> e-mail: [email protected]
> > >>
> > >> ________________________________________
> > >> From: Ted Yu [[email protected]]
> > >> Sent: Thursday, January 05, 2012 4:06 PM
> > >> To: [email protected]
> > >> Subject: Re: OfflineMetaRepair?
> > >>
> > >> Vlad:
> > >> In the future, please drop unrelated discussion from the bottom of
> > >> your email.
> > >>
> > >> I think what you saw was caused by the FS default name not being
> > >> set correctly.
> > >> In hbck:
> > >>     conf.set("fs.defaultFS", conf.get(HConstants.HBASE_DIR));
> > >> But cdh3 uses:
> > >>     conf.set("fs.default.name", "hdfs://localhost:0");
> > >> (./src/test/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java)
> > >>
> > >> You can try adding the following line after line 77 of
> > >> OfflineMetaRepair.java:
> > >>     conf.set("fs.default.name", path);
> > >> and rebuilding hbase 0.90.6 (tip of the 0.90 branch).
> > >>
> > >> If the above works, please file a JIRA.
> > >>
> > >> Thanks
> > >>
> > >> On Thu, Jan 5, 2012 at 3:30 PM, Vladimir Rodionov
> > >> <[email protected]> wrote:
> > >>
> > >> > 0.90.5
> > >> >
> > >> > I am trying to repair the .META. table using this tool.
> > >> >
> > >> > 1. HBase cluster was shut down.
> > >> >
> > >> > Then I ran:
> > >> >
> > >> > 2. [name01 bin]$ hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
> > >> >    -base hdfs://us01-ciqps1-name01.carrieriq.com:9000/hbase -details
> > >> >
> > >> > This is what I got:
> > >> >
> > >> > 12/01/05 23:23:15 INFO util.HBaseFsck: Loading HBase regioninfo
> > >> > from HDFS...
> > >> > 12/01/05 23:23:30 ERROR util.HBaseFsck: Bailed out due to:
> > >> > java.lang.IllegalArgumentException: Wrong FS:
> > >> > hdfs://us01-ciqps1-name01.carrieriq.com:9000/hbase/M2M-INTEGRATION-MM_TION-1325190318714/0003d2ede27668737e192d8430dbe5d0/.regioninfo,
> > >> > expected: file:///
> > >> >         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352)
> > >> >         at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
> > >> >         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368)
> > >> >         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
> > >> >         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:126)
> > >> >         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
> > >> >         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
> > >> >         at org.apache.hadoop.hbase.util.HBaseFsck.loadMetaEntry(HBaseFsck.java:256)
> > >> >         at org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:284)
> > >> >         at org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402)
> > >> >         at org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRepair.java:90)
> > >> >
> > >> > Q: What am I doing wrong?
> > >> >
> > >> > Best regards,
> > >> > Vladimir Rodionov
> > >> > Principal Platform Engineer
> > >> > Carrier IQ, www.carrieriq.com
> > >> > e-mail: [email protected]
> > >>
> > >> Confidentiality Notice: The information contained in this message,
> > >> including any attachments hereto, may be confidential and is
> > >> intended to be read only by the individual or entity to whom this
> > >> message is addressed.
> > >> If the reader of this message is not the intended recipient or an
> > >> agent or designee of the intended recipient, please note that any
> > >> review, use, disclosure or distribution of this message or its
> > >> attachments, in any form, is strictly prohibited. If you have
> > >> received this message in error, please immediately notify the
> > >> sender and/or [email protected] and delete or destroy
> > >> any copy of this message and its attachments.
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // [email protected]
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // [email protected]

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
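The "Wrong FS" failure earlier in the thread is a URI-scheme mismatch: hbck hands an hdfs:// path to a FileSystem whose default scheme is still file://. A minimal, self-contained sketch of that check follows; `checkFsScheme` is a hypothetical stand-in for Hadoop's `FileSystem.checkPath`, not the real implementation, and the namenode host is a placeholder.

```java
import java.net.URI;

public class WrongFsSketch {
    // Hypothetical stand-in for Hadoop's FileSystem.checkPath():
    // reject a path whose URI scheme differs from the default filesystem's.
    static void checkFsScheme(String defaultFs, String path) {
        String expected = URI.create(defaultFs).getScheme(); // e.g. "file"
        String actual = URI.create(path).getScheme();        // e.g. "hdfs"
        if (actual != null && !actual.equals(expected)) {
            throw new IllegalArgumentException(
                "Wrong FS: " + path + ", expected: " + expected + ":///");
        }
    }

    public static void main(String[] args) {
        // With the default FS left at file:///, an hdfs:// path is rejected,
        // which mirrors the IllegalArgumentException in the stack trace above.
        try {
            checkFsScheme("file:///",
                "hdfs://namenode:9000/hbase/some-table/region/.regioninfo");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
        // With the default FS pointed at the namenode, the same path passes.
        checkFsScheme("hdfs://namenode:9000/",
            "hdfs://namenode:9000/hbase/some-table/region/.regioninfo");
        System.out.println("ok");
    }
}
```

This is why pointing fs.default.name (or fs.defaultFS) at the namenode, as suggested in the thread, makes the error go away: the expected scheme becomes hdfs.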

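Concretely, the hbase-site.xml override Todd suggests might look like the following sketch; `yournn:9000` is a placeholder for the actual namenode host and port, and on newer Hadoop versions the equivalent key is fs.defaultFS.

```xml
<!-- hbase-site.xml: sketch of the suggested workaround;
     replace yournn:9000 with your namenode's host and port -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://yournn:9000/</value>
</property>
```

As noted in the thread, this should not normally be necessary, since HBase derives the filesystem from hbase.rootdir; it is a workaround for the tool's config handling.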