If you kill the datanode and the regionserver, you may run into the issue Cosmin just put up a patch for in HDFS-630.

St.Ack

P.S. That's good news it's working for you, Clint.
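For reference, here is a minimal sketch -- not the actual patch -- of the two pieces Clint describes in the quoted thread below: a syncFs() on SequenceFile.Writer that flushes to the datanodes through the underlying FSDataOutputStream, and the kind of reflective lookup HLog can use so HBase still loads against a hadoop that lacks HADOOP-4379. Class and variable names here are illustrative assumptions.

    import java.io.IOException;
    import java.lang.reflect.Method;

    import org.apache.hadoop.io.SequenceFile;

    // Sketch of the method added to SequenceFile.Writer (illustrative):
    //
    //   /** Flush all written data out to the datanodes. */
    //   public void syncFs() throws IOException {
    //     if (out != null) {
    //       out.sync();  // FSDataOutputStream.sync() flushes to datanodes;
    //     }              // SequenceFile's sync() just writes a sync marker
    //   }

    // Sketch of the reflective caller on the HBase side:
    class SyncFsInvoker {
      private final Method syncFs;  // null when the bundled hadoop lacks append

      SyncFsInvoker() {
        Method m = null;
        try {
          m = SequenceFile.Writer.class.getMethod("syncFs", new Class<?>[] {});
        } catch (NoSuchMethodException e) {
          // bundled hadoop predates HADOOP-4379; fall back to a no-op
        }
        this.syncFs = m;
      }

      void sync(SequenceFile.Writer writer) throws IOException {
        if (syncFs == null) {
          return;  // no append support present; nothing we can do
        }
        try {
          syncFs.invoke(writer, new Object[] {});
        } catch (Exception e) {
          throw new IOException("syncFs failed", e);
        }
      }
    }

As Clint notes below, calling sync() instead only writes a SequenceFile sync marker into the stream; it does not push bytes to the datanodes, which is why the naive substitution did not work.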
On Wed, Sep 30, 2009 at 3:39 PM, Ryan Rawson <[email protected]> wrote:

> I've been working on HDFS-200 for a while, and I see similar
> experiences. The questions I have for HDFS-265 are: how performant is
> it? How expensive are the syncs? And just how good is the recovery?
>
> Next time, try kill -9ing the regionserver and the datanode on the
> same server.
>
> -ryan
>
> On Wed, Sep 30, 2009 at 3:35 PM, Clint Morgan <[email protected]> wrote:
> > I got this working, did a hard kill of the regionserver, and it
> > worked.
> >
> > I used the hadoop/hdfs/branches/HDFS-265 branch and was banging my
> > head trying to get it to work. I saw that HLog was reflectively
> > calling SequenceFile.Writer.syncFs(). This method did not exist (in
> > hadoop/common/branches/branch-0.21), so I naively changed it to call
> > sync(). But that is a different kind of sync...
> >
> > To get it to work I added the Writer.syncFs() method, which just
> > calls out.sync().
> >
> > On Sat, Aug 8, 2009 at 7:51 PM, Andrew Purtell <[email protected]> wrote:
> >
> >> I realized too late I was not running Hadoop with DEBUG, only
> >> HBase.
> >>
> >> I'll try again next month, when it will not hurt to lose data.
> >>
> >> - Andy
> >>
> >> ________________________________
> >> From: stack <[email protected]>
> >> To: [email protected]
> >> Sent: Saturday, August 8, 2009 6:34:07 PM
> >> Subject: Re: append (hadoop-4379), was -> Re: roadmap: data integrity
> >>
> >> Didn't mean to be so short. I'd suggest it would be good to put
> >> your experience up in HDFS-200/HADOOP-4379. The lads there would be
> >> interested in what you've found.
> >> St.Ack
> >>
> >> On Sat, Aug 8, 2009 at 9:36 AM, Andrew Purtell <[email protected]> wrote:
> >>
> >> > Cluster down hard after RS failure. Master stuck indefinitely
> >> > splitting logs. Endless instances of this message, once per
> >> > second:
> >> >
> >> > org.apache.hadoop.hdfs.DFSClient: Could not complete file
> >> > /hbase/content/1965559571/oldlogfile.lo retrying...
> >> >
> >> > Turning off "dfs.support.append".
> >> >
> >> > - Andy
> >> >
> >> > ________________________________
> >> > From: stack <[email protected]>
> >> > To: [email protected]
> >> > Sent: Friday, August 7, 2009 12:34:40 PM
> >> > Subject: Re: append (hadoop-4379), was -> Re: roadmap: data integrity
> >> >
> >> > You are a good man Andrew.
> >> > St.Ack
> >> >
> >> > On Fri, Aug 7, 2009 at 10:27 AM, Andrew Purtell <[email protected]> wrote:
> >> >
> >> > > I'm going to join you in testing this stack, taking the below
> >> > > as a config recipe.
> >> > >
> >> > > - Andy
> >> > >
> >> > > ________________________________
> >> > > From: stack <[email protected]>
> >> > > To: [email protected]
> >> > > Sent: Friday, August 7, 2009 9:54:53 AM
> >> > > Subject: append (hadoop-4379), was -> Re: roadmap: data integrity
> >> > >
> >> > > Here is a quick note on the current state of my testing of
> >> > > HADOOP-4379 (support for 'append' in hadoop 0.20.x).
> >> > >
> >> > > On my small test cluster, I am not able to break the latest
> >> > > patch posted by Dhruba under heavy loading. It seems to
> >> > > basically work. On regionserver crash, the master runs log
> >> > > split, and when it comes to the last in the set of regionserver
> >> > > logs for splitting -- the one that is inevitably unclosed
> >> > > because the process crashed -- we are able to recover most
> >> > > edits in this last file (in my testing, it seemed to be all
> >> > > edits up to the last flush of the regionserver process).
> >> > >
> >> > > The upshot is that, tentatively, we may have a "working" append
> >> > > in the 0.20 timeframe (in 0.21, we should have
> >> > > https://issues.apache.org/jira/browse/HDFS-265). I'll keep
> >> > > testing, but I'd suggest it's time for others to try it out.
> >> > >
> >> > > With HADOOP-4379, the process recovering non-closed log files --
> >> > > the master in our case -- must successfully open the file in
> >> > > append mode and then close it. Once closed, new readers can
> >> > > purportedly see up to the last flush. The open for append can
> >> > > take a little while before it will go through (the complaint is
> >> > > that another process holds the file's lease). Meantime, the
> >> > > process opening for append must retry. In my experience it's
> >> > > taking 2-10 seconds.
> >> > >
> >> > > Support for appends is off by default in hadoop even after
> >> > > HADOOP-4379 has been applied. To enable it, you need to set
> >> > > dfs.support.append. Set it everywhere -- all over hadoop and in
> >> > > hbase-site.xml, so hbase/DFSClient can see the attribute.
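For illustration, the setting stack describes would look something like the following property. Placing it in both hdfs-site.xml (for the hadoop daemons) and hbase-site.xml (so the DFSClient inside HBase picks it up) is one reading of "set it everywhere"; the property name itself is from the thread.

    <property>
      <name>dfs.support.append</name>
      <value>true</value>
      <description>Allow appends to files. Off by default even with
      HADOOP-4379 applied; must be visible to both the HDFS daemons
      and the DFSClient embedded in HBase.</description>
    </property>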
> >> > > HBase TRUNK will recognize whether the bundled hadoop supports
> >> > > append via introspection (SequenceFile has a new syncFs method
> >> > > once HADOOP-4379 has been applied). If an append-supporting
> >> > > hadoop is present, and dfs.support.append is set in the hbase
> >> > > context, then hbase, when running HLog#splitLog, will try
> >> > > opening files for append. On regionserver crash, you can see
> >> > > the master's HLog#splitLog loop retrying the open for append
> >> > > until it is successful (you'll see in the master log the
> >> > > complaint that the lease on the file is held by another
> >> > > process). We retry every second.
> >> > >
> >> > > Successful recovery of all edits is uncovering new, interesting
> >> > > issues. In my testing I was not only killing the regionserver
> >> > > alone but also killing the regionserver and datanode together.
> >> > > In the latter case, what I would see is that the namenode would
> >> > > continue to assign the dead datanode work, at least until its
> >> > > lease expired. Fair enough, says you, only the datanode lease
> >> > > is ten minutes by default. I set it down in my tests using
> >> > > heartbeat.recheck.interval (there is a pregnant comment in
> >> > > HADOOP-4379 w/ client-side code where Ruyue Ma says they get
> >> > > around this issue by having the client pass the namenode the
> >> > > datanodes it knows are dead when asking for an extra block). We
> >> > > might want to recommend setting it down in general.
> >> > >
> >> > > Other issues are hbase bugs we see when all edits are
> >> > > recovered. I've been filing issues on these over the last few
> >> > > days.
> >> > >
> >> > > St.Ack
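A rough sketch of the recovery loop stack describes above -- the master opening the unclosed log for append until the dead regionserver's lease is relinquished, then closing it so edits up to the last flush become visible to new readers. The class and method names are illustrative assumptions, not the actual HLog#splitLog code.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class LeaseRecoverySketch {
      /** Open-for-append then close, forcing lease recovery on a log
       *  left unclosed by a crashed regionserver. */
      static void recoverLog(FileSystem fs, Path log) throws Exception {
        while (true) {
          try {
            FSDataOutputStream out = fs.append(log);
            out.close();  // once closed, new readers can see edits up
            return;       // to the crashed writer's last flush
          } catch (IOException e) {
            // Typically a complaint that another process holds the
            // file's lease; in stack's testing this cleared in 2-10s.
            Thread.sleep(1000);  // retry every second, as the master does
          }
        }
      }
    }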
> >> > > On Fri, Aug 7, 2009 at 9:03 AM, Andrew Purtell <[email protected]> wrote:
> >> > >
> >> > > > Good to see there's direct edit replication support; that can
> >> > > > make things easier. I've seen people use DRBD or NFS to
> >> > > > replicate edits currently.
> >> > > >
> >> > > > Namenode failover is a "solvable" issue with traditional HA:
> >> > > > OS-level heartbeats, fencing, failover -- e.g. an HA
> >> > > > infrastructure daemon starts an NN instance on node B if the
> >> > > > heartbeat from node A is lost, and takes a power control
> >> > > > operation on A to make sure it is dead. On both nodes the
> >> > > > infrastructure daemons trigger the OS watchdog if the NN
> >> > > > process dies. Combine this with automatic IP address
> >> > > > reassignment. Then, page the operators. Add another node C
> >> > > > for additional redundancy, and make sure all of the
> >> > > > alternatives are on separate racks and power rails, and make
> >> > > > sure the L2 and L3 topology is also HA (e.g. bonded ethernet
> >> > > > to redundant switches at L2, mesh routing at L3, etc.). If
> >> > > > the cluster is not super huge it can all be spanned at L2
> >> > > > over redundant switches. L3 redundancy is trickier. A typical
> >> > > > configuration could have a lot of OSPF stub networks --
> >> > > > depends how L2 is partitioned -- which can make the routing
> >> > > > table difficult for operators to sort out.
> >> > > >
> >> > > > I've seen this type of thing work for myself: ~15 seconds
> >> > > > from a (simulated) fault on NN node A to the new NN up and
> >> > > > responding to DN reconnections on node B, with 0.19.
> >> > > >
> >> > > > You can build in additional assurance of fast failover by
> >> > > > running redundant processes alongside a few datanodes which
> >> > > > over and over ping the NN via the namenode protocol and
> >> > > > trigger fencing and failover if it stops responding.
> >> > > >
> >> > > > One wrinkle is that the new namenode starts up in safe mode.
> >> > > > As long as HBase can handle temporary periods where the
> >> > > > cluster goes into safe mode after NN failover, it can ride it
> >> > > > out.
> >> > > >
> >> > > > This is ugly, but it is, I believe, an accepted and valid
> >> > > > systems engineering solution for the NN SPOF issue for the
> >> > > > folks I mentioned in my previous email, something they would
> >> > > > be familiar with. Edit replication support in HDFS 0.21 makes
> >> > > > it a little less work to achieve and maybe a little faster to
> >> > > > execute, so that's an improvement.
> >> > > >
> >> > > > It may be overstating it a little bit to say that the NN SPOF
> >> > > > is not a concern for HBase, but, in my opinion, we need to
> >> > > > address the WAL and (lack of) FSCK issues first before being
> >> > > > concerned about it. HBase can lose data all on its own.
> >> > > >
> >> > > > - Andy
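A minimal sketch of the kind of watchdog Andy describes above -- a process that repeatedly pings the NN and triggers fencing plus failover when it stops answering. Pinging through the FileSystem API and the fence-script path are stand-in assumptions here; a real deployment would drive this from the HA infrastructure and the namenode protocol itself.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class NamenodeWatchdog {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up fs.default.name
        int failures = 0;
        while (true) {
          try {
            FileSystem fs = FileSystem.get(conf);
            fs.getFileStatus(new Path("/"));  // cheap namenode round trip
            failures = 0;
          } catch (Exception e) {
            if (++failures >= 3) {
              // Fence the old NN (e.g. a power control operation) before
              // failing over, so two namenodes never run at once.
              // Script path is hypothetical.
              new ProcessBuilder("/usr/local/bin/fence-and-failover.sh")
                  .start().waitFor();
              return;
            }
          }
          Thread.sleep(5000);  // ping every five seconds
        }
      }
    }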
> >> > > >
> >> > > > ________________________________
> >> > > > From: Jean-Daniel Cryans <[email protected]>
> >> > > > To: [email protected]
> >> > > > Sent: Friday, August 7, 2009 3:25:19 AM
> >> > > > Subject: Re: roadmap: data integrity
> >> > > >
> >> > > > https://issues.apache.org/jira/browse/HADOOP-4539
> >> > > >
> >> > > > This issue was closed long ago. But Steve Loughran just said
> >> > > > on the hadoop mailing list that the new NN has to come up
> >> > > > with the same IP/hostname as the failed one.
> >> > > >
> >> > > > J-D
> >> > > >
> >> > > > On Fri, Aug 7, 2009 at 2:37 AM, Ryan Rawson <[email protected]> wrote:
> >> > > > > The WAL is a major issue, but another one that is coming up
> >> > > > > fast is the SPOF that is the namenode.
> >> > > > >
> >> > > > > Right now, namenode aside, I can rolling-restart my entire
> >> > > > > cluster, including rebooting the machines if I needed to.
> >> > > > > But not so with the namenode, because if it goes AWOL, all
> >> > > > > sorts of bad can happen.
> >> > > > >
> >> > > > > I hope that HDFS 0.21 addresses both these issues. Can we
> >> > > > > get positive confirmation that this is being worked on?
> >> > > > >
> >> > > > > -ryan
> >> > > > >
> >> > > > > On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell <[email protected]> wrote:
> >> > > > >> I updated the roadmap up on the wiki:
> >> > > > >>
> >> > > > >> * Data integrity
> >> > > > >>   * Ensure that proper append() support in HDFS actually
> >> > > > >>     closes the WAL last-block write hole
> >> > > > >>   * HBase-FSCK (HBASE-7) -- suggest making this a blocker
> >> > > > >>     for 0.21
> >> > > > >>
> >> > > > >> I have had several recent conversations on my travels with
> >> > > > >> people in Fortune 100 companies (based on this list:
> >> > > > >> http://www.wageproject.org/content/fortune/index.php).
> >> > > > >>
> >> > > > >> You and I know we can set up well-engineered HBase 0.20
> >> > > > >> clusters that will be operationally solid for a wide range
> >> > > > >> of use cases, but given those aforementioned discussions
> >> > > > >> there are certain sectors which would say HBASE-7 is #1
> >> > > > >> before HBase is "bank ready". Not until we can say:
> >> > > > >>
> >> > > > >> - Yes, when the client sees data has been committed, it
> >> > > > >>   actually has been written and replicated on spinning or
> >> > > > >>   solid state media in all cases.
> >> > > > >>
> >> > > > >> - Yes, we go to great lengths to recover data if, ${deity}
> >> > > > >>   forbid, you crush some underprovisioned cluster with
> >> > > > >>   load or some bizarre bug or system fault happens.
> >> > > > >>
> >> > > > >> HBASE-1295 is also required for business continuity
> >> > > > >> reasons, but this is already a priority item for some
> >> > > > >> HBase committers.
> >> > > > >>
> >> > > > >> The question, I think, is whether the above aligns with
> >> > > > >> project goals. Making HBase-FSCK a blocker will probably
> >> > > > >> knock something someone wants for the 0.21 timeframe off
> >> > > > >> the list.
> >> > > > >>
> >> > > > >> - Andy
