Re: State of HA

Jean-Daniel Cryans Tue, 02 Jun 2009 07:58:15 -0700

Andrew,

I think you are confusing some components of the whole stack here. The
Namenode is the master for HDFS just like the HMaster is the master
for HBase. Hadoop is two things: HDFS and an implementation of
MapReduce, which also has a master, the JobTracker. HBase sits on top
of all that.


So with regard to what's fixed, the HMaster SPOF is fixed for 0.20.
The Namenode in 0.20 is still a SPOF. That means, if you want HA, you
should get a really reliable machine for the Namenode, but you can put
the HMaster on any node you want.
AFAIK, there is a BackupNamenode in Hadoop 0.21 that serves as a
Namenode failover.
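
As a rough illustration of the layering described above, here is a minimal Python sketch. The class and field names are hypothetical and are not Hadoop or HBase APIs. It only records what this thread states for 0.20, and the JobTracker's status is left as unknown because nobody in the thread says.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Master:
    """One master process in the stack (illustrative model only)."""
    name: str                      # master process
    layer: str                     # layer it coordinates
    spof_in_0_20: Optional[bool]   # None = not stated in this thread


# Ordered bottom-up: HBase sits on top of HDFS and MapReduce.
STACK = [
    Master("Namenode", "HDFS", True),         # still a SPOF in 0.20
    Master("JobTracker", "MapReduce", None),  # assumption: status unknown here
    Master("HMaster", "HBase", False),        # SPOF fixed for 0.20
]


def known_spofs(stack: List[Master]) -> List[str]:
    """Return the masters this thread explicitly calls a SPOF."""
    return [m.name for m in stack if m.spof_in_0_20 is True]


print(known_spofs(STACK))
```

Under these assumptions, the only master that needs the "really reliable machine" treatment is the Namenode.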

J-D

On Tue, Jun 2, 2009 at 10:49 AM,  <[email protected]> wrote:
> Occasionally, I think that I am getting all of this, but then a statement 
> like this appears:
>
> "To end on a sour note, HDFS Namenode is still a SPOF.  When we're done with 
> HBase 0.20 it should be the only SPOF."
>
> So now I am confused all over again. I thought that any namenode SPOF that 
> was fixed in Hadoop would also imply that it was fixed in HDFS. Doesn't HDFS 
> use Hadoop in some form to M/R the reads/writes? If that is not the case and 
> HDFS is going to suffer from a namenode SPOF in the near-term, are there 
> plans in the works to remedy that too?
>
> -----Original Message-----
> From: ext Ryan Rawson [mailto:[email protected]]
> Sent: 01 June, 2009 16:57
> To: [email protected]
> Subject: Re: State of HA
>
> Hey,
>
> Stack is saying that for HADOOP-4379, it fails 1/5th of the time - recovery
> takes more than 15 minutes, aka potentially unlimited amount of time.  That
> patch relies on lease recovery it seems, so it may not be the final answer
> for us.
>
> Now, on the subject of the rest of things, under Zookeeper we are doing a
> much better job at HA.  Regionserver crashes are detected significantly faster
> than the 2 minute lease timeout, and with my fixes you can take down any
> regionserver without getting 'stuck' with an unassigned ROOT/META
> (previously a problem).
>
> I have noticed on trunk I can kill and restart the master w/o taking down
> the cluster.  During master start-up it does a fairly good job at detecting
> node status and otherwise recovering.  I can't say about master elections
> exactly yet.
>
> The HA story is shaping up nicely.
>
> To end on a sour note, HDFS Namenode is still a SPOF.  When we're done with
> HBase 0.20 it should be the only SPOF.
>
> -ryan
>
> On Mon, Jun 1, 2009 at 1:50 PM, <[email protected]> wrote:
>
>> I am trying to parse this: are you implying that I can expect a 20% ("1 out
>> of 5 or so") success getting HA to work with this code?
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of ext
>> stack
>> Sent: 01 June, 2009 13:27
>> To: [email protected]
>> Subject: Re: State of HA
>>
>> You can pull TRUNK and try it with HADOOP-4379.
>>
>> The master failover works as J-D suggests.  It needs some polish but that's
>> on its way.  The HADOOP-4379 patch will get you a sync that works most of
>> the time (1 out of 5 or so in my testing) but hopefully that'll be addressed
>> soon too.  You'll also need HBASE-1470.  It's the bit of code that exploits
>> HADOOP-4379 when configuration is set right.
>>
>> If you need help setting up stuff, you know where to find us.  We want to
>> hear about any issues because we're hoping to tell the above as part of our
>> 0.20.0 release story.
>>
>> Yours,
>> St.Ack
>>
>> On Mon, Jun 1, 2009 at 7:59 AM, <[email protected]> wrote:
>>
>> > Hello,
>> >
>> > I have been looking at Jira and trying to get a current snapshot of the
>> > state of HA for HBase/Hadoop. I know that the zookeeper integration is the
>> > core of the HA story, but when is that slated for a "stable" debut? Is there
>> > anything that is currently in svn that we can pull and test?
>> >
>> > TIA,
>> >
>> > Andrew
>> >
>> >
>>
>
