Re: hbase mttr vs. hdfs

lars hofhansl Fri, 13 Jul 2012 10:12:23 -0700

In that case, though, we'd slow down normal operation.
Maybe that can be alleviated with HDFS-1783/HBASE-6116, although as mentioned 
in HBASE-6116, I have not been able to measure any performance improvement from 
this so far.


-- Lars



________________________________
 From: N Keywal <[email protected]>
To: [email protected] 
Sent: Friday, July 13, 2012 6:27 AM
Subject: Re: hbase mttr vs. hdfs
 
I looked at this part of hdfs code, and
- it's not simple to add it in a clean way, even if doing it is possible.
- i was wrong the the 3s hearbeat: the hearbeat is every 5 minutes
actually. So changing this would not be without a lot of side effects.
- as a side note HADOOP-8144 is interesting...

So not writing the WAL on the local machine could be a good medium
term option, that could likely be implemented with HDFS-385 (made
available recently in "branch-1". I don't know what it stands for).

On Fri, Jul 13, 2012 at 9:53 AM, N Keywal <[email protected]> wrote:
> Another option would be to never write the wal locally: in nearly all
> cases it won't be used as it's on the dead box. And then the recovery
> would be directed by the NN to a dead DN in a single box failure. And
> we would have 3 copies instead of 2, increasing global reliability...
>
> On Fri, Jul 13, 2012 at 12:16 AM, N Keywal <[email protected]> wrote:
>> Hi Todd,
>>
>> Do you think the change would be too intrusive for hdfs? I aggree,
>> there are many less critical components in hadoop :-). I was hoping
>> that this state could be internal to the NN and could remain localized
>> without any interface change...
>>
>> Your proposal would help for sure. I see 3 points if we try to do it
>> for specific functions like recovery.
>>  - we would then need to manage the case when all 3 nodes timeouts
>> after 1s, hoping that two of them are wrong positive...
>>  - the writes between DN would still be with the old timeout. I didn't
>> look in details at the impact. It won't be an issue for single box
>> crash, but for large failure it could.
>>  - we would want to change it to for the ipc.Client as well. Note sure
>> if the change would not be visible to all functions.
>>
>> What worries me about setting very low timeouts is that it's difficult
>> to validate, it tends to work until it goes to production...
>>
>> I was also thinking of making the deadNodes list public in the client,
>> so hbase could tell to the DFSClient: 'this node is dead, I know it
>> because I'm recovering the RS', but it would have some false positive
>> (software region server crash), and seems a little like a
>> workaround...
>>
>> In the middle (thinking again about your proposal), we could add a
>> function in hbase that would first check the DNs owning the WAL,
>> trying to connect with a 1s timeout, to be able to tell the DFSClient
>> who's dead.
>> Or we could put this function in DFSClient, a kind of boolean to say
>> fail fast on dn errors for this read...
>>
>>
>>
>> On Thu, Jul 12, 2012 at 11:24 PM, Todd Lipcon <[email protected]> wrote:
>>> Hey Nicolas,
>>>
>>> Another idea that might be able to help this without adding an entire
>>> new state to the protocol would be to just improve the HDFS client
>>> side in a few ways:
>>>
>>> 1) change the "deadnodes" cache to be a per-DFSClient structure
>>> instead of per-stream. So, after reading one block, we'd note that the
>>> DN was dead, and de-prioritize it on future reads. Of course we'd need
>>> to be able to re-try eventually since dead nodes do eventually
>>> restart.
>>> 2) when connecting to a DN, if the connection hasn't succeeded within
>>> 1-2 seconds, start making a connection to another replica. If the
>>> other replica succeeds first, then drop the connection to the first
>>> (slow) node.
>>>
>>> Wouldn't this solve the problem less invasively?
>>>
>>> -Todd
>>>
>>> On Thu, Jul 12, 2012 at 2:20 PM, N Keywal <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I have looked at the HBase MTTR scenario when we lose a full box with
>>>> its datanode and its hbase region server altogether: It means a RS
>>>> recovery, hence reading the logs files and writing new ones (splitting
>>>> logs).
>>>>
>>>> By default, HDFS considers a DN as dead when there is no heartbeat for
>>>> 10:30 minutes. Until this point, the NaneNode will consider it as
>>>> perfectly valid and it will get involved in all read & write
>>>> operations.
>>>>
>>>> And, as we lost a RegionServer, the recovery process will take place,
>>>> so we will read the WAL & write new log files. And with the RS, we
>>>> lost the replica of the WAL that was with the DN of the dead box. In
>>>> other words, 33% of the DN we need are dead. So, to read the WAL, per
>>>> block to read and per reader, we've got one chance out of 3 to go to
>>>> the dead DN, and to get a connect or read timeout issue. With a
>>>> reasonnable cluster and a distributed log split, we will have a sure
>>>> winner.
>>>>
>>>>
>>>> I looked in details at the hdfs configuration parameters and their
>>>> impacts. We have the calculated values:
>>>> heartbeat.interval = 3s ("dfs.heartbeat.interval").
>>>> heartbeat.recheck.interval = 300s ("heartbeat.recheck.interval")
>>>> heartbeatExpireInterval = 2 * 300 + 10 * 3 = 630s => 10.30 minutes
>>>>
>>>> At least on 1.0.3, there is no shutdown hook to tell the NN to
>>>> consider this DN as dead, for example on a software crash.
>>>>
>>>> So before the 10:30 minutes, the DN is considered as fully available
>>>> by the NN.  After this delay, HDFS is likely to start replicating the
>>>> blocks contained in the dead node to get back to the right number of
>>>> replica. As a consequence, if we're too aggressive we will have a side
>>>> effect here, adding workload to an already damaged cluster. According
>>>> to Stack: "even with this 10 minutes wait, the issue was met in real
>>>> production case in the past, and the latency increased badly". May be
>>>> there is some tuning to do here, but going under these 10 minutes does
>>>> not seem to be an easy path.
>>>>
>>>> For the clients, they don't fully rely on the NN feedback, and they
>>>> keep, per stream, a dead node list. So for a single file, a given
>>>> client will do the error once, but if there are multiple files it will
>>>> go back to the wrong DN. The settings are:
>>>>
>>>> connect/read:  (3s (hardcoded) * NumberOfReplica) + 60s 
>>>> ("dfs.socket.timeout")
>>>> write: (5s (hardcoded) * NumberOfReplica) + 480s
>>>> ("dfs.datanode.socket.write.timeout")
>>>>
>>>> That will set a 69s timeout to get a "connect" error with the default 
>>>> config.
>>>>
>>>> I also had a look at larger failure scenarios, when we're loosing a
>>>> 20% of a cluster. The smaller the cluster is the easier it is to get
>>>> there. With the distributed log split, we're actually on a better
>>>> shape from an hdfs point of view: the master could have error writing
>>>> the files, because it could bet a dead DN 3 times in a row. If the
>>>> split is done by the RS, this issue disappears. We will however get a
>>>> lot of errors between the nodes.
>>>>
>>>> Finally, I had a look at the lease stuff Lease: write access lock to a
>>>> file, no other client can write to the file. But another client can
>>>> read it. Soft lease limit: another client can preempt the lease.
>>>> Configurable.
>>>> Default: 1 minute.
>>>> Hard lease limit: hdfs closes the file and free the resources on
>>>> behalf of the initial writer. Default: 60 minutes.
>>>>
>>>> => This should not impact HBase, as it does not prevent the recovery
>>>> process to read the WAL or to write new files. We just need writes to
>>>> be immediately available to readers, and it's possible thanks to
>>>> HDFS-200. So if a RS dies we should have no waits even if the lease
>>>> was not freed. This seems to be confirmed by tests.
>>>> => It's interesting to note that this setting is much more aggressive
>>>> than the one to declare a DN dead (1 minute vs. 10 minutes). Or, in
>>>> HBase, than the default ZK timeout (3 minutes).
>>>> => This said, HDFS states this: "When reading a file open for writing,
>>>> the length of the last block still being written is unknown
>>>> to the NameNode. In this case, the client asks one of the replicas for
>>>> the latest length before starting to read its content.". This leads to
>>>> an extra call to get the file length on the recovery (likely with the
>>>> ipc.Client), and we may once again go to the wrong dead DN. In this
>>>> case we have an extra socket timeout to consider.
>>>>
>>>> On paper, it would be great to set "dfs.socket.timeout" to a minimal
>>>> value during a log split, as we know we will get a dead DN 33% of the
>>>> time. It may be more complicated in real life as the connections are
>>>> shared per process. And we could still have the issue with the
>>>> ipc.Client.
>>>>
>>>>
>>>> As a conclusion, I think it could be interesting to have a third
>>>> status for DN in HDFS: between live and dead as today, we could have
>>>> "sick". We would have:
>>>> 1) Dead, known as such => As today: Start to replicate the blocks to
>>>> other nodes. You enter this state after 10 minutes. We could even wait
>>>> more.
>>>> 2) Likely to be dead: don't propose it for write blocks, put it with a
>>>> lower priority for read blocks. We would enter this state in two
>>>> conditions:
>>>>   2.1) No heartbeat for 30 seconds (configurable of course). As there
>>>> is an existing heartbeat of 3 seconds, we could even be more
>>>> aggressive here.
>>>>   2.2) We could have a shutdown hook in hdfs such as when a DN dies
>>>> 'properly' it says to the NN, and the NN can put it in this 'half dead
>>>> state'.
>>>>   => In all cases, the node stays in the second state until the 10.30
>>>> timeout is reached or until a heartbeat is received.
>>>>  3) Live.
>>>>
>>>>  For HBase it would make life much simpler I think:
>>>>  - no 69s timeout on mttr path
>>>>  - less connection to dead nodes leading to ressources held all other
>>>> the place finishing by a timeout...
>>>>  - and there is already a very aggressive 3s heartbeat, so we would
>>>> not add any workload.
>>>>
>>>>  Thougths?
>>>>
>>>>  Nicolas
>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera

Re: hbase mttr vs. hdfs

Reply via email to