One clarification on HDFS-385: the last post on that JIRA only means that a patch was submitted for branch-1. I don't see it integrated yet.

On Fri, Jul 13, 2012 at 6:27 AM, N Keywal <[email protected]> wrote:
> I looked at this part of the hdfs code, and:
> - it's not simple to add it in a clean way, even if doing it is possible.
> - I was wrong about the 3s heartbeat: the heartbeat is actually every
>   5 minutes. So changing this would not be without a lot of side effects.
> - as a side note, HADOOP-8144 is interesting...
>
> So not writing the WAL on the local machine could be a good medium-term
> option, and could likely be implemented with HDFS-385 (made available
> recently in "branch-1"; I don't know what that stands for).
>
> On Fri, Jul 13, 2012 at 9:53 AM, N Keywal <[email protected]> wrote:
> > Another option would be to never write the WAL locally: in nearly all
> > cases it won't be used anyway, as it's on the dead box. Then, in a
> > single-box failure, the recovery would not be directed by the NN to a
> > dead DN, and we would have 3 live copies instead of 2, increasing
> > global reliability...
> >
> > On Fri, Jul 13, 2012 at 12:16 AM, N Keywal <[email protected]> wrote:
> >> Hi Todd,
> >>
> >> Do you think the change would be too intrusive for hdfs? I agree,
> >> there are many less critical components in hadoop :-). I was hoping
> >> that this state could be internal to the NN and could remain
> >> localized, without any interface change...
> >>
> >> Your proposal would help for sure. I see 3 points if we try to do it
> >> for specific functions like recovery:
> >> - we would then need to manage the case where all 3 nodes time out
> >>   after 1s, hoping that two of them are false positives...
> >> - the writes between DNs would still use the old timeout. I didn't
> >>   look in detail at the impact. It won't be an issue for a single-box
> >>   crash, but for a large failure it could be.
> >> - we would want to change it for the ipc.Client as well. I'm not sure
> >>   the change would not be visible to all functions.
> >>
> >> What worries me about setting very low timeouts is that it's
> >> difficult to validate; it tends to work until it goes to production...
> >>
> >> I was also thinking of making the deadNodes list public in the
> >> client, so hbase could tell the DFSClient: "this node is dead, I know
> >> it because I'm recovering the RS". But it would have some false
> >> positives (software region server crash), and it seems a little like
> >> a workaround...
> >>
> >> In between (thinking again about your proposal), we could add a
> >> function in hbase that would first check the DNs owning the WAL,
> >> trying to connect with a 1s timeout, to be able to tell the DFSClient
> >> who's dead.
> >> Or we could put this function in the DFSClient, a kind of boolean to
> >> say "fail fast on DN errors for this read"...
> >>
> >> On Thu, Jul 12, 2012 at 11:24 PM, Todd Lipcon <[email protected]> wrote:
> >>> Hey Nicolas,
> >>>
> >>> Another idea that might be able to help this without adding an
> >>> entire new state to the protocol would be to just improve the HDFS
> >>> client side in a few ways:
> >>>
> >>> 1) change the "deadnodes" cache to be a per-DFSClient structure
> >>> instead of per-stream. So, after reading one block, we'd note that
> >>> the DN was dead and de-prioritize it on future reads. Of course we'd
> >>> need to be able to retry eventually, since dead nodes do eventually
> >>> restart.
> >>> 2) when connecting to a DN, if the connection hasn't succeeded
> >>> within 1-2 seconds, start making a connection to another replica. If
> >>> the other replica succeeds first, then drop the connection to the
> >>> first (slow) node.
> >>>
> >>> Wouldn't this solve the problem less invasively?
> >>>
> >>> -Todd
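(A minimal sketch of the hedged connect described in point 2 of Todd's message above: give the first replica a short head start, then race a second replica and keep whichever socket connects first. The class, method and timeout names are invented for illustration; this is not actual DFSClient code, and it assumes at least two replicas to choose from, as a WAL block normally has three.)

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class HedgedConnect {
  private static final int HEDGE_DELAY_MS = 1500;       // the "1-2 seconds" from the proposal
  private static final int CONNECT_TIMEOUT_MS = 60_000; // per-attempt hard limit

  private final ExecutorService pool = Executors.newCachedThreadPool();

  public Socket connectToAnyReplica(List<InetSocketAddress> replicas)
      throws InterruptedException, ExecutionException {
    CompletionService<Socket> cs = new ExecutorCompletionService<>(pool);
    cs.submit(() -> connect(replicas.get(0)));

    // Give the primary replica a head start before hedging.
    Future<Socket> first = cs.poll(HEDGE_DELAY_MS, TimeUnit.MILLISECONDS);
    if (first != null) {
      try {
        return first.get();              // primary connected quickly: no hedge needed
      } catch (ExecutionException e) {
        // Primary failed fast: fall through and try the second replica.
      }
    }

    cs.submit(() -> connect(replicas.get(1)));   // hedge: race a second replica
    int outstanding = (first == null) ? 2 : 1;   // attempts still in flight

    ExecutionException lastFailure = null;
    for (int i = 0; i < outstanding; i++) {
      try {
        // First attempt to finish wins; the losing attempt keeps running in the
        // background (a real implementation would cancel and close it).
        return cs.take().get();
      } catch (ExecutionException e) {
        lastFailure = e;                 // that attempt failed, wait for the other one
      }
    }
    throw lastFailure;                   // every attempt failed
  }

  private Socket connect(InetSocketAddress addr) throws IOException {
    Socket s = new Socket();
    s.connect(addr, CONNECT_TIMEOUT_MS);
    return s;
  }
}

(The point of racing instead of failing over sequentially is that a healthy replica answers within milliseconds, so the extra connection costs little, while a dead replica would otherwise pin the reader for the long default connect timeout discussed further down the thread, 69s with the defaults.)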
> >>> On Thu, Jul 12, 2012 at 2:20 PM, N Keywal <[email protected]> wrote:
> >>>> Hi,
> >>>>
> >>>> I have looked at the HBase MTTR scenario where we lose a full box,
> >>>> with its datanode and its hbase region server altogether: it means a
> >>>> RS recovery, hence reading the log files and writing new ones
> >>>> (splitting logs).
> >>>>
> >>>> By default, HDFS considers a DN as dead when there has been no
> >>>> heartbeat for 10 minutes 30 seconds. Until this point, the NameNode
> >>>> considers it perfectly valid and it will get involved in all read &
> >>>> write operations.
> >>>>
> >>>> And, as we lost a RegionServer, the recovery process will take place,
> >>>> so we will read the WAL & write new log files. And with the RS, we
> >>>> lost the replica of the WAL that was on the DN of the dead box. In
> >>>> other words, 33% of the DNs we need are dead. So, to read the WAL,
> >>>> per block to read and per reader, we've got one chance out of 3 to go
> >>>> to the dead DN and get a connect or read timeout. With a reasonably
> >>>> sized cluster and a distributed log split, at least one reader is
> >>>> virtually guaranteed to hit it.
> >>>>
> >>>> I looked in detail at the hdfs configuration parameters and their
> >>>> impact. The calculated values are:
> >>>> heartbeat.interval = 3s ("dfs.heartbeat.interval")
> >>>> heartbeat.recheck.interval = 300s ("heartbeat.recheck.interval")
> >>>> heartbeatExpireInterval = 2 * 300 + 10 * 3 = 630s => 10 minutes 30s
> >>>>
> >>>> At least on 1.0.3, there is no shutdown hook to tell the NN to
> >>>> consider a DN as dead, for example on a software crash.
> >>>>
> >>>> So before this 10:30 expiry, the DN is considered fully available by
> >>>> the NN. After this delay, HDFS is likely to start replicating the
> >>>> blocks contained on the dead node to get back to the right number of
> >>>> replicas. As a consequence, if we're too aggressive we will have a
> >>>> side effect here, adding workload to an already damaged cluster.
> >>>> According to Stack: "even with this 10 minutes wait, the issue was
> >>>> met in real production cases in the past, and the latency increased
> >>>> badly". Maybe there is some tuning to do here, but going under these
> >>>> 10 minutes does not seem to be an easy path.
> >>>>
> >>>> The clients don't fully rely on the NN feedback; they keep a dead
> >>>> node list per stream. So for a single file, a given client will make
> >>>> the error only once, but with multiple files it will go back to the
> >>>> wrong DN. The settings are:
> >>>>
> >>>> connect/read: (3s (hardcoded) * NumberOfReplica) + 60s
> >>>> ("dfs.socket.timeout")
> >>>> write: (5s (hardcoded) * NumberOfReplica) + 480s
> >>>> ("dfs.datanode.socket.write.timeout")
> >>>>
> >>>> That gives a 69s timeout to get a "connect" error with the default
> >>>> config.
> >>>>
> >>>> I also had a look at larger failure scenarios, when we're losing 20%
> >>>> of a cluster. The smaller the cluster, the easier it is to get there.
> >>>> With the distributed log split, we're actually in better shape from
> >>>> an hdfs point of view: with a master-driven split, the master could
> >>>> have errors writing the files, because it could get a dead DN 3 times
> >>>> in a row; if the split is done by the RSs, this issue disappears. We
> >>>> will however get a lot of errors between the nodes.
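(To make the arithmetic above concrete, a small sketch that recomputes the NN-side expiry and the client-side connect/read and write timeouts from the default values quoted in the message. The variable names are mine, not HDFS's internal ones.)

public class TimeoutMath {
  public static void main(String[] args) {
    long heartbeatIntervalSec = 3;     // dfs.heartbeat.interval
    long heartbeatRecheckSec  = 300;   // heartbeat.recheck.interval

    // NN-side expiry: 2 * recheck + 10 * heartbeat = 630s, i.e. 10 minutes 30 seconds.
    long heartbeatExpireSec = 2 * heartbeatRecheckSec + 10 * heartbeatIntervalSec;

    int  replicas         = 3;         // typical WAL replication
    long socketTimeoutSec = 60;        // dfs.socket.timeout
    long writeTimeoutSec  = 480;       // dfs.datanode.socket.write.timeout

    // Client-side connect/read: 3s (hardcoded) per replica + dfs.socket.timeout = 69s.
    long connectReadTimeoutSec = 3 * replicas + socketTimeoutSec;
    // Client-side write: 5s (hardcoded) per replica + write timeout = 495s.
    long writeTimeoutTotalSec = 5 * replicas + writeTimeoutSec;

    System.out.printf("expiry=%ds connect/read=%ds write=%ds%n",
        heartbeatExpireSec, connectReadTimeoutSec, writeTimeoutTotalSec);
  }
}

(With the defaults this prints expiry=630s connect/read=69s write=495s, matching the 630s and 69s figures above.)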
> >>>>
> >>>> Finally, I had a look at the lease mechanism.
> >>>> Lease: a write-access lock on a file; no other client can write to
> >>>> the file, but another client can read it.
> >>>> Soft lease limit: another client can preempt the lease. Configurable.
> >>>> Default: 1 minute.
> >>>> Hard lease limit: hdfs closes the file and frees the resources on
> >>>> behalf of the initial writer. Default: 60 minutes.
> >>>>
> >>>> => This should not impact HBase, as it does not prevent the recovery
> >>>> process from reading the WAL or writing new files. We just need
> >>>> writes to be immediately visible to readers, and that's possible
> >>>> thanks to HDFS-200. So if a RS dies we should have no waits even if
> >>>> the lease was not freed. This seems to be confirmed by tests.
> >>>> => It's interesting to note that this setting is much more aggressive
> >>>> than the one used to declare a DN dead (1 minute vs. 10 minutes), or,
> >>>> in HBase, than the default ZK timeout (3 minutes).
> >>>> => That said, HDFS states this: "When reading a file open for
> >>>> writing, the length of the last block still being written is unknown
> >>>> to the NameNode. In this case, the client asks one of the replicas
> >>>> for the latest length before starting to read its content." This
> >>>> leads to an extra call to get the file length on recovery (likely
> >>>> through the ipc.Client), and we may once again go to the wrong, dead
> >>>> DN. In this case we have an extra socket timeout to consider.
> >>>>
> >>>> On paper, it would be great to set "dfs.socket.timeout" to a minimal
> >>>> value during a log split, as we know we will hit a dead DN 33% of the
> >>>> time. It may be more complicated in real life, as the connections are
> >>>> shared per process. And we could still have the issue with the
> >>>> ipc.Client.
> >>>>
> >>>> As a conclusion, I think it could be interesting to have a third
> >>>> status for DNs in HDFS: between live and dead as today, we could have
> >>>> "sick". We would have:
> >>>> 1) Dead, known as such => as today: start replicating the blocks to
> >>>> other nodes. You enter this state after 10 minutes; we could even
> >>>> wait more.
> >>>> 2) Likely to be dead: don't propose it for writing blocks, and give
> >>>> it a lower priority for reading blocks. We would enter this state
> >>>> under two conditions:
> >>>> 2.1) No heartbeat for 30 seconds (configurable of course). As there
> >>>> is an existing heartbeat of 3 seconds, we could even be more
> >>>> aggressive here.
> >>>> 2.2) We could have a shutdown hook in hdfs such that when a DN dies
> >>>> 'properly' it tells the NN, and the NN can put it in this 'half-dead'
> >>>> state.
> >>>> => In both cases, the node stays in the second state until the 10:30
> >>>> timeout is reached or until a heartbeat is received.
> >>>> 3) Live.
> >>>>
> >>>> For HBase it would make life much simpler, I think:
> >>>> - no 69s timeout on the MTTR path
> >>>> - fewer connections to dead nodes, so fewer resources held all over
> >>>> the place until a timeout...
> >>>> - and there is already a very aggressive 3s heartbeat, so we would
> >>>> not add any workload.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> Nicolas
> >>>
> >>>
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
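(A toy model of the live / "sick" / dead proposal above, just to make the transitions explicit. The thresholds, class and method names are illustrative only; none of this is NameNode code.)

public class DataNodeLiveness {
  public enum State { LIVE, SICK, DEAD }

  private static final long SICK_AFTER_MS = 30_000;   // proposed "likely dead" threshold
  private static final long DEAD_AFTER_MS = 630_000;  // existing 10:30 expiry

  /** State derived from the time since the last heartbeat. */
  public static State fromLastHeartbeat(long lastHeartbeatMs, long nowMs) {
    long silence = nowMs - lastHeartbeatMs;
    if (silence >= DEAD_AFTER_MS) {
      return State.DEAD;   // as today: start re-replicating its blocks
    }
    if (silence >= SICK_AFTER_MS) {
      return State.SICK;   // skip for writes, deprioritize for reads
    }
    return State.LIVE;
  }

  /** A clean shutdown hook (case 2.2) would push the node straight to SICK. */
  public static State onGracefulShutdown() {
    return State.SICK;     // stays SICK until the 10:30 expiry or a new heartbeat
  }
}

(The same SICK state covers both the silent-crash case in 2.1 and the graceful-shutdown case in 2.2; only the 10:30 expiry or a fresh heartbeat moves the node out of it, into DEAD or back to LIVE respectively.)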
