Hey Nicolas, Another idea that might be able to help this without adding an entire new state to the protocol would be to just improve the HDFS client side in a few ways:
1) change the "deadnodes" cache to be a per-DFSClient structure instead of per-stream. So, after reading one block, we'd note that the DN was dead, and de-prioritize it on future reads. Of course we'd need to be able to re-try eventually since dead nodes do eventually restart. 2) when connecting to a DN, if the connection hasn't succeeded within 1-2 seconds, start making a connection to another replica. If the other replica succeeds first, then drop the connection to the first (slow) node. Wouldn't this solve the problem less invasively? -Todd On Thu, Jul 12, 2012 at 2:20 PM, N Keywal <[email protected]> wrote: > Hi, > > I have looked at the HBase MTTR scenario when we lose a full box with > its datanode and its hbase region server altogether: It means a RS > recovery, hence reading the logs files and writing new ones (splitting > logs). > > By default, HDFS considers a DN as dead when there is no heartbeat for > 10:30 minutes. Until this point, the NaneNode will consider it as > perfectly valid and it will get involved in all read & write > operations. > > And, as we lost a RegionServer, the recovery process will take place, > so we will read the WAL & write new log files. And with the RS, we > lost the replica of the WAL that was with the DN of the dead box. In > other words, 33% of the DN we need are dead. So, to read the WAL, per > block to read and per reader, we've got one chance out of 3 to go to > the dead DN, and to get a connect or read timeout issue. With a > reasonnable cluster and a distributed log split, we will have a sure > winner. > > > I looked in details at the hdfs configuration parameters and their > impacts. We have the calculated values: > heartbeat.interval = 3s ("dfs.heartbeat.interval"). > heartbeat.recheck.interval = 300s ("heartbeat.recheck.interval") > heartbeatExpireInterval = 2 * 300 + 10 * 3 = 630s => 10.30 minutes > > At least on 1.0.3, there is no shutdown hook to tell the NN to > consider this DN as dead, for example on a software crash. > > So before the 10:30 minutes, the DN is considered as fully available > by the NN. After this delay, HDFS is likely to start replicating the > blocks contained in the dead node to get back to the right number of > replica. As a consequence, if we're too aggressive we will have a side > effect here, adding workload to an already damaged cluster. According > to Stack: "even with this 10 minutes wait, the issue was met in real > production case in the past, and the latency increased badly". May be > there is some tuning to do here, but going under these 10 minutes does > not seem to be an easy path. > > For the clients, they don't fully rely on the NN feedback, and they > keep, per stream, a dead node list. So for a single file, a given > client will do the error once, but if there are multiple files it will > go back to the wrong DN. The settings are: > > connect/read: (3s (hardcoded) * NumberOfReplica) + 60s ("dfs.socket.timeout") > write: (5s (hardcoded) * NumberOfReplica) + 480s > ("dfs.datanode.socket.write.timeout") > > That will set a 69s timeout to get a "connect" error with the default config. > > I also had a look at larger failure scenarios, when we're loosing a > 20% of a cluster. The smaller the cluster is the easier it is to get > there. With the distributed log split, we're actually on a better > shape from an hdfs point of view: the master could have error writing > the files, because it could bet a dead DN 3 times in a row. If the > split is done by the RS, this issue disappears. We will however get a > lot of errors between the nodes. > > Finally, I had a look at the lease stuff Lease: write access lock to a > file, no other client can write to the file. But another client can > read it. Soft lease limit: another client can preempt the lease. > Configurable. > Default: 1 minute. > Hard lease limit: hdfs closes the file and free the resources on > behalf of the initial writer. Default: 60 minutes. > > => This should not impact HBase, as it does not prevent the recovery > process to read the WAL or to write new files. We just need writes to > be immediately available to readers, and it's possible thanks to > HDFS-200. So if a RS dies we should have no waits even if the lease > was not freed. This seems to be confirmed by tests. > => It's interesting to note that this setting is much more aggressive > than the one to declare a DN dead (1 minute vs. 10 minutes). Or, in > HBase, than the default ZK timeout (3 minutes). > => This said, HDFS states this: "When reading a file open for writing, > the length of the last block still being written is unknown > to the NameNode. In this case, the client asks one of the replicas for > the latest length before starting to read its content.". This leads to > an extra call to get the file length on the recovery (likely with the > ipc.Client), and we may once again go to the wrong dead DN. In this > case we have an extra socket timeout to consider. > > On paper, it would be great to set "dfs.socket.timeout" to a minimal > value during a log split, as we know we will get a dead DN 33% of the > time. It may be more complicated in real life as the connections are > shared per process. And we could still have the issue with the > ipc.Client. > > > As a conclusion, I think it could be interesting to have a third > status for DN in HDFS: between live and dead as today, we could have > "sick". We would have: > 1) Dead, known as such => As today: Start to replicate the blocks to > other nodes. You enter this state after 10 minutes. We could even wait > more. > 2) Likely to be dead: don't propose it for write blocks, put it with a > lower priority for read blocks. We would enter this state in two > conditions: > 2.1) No heartbeat for 30 seconds (configurable of course). As there > is an existing heartbeat of 3 seconds, we could even be more > aggressive here. > 2.2) We could have a shutdown hook in hdfs such as when a DN dies > 'properly' it says to the NN, and the NN can put it in this 'half dead > state'. > => In all cases, the node stays in the second state until the 10.30 > timeout is reached or until a heartbeat is received. > 3) Live. > > For HBase it would make life much simpler I think: > - no 69s timeout on mttr path > - less connection to dead nodes leading to ressources held all other > the place finishing by a timeout... > - and there is already a very aggressive 3s heartbeat, so we would > not add any workload. > > Thougths? > > Nicolas -- Todd Lipcon Software Engineer, Cloudera
