There's complexity both ways. Not declaring the set of expected data
nodes introduces a lot of complexity and delay in namenode startup, as
seen in this thread.
In this respect, it feels like there's a fundamental difference between
batch systems and storage systems. For a MapReduce JobTracker or a
Condor-like system, resource usage is transient: a job is executed on a
node, it completes and exits, and the system no longer cares about the
node. In a DFS, nodes store persistent data which mostly stays in place
across namenode restarts. Checkpointing the mapping of blocks to nodes
would make startup much faster: instead of waiting for datanodes to
connect, the namenode could poll and find out who was around. Adding
nodes would then require co-ordination with the namenode, but node
addition seems like a big enough discontinuity for most installations
that this is a small price to pay.
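Roughly what I have in mind, as a sketch only -- the class and method
names below are made up for illustration, not the current DFS code:

    import java.io.*;
    import java.util.*;

    // Sketch: checkpoint the block->datanode mapping so that on restart the
    // namenode can poll the nodes it already knows about instead of waiting
    // for every one of them to report in.
    class BlockMapCheckpoint {

      // Written periodically while the namenode is running, e.g. alongside
      // the image/edits files.
      void save(Map<String, long[]> blockIdsPerNode, File f) throws IOException {
        DataOutputStream out = new DataOutputStream(new FileOutputStream(f));
        out.writeInt(blockIdsPerNode.size());
        for (Map.Entry<String, long[]> e : blockIdsPerNode.entrySet()) {
          out.writeUTF(e.getKey());                  // datanode host:port
          long[] ids = e.getValue();
          out.writeInt(ids.length);
          for (int i = 0; i < ids.length; i++) {
            out.writeLong(ids[i]);
          }
        }
        out.close();
      }

      // At startup: actively ask the checkpointed nodes who is still around,
      // instead of making replication decisions from a partial picture.
      Set<String> pollKnownNodes(Set<String> checkpointedNodes) {
        Set<String> alive = new HashSet<String>();
        for (String node : checkpointedNodes) {
          if (ping(node)) {                          // hypothetical liveness RPC
            alive.add(node);
          }
        }
        return alive;
      }

      boolean ping(String node) { return false; }    // placeholder for a real RPC
    }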
Sameer
Doug Cutting wrote:
I would rather not require declaring the set of expected data nodes if
we can avoid it, as I think it introduces a number of complexities. For
example, if you wish to add new data nodes, you cannot simply configure
them to point to the name node and start them. Assuming we add a notion
of 'on same rack' or 'on same switch' to dfs, and can ensure that copies
of a block are always held on multiple racks/switches, then it's
convenient to be able to safely take racks and switches offline and
online without coordinating with the namenode. If a switch fails at
startup, and 90% of the expected nodes are not available, we should
still start replication, no? I think a startup replication delay at the
namenode handles all of these cases. If we're worried that the
filesystem is unavailable, then we could make the delay smarter. The
namenode could delay some number of minutes or until every block is
accounted for, whichever comes first. And it could refuse/delay client
requests until the delay period is over, so that applications don't
start up until files are completely available.
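Something like this, very roughly -- the names here are invented for
illustration:

    // Sketch of the smarter delay: wait a fixed period or until every block
    // is accounted for, whichever comes first, and hold back replication
    // work and client requests until then.
    class StartupDelay {
      private final long startTime = System.currentTimeMillis();
      private final long maxDelayMillis;

      StartupDelay(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
      }

      boolean delayOver(long blocksAccountedFor, long blocksExpected) {
        boolean allAccountedFor = blocksAccountedFor >= blocksExpected;
        boolean timedOut = System.currentTimeMillis() - startTime >= maxDelayMillis;
        return allAccountedFor || timedOut;
      }
    }

    // In the namenode, before issuing replication work or serving a client:
    //   if (!startupDelay.delayOver(reported, expected)) {
    //     ...  // skip replication checks, or queue/refuse the client request
    //   }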
Doug
Yoram Arnon wrote:
Right!
The name node, on startup, should know which data nodes are expected to
be there, and not make replication decisions before it knows who's
actually there and who's not.
A crude way to achieve that is to just wait for a while, hoping that all
the data nodes connect.
A more refined way would be to compare who has connected against who is
expected to connect. That enables faster startup when everyone connects
quickly, and better robustness when some data nodes are slow to connect,
or when the name node is slow to process the barrage of connections.
The rule could be "no replications until X% of the expected nodes have
connected, AND there are no pending unprocessed connection messages". X
should be on the order of 90, perhaps less for very small clusters.
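In rough code terms, just to illustrate the rule (names made up):

    // Allow replication only once X% of the expected data nodes have
    // connected AND the connection queue is drained.
    class ReplicationGate {
      private final int expectedNodes;
      private final double threshold;    // e.g. 0.9 for "on the order of 90"

      ReplicationGate(int expectedNodes, double threshold) {
        this.expectedNodes = expectedNodes;
        this.threshold = threshold;
      }

      boolean mayReplicate(int connectedNodes, int pendingConnectionMessages) {
        return connectedNodes >= threshold * expectedNodes
            && pendingConnectionMessages == 0;
      }
    }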
Yoram
-----Original Message-----
From: Hairong Kuang [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 04, 2006 5:09 PM
To: [email protected]
Subject: RE: dfs datanode heartbeats and getBlockwork requests
I think it is better to implement the start-up delay at the namenode. But
the key is that the name node should be able to tell whether or not it is
in a steady state, both at start-up time and at runtime after a network
disruption. It should not instruct datanodes to replicate or delete any
blocks before it has reached a steady state.
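Roughly, the gate might look like this (just a sketch, names invented):

    import java.util.*;

    // Gate both replicate and delete orders on a single steady-state check,
    // so the same logic covers startup and recovery from a network disruption.
    interface SteadyStateCheck {
      boolean inSteadyState();    // e.g. enough nodes reporting, no backlog
    }

    class BlockWorkScheduler {
      private final SteadyStateCheck check;

      BlockWorkScheduler(SteadyStateCheck check) {
        this.check = check;
      }

      List<String> getBlockWorkFor(String datanode) {
        if (!check.inSteadyState()) {
          return Collections.<String>emptyList();   // no replicate/delete orders yet
        }
        // ... otherwise compute replication and deletion work as usual ...
        return Collections.<String>emptyList();
      }
    }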
Hairong
-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 04, 2006 9:58 AM
To: [email protected]
Subject: Re: dfs datanode heartbeats and getBlockwork requests
Eric Baldeschwieler wrote:
If we moved to a scheme where the name node was just given a small
number of blocks with each heartbeat, there would be no reason to not
start reporting blocks immediately, would there?
There would still be a small storm of un-needed replications on startup.
Say it takes a minute at startup for all data nodes to report their
complete block lists to the name node. If heartbeats are every 3 seconds,
then all but the last data node to report in would be handed 20 small
lists of blocks to start replicating. And the switches could be saturated
doing a lot of un-needed transfers, which would slow startup. Then, for
the next minute after startup, the nodes would be told to delete blocks
that are now over-replicated. We'd like startup to be as fast and painless
as possible; waiting a bit before checking whether blocks are over- or
under-replicated seems a good way to achieve that.
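(To spell out the arithmetic: a 60 second reporting window divided by 3
second heartbeats is 20 heartbeats, so each node that reports in early
could be handed on the order of 20 redundant work lists before the name
node has the complete picture.)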
Doug