Re: dfs datanode heartbeats and getBlockwork requests

Eric Baldeschwieler Wed, 05 Apr 2006 20:57:25 -0700

but I think we should work through the separate argument around configuration. As Sameer indicated, there are plenty of issues with our current approach. I've had a lot of experience with systems where you enumerate the nodes in your storage. Editing a single file is really not a lot of burden to impose on a system owner who wants to add nodes.

It makes replication decisions much easier too. Much of the "ease of admin" and "simplicity" perceived to be gained by not enumerating the nodes is simply illusory. If you lose a rack and the system works, the system should recover, but how do you plan to define a rack? If you lose some other 20% of your system, starting to replicate may or may not be the best strategy. It certainly is not the best strategy on startup.

To repeat what Sameer said, I think central configuration of the HDFS nodes is worth discussing. It's how I'd prefer to operate our cluster. It also keeps the configuration of the clients simpler and allows you to trivially move the master if it polls, which can be important if you lose the master node. The map-reduce master is different. There, task trackers registering seems the more natural design.


On Apr 5, 2006, at 12:11 PM, Yoram Arnon wrote:

Waiting until every block is accounted for is a good approach, except
when a block is actually lost, which we expect to be rare.
Declaring the expected data nodes can be flexible, however: initially the
list is empty, and it is populated as nodes first connect. It's kept
persistent, so that when the name node restarts it knows who was connected
when it went down, aka the expected list, and waits for them. When a node is
declared dead, and its blocks are replicated elsewhere, it is also taken off
the list. If/when it reconnects, it gets added back. That avoids having to
manually configure the list on the name node.
That said, it's useful to actually configure the name node manually,
preventing configuration errors where some data node connects to the wrong
name node. One central configuration is easier to control and maintain than
many remote configs. That's a separate argument though - we can get all we
need without this feature too.

Yoram
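
The self-populating expected list described above could look roughly like the following. This is only a sketch: the class name `ExpectedNodeList`, its method names, and the JSON file used for persistence are all hypothetical and are not taken from the dfs code.

```python
import json
import os
import tempfile


class ExpectedNodeList:
    """Persistent set of data nodes the name node expects (hypothetical sketch)."""

    def __init__(self, path):
        self.path = path
        self.nodes = set()
        # On restart, reload whoever was on the list when the name node went down.
        if os.path.exists(path):
            with open(path) as f:
                self.nodes = set(json.load(f))

    def _save(self):
        # Write to a temporary file and rename it into place, so a crash
        # in the middle of a save never leaves a partially written list.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(sorted(self.nodes), f)
        os.replace(tmp, self.path)

    def node_connected(self, node_id):
        # First connection (or a reconnect after being declared dead) adds the node.
        if node_id not in self.nodes:
            self.nodes.add(node_id)
            self._save()

    def node_declared_dead(self, node_id):
        # Called once the node's blocks have been replicated elsewhere.
        if node_id in self.nodes:
            self.nodes.discard(node_id)
            self._save()
```

The list starts empty, so no manual configuration is needed, and a restarted name node can wait for exactly the nodes recorded in the file.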

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 05, 2006 10:23 AM
To: [email protected]
Subject: Re: dfs datanode heartbeats and getBlockwork requests

I would rather avoid having to declare the set of expected datanodes if we
can avoid it, as I think it introduces a number of complexities.  For
example, if you wish to add new data nodes, you cannot simply configure them
to point to the name node and start them. Assuming we add a notion of 'on
same rack' or 'on same switch' to dfs, and can ensure that copies of a block
are always held on multiple racks/switches, then it's convenient to be able
to safely take racks and switches offline and online without coordinating
with the namenode. If a switch fails at startup, and 90% of the expected
nodes are not available, we should still start replication, no? I think a
startup replication delay at the namenode handles all of these cases. If
we're worried that the filesystem is unavailable, then we could make the
delay smarter. The namenode could delay some number of minutes or until
every block is accounted for, whichever comes first.  And it could
refuse/delay client requests until the delay period is over, so that
applications don't start up until files are completely available.

Doug
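
As a rough sketch of the "some number of minutes or until every block is accounted for, whichever comes first" delay: the function name, its parameters, and the 300-second default below are illustrative assumptions, not existing dfs code or settings.

```python
def startup_delay_over(elapsed_seconds, known_blocks, reported_blocks,
                       max_delay_seconds=300):
    """Return True once the name node may leave its startup delay.

    known_blocks    -- set of block ids the name node's metadata refers to
    reported_blocks -- set of block ids data nodes have reported so far
    Hypothetical helper; the default delay is an assumed value.
    """
    # Every block the name node knows about has been reported by some data node.
    all_accounted_for = known_blocks <= reported_blocks
    # Whichever comes first: full accounting, or the timeout.
    return all_accounted_for or elapsed_seconds >= max_delay_seconds
```

The same predicate could gate both replication decisions and client requests, so applications do not start until the delay period is over.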

Yoram Arnon wrote:
Right!
The name node, on startup, should know which data nodes are expected
to be there, and not make replication decisions before he knows who's
actually there and who's not.
A crude way to achieve that is by just waiting for a while, hoping
that all the data nodes connect.
A more refined way would be to compare who connected to who is
expected to connect. It enables faster startup when everyone just
connects quickly, and better robustness when some data nodes are slow
to connect, or when the name node is slow to process the barrage of
connections.
The rule could be "no replications until X% of the expected nodes have
connected, AND there are no pending unprocessed connection messages".
X should be on the order of 90, perhaps less for very small clusters.

Yoram
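
The rule above can be written down as a small predicate. This is a sketch only: the function and parameter names are made up for illustration, and the handling of an empty expected list is an assumption the message does not address.

```python
def may_start_replication(expected, connected, pending_messages,
                          threshold_percent=90):
    """Sketch of: no replications until X% of the expected nodes have
    connected AND there are no pending unprocessed connection messages.

    expected, connected -- sets of data node ids
    pending_messages    -- count of connection messages not yet processed
    """
    if pending_messages > 0:
        return False
    if not expected:
        # Assumption: with nothing expected, only the pending-message
        # condition applies.
        return True
    arrived = len(expected & connected)
    # Integer comparison avoids floating-point rounding at the threshold.
    return arrived * 100 >= threshold_percent * len(expected)
```

For a very small cluster, `threshold_percent` could be passed as something lower than 90, as suggested above.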

-----Original Message-----
From: Hairong Kuang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 04, 2006 5:09 PM
To: [email protected]
Subject: RE: dfs datanode heartbeats and getBlockwork requests

I think it is better to implement the start-up delay at the namenode.
But the key is that the name node should be able to tell if it is in a
steady state or not either at start-up time or at runtime after a
network disruption. It should not instruct datanodes to replicate or
delete any blocks before it has reached a steady state.

Hairong

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 04, 2006 9:58 AM
To: [email protected]
Subject: Re: dfs datanode heartbeats and getBlockwork requests

Eric Baldeschwieler wrote:
If we moved to a scheme where the name node was just given a small
number of blocks with each heartbeat, there would be no reason to not
start reporting blocks immediately, would there?

There would still be a small storm of un-needed replications on startup.
  Say it takes a minute at startup for all data nodes to report their
complete block lists to the name node.  If heartbeats are every 3
seconds, then all but the last data node to report in would be handed
20 small lists of blocks to start replicating. And the switches could
be saturated doing a lot of un-needed transfers, which would slow startup.
  Then, for the next minute after startup, the nodes would be told to
delete blocks that are now over-replicated.  We'd like startup to be
as fast and painless as possible.  Waiting a bit before checking to
see if blocks are over- or under-replicated seems a good way.

Doug
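
The figure of 20 in the estimate above is just the reporting window divided by the heartbeat interval. A tiny worked version, with a hypothetical function name:

```python
def replication_lists_per_node(report_window_seconds, heartbeat_interval_seconds):
    """Upper bound on how many small block lists a data node could be handed
    while the other nodes are still reporting (illustrative arithmetic only)."""
    return report_window_seconds // heartbeat_interval_seconds


# One minute of reporting with a heartbeat every 3 seconds.
lists = replication_lists_per_node(60, 3)
```

With a 60-second window and 3-second heartbeats, `lists` is 20, matching the number in the message.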
