There's complexity both ways. Not declaring the set of expected data
nodes introduces a lot of complexity and delay in namenode startup, as
seen in this thread.
In this respect, it feels like there's a fundamental difference between
batch systems and storage systems. For a MapReduce JobTracker or a
Condor-like system, resource usage is transient: a job is executed on a
node, it completes and exits, and the system no longer cares about the
node. In a DFS, nodes store persistent data which mostly stays in place
across namenode restarts. Checkpointing the mapping of blocks to nodes
would make startup much faster: instead of waiting for datanodes to
connect, the namenode could poll and find out who was around. Adding
nodes would then require co-ordination with the namenode, but node
addition seems like a big enough discontinuity for most installations
that this is a small price to pay.
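Roughly what I have in mind, as a sketch only -- the class and method
names below are made up for illustration, not the current DFS code:

    import java.io.*;
    import java.util.*;

    // Sketch: checkpoint the block->datanode mapping so that on restart the
    // namenode can poll the nodes it already knows about instead of waiting
    // for every one of them to report in.
    class BlockMapCheckpoint {

      // Written periodically while the namenode is running, e.g. alongside
      // the image/edits files.
      void save(Map<String, long[]> blockIdsPerNode, File f) throws IOException {
        DataOutputStream out = new DataOutputStream(new FileOutputStream(f));
        out.writeInt(blockIdsPerNode.size());
        for (Map.Entry<String, long[]> e : blockIdsPerNode.entrySet()) {
          out.writeUTF(e.getKey());                  // datanode host:port
          long[] ids = e.getValue();
          out.writeInt(ids.length);
          for (int i = 0; i < ids.length; i++) {
            out.writeLong(ids[i]);
          }
        }
        out.close();
      }

      // At startup: actively ask the checkpointed nodes who is still around,
      // instead of making replication decisions from a partial picture.
      Set<String> pollKnownNodes(Set<String> checkpointedNodes) {
        Set<String> alive = new HashSet<String>();
        for (String node : checkpointedNodes) {
          if (ping(node)) {                          // hypothetical liveness RPC
            alive.add(node);
          }
        }
        return alive;
      }

      boolean ping(String node) { return false; }    // placeholder for a real RPC
    }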
Sameer
Doug Cutting wrote:
I would rather not require declaring the set of expected data nodes if
we can avoid it, as I think it introduces a number of complexities. For
example, if you wish to add new data nodes, you cannot simply configure
them to point to the name node and start them. Assuming we add a notion
of 'on same rack' or 'on same switch' to dfs, and can ensure that copies
of a block are always held on multiple racks/switches, then it's
convenient to be able to safely take racks and switches offline and
online without coordinating with the namenode. If a switch fails at
startup, and 90% of the expected nodes are not available, we should
still start replication, no? I think a startup replication delay at the
namenode handles all of these cases. If we're worried that the
filesystem is unavailable, then we could make the delay smarter. The
namenode could delay some number of minutes or until every block is
accounted for, whichever comes first. And it could refuse/delay client
requests until the delay period is over, so that applications don't
start up until files are completely available.
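Something like this, very roughly -- the names here are invented for
illustration:

    // Sketch of the smarter delay: wait a fixed period or until every block
    // is accounted for, whichever comes first, and hold back replication
    // work and client requests until then.
    class StartupDelay {
      private final long startTime = System.currentTimeMillis();
      private final long maxDelayMillis;

      StartupDelay(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
      }

      boolean delayOver(long blocksAccountedFor, long blocksExpected) {
        boolean allAccountedFor = blocksAccountedFor >= blocksExpected;
        boolean timedOut = System.currentTimeMillis() - startTime >= maxDelayMillis;
        return allAccountedFor || timedOut;
      }
    }

    // In the namenode, before issuing replication work or serving a client:
    //   if (!startupDelay.delayOver(reported, expected)) {
    //     ...  // skip replication checks, or queue/refuse the client request
    //   }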
Doug
Yoram Arnon wrote:
Right!
The name node, on startup, should know which data nodes are expected to
be there, and not make replication decisions before it knows who's
actually there and who's not.
A crude way to achieve that is to just wait for a while, hoping that all
the data nodes connect.
A more refined way would be to compare who has connected against who is
expected to connect. That enables faster startup when everyone connects
quickly, and better robustness when some data nodes are slow to connect,
or when the name node is slow to process the barrage of connections.
The rule could be "no replications until X% of the expected nodes have
connected, AND there are no pending unprocessed connection messages". X
should be on the order of 90, perhaps less for very small clusters.
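In rough code terms, just to illustrate the rule (names made up):

    // Allow replication only once X% of the expected data nodes have
    // connected AND the connection queue is drained.
    class ReplicationGate {
      private final int expectedNodes;
      private final double threshold;    // e.g. 0.9 for "on the order of 90"

      ReplicationGate(int expectedNodes, double threshold) {
        this.expectedNodes = expectedNodes;
        this.threshold = threshold;
      }

      boolean mayReplicate(int connectedNodes, int pendingConnectionMessages) {
        return connectedNodes >= threshold * expectedNodes
            && pendingConnectionMessages == 0;
      }
    }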
Yoram
-----Original Message-----
From: Hairong Kuang [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 04, 2006 5:09 PM
To: [email protected]
Subject: RE: dfs datanode heartbeats and getBlockwork requests
I think it is better to implement the start-up delay at the namenode. But
the key is that the name node should be able to tell whether or not it is
in a steady state, both at start-up time and at runtime after a network
disruption. It should not instruct datanodes to replicate or delete any
blocks before it has reached a steady state.
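Roughly, the gate might look like this (just a sketch, names invented):

    import java.util.*;

    // Gate both replicate and delete orders on a single steady-state check,
    // so the same logic covers startup and recovery from a network disruption.
    interface SteadyStateCheck {
      boolean inSteadyState();    // e.g. enough nodes reporting, no backlog
    }

    class BlockWorkScheduler {
      private final SteadyStateCheck check;

      BlockWorkScheduler(SteadyStateCheck check) {
        this.check = check;
      }

      List<String> getBlockWorkFor(String datanode) {
        if (!check.inSteadyState()) {
          return Collections.<String>emptyList();   // no replicate/delete orders yet
        }
        // ... otherwise compute replication and deletion work as usual ...
        return Collections.<String>emptyList();
      }
    }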
Hairong
-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 04, 2006 9:58 AM
To: [email protected]
Subject: Re: dfs datanode heartbeats and getBlockwork requests
Eric Baldeschwieler wrote:
If we moved to a scheme where the name node was just given a small
number of blocks with each heartbeat, there would be no reason to not
start reporting blocks immediately, would there?
There would still be a small storm of un-needed replications on startup.
Say it takes a minute at startup for all data nodes to report their
complete block lists to the name node. If heartbeats are every 3 seconds,
then all but the last data node to report in would be handed 20 small
lists of blocks to start replicating. And the switches could be saturated
doing a lot of un-needed transfers, which would slow startup. Then, for
the next minute after startup, the nodes would be told to delete blocks
that are now over-replicated. We'd like startup to be as fast and painless
as possible; waiting a bit before checking whether blocks are over- or
under-replicated seems a good way to achieve that.
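(To spell out the arithmetic: a 60 second reporting window divided by 3
second heartbeats is 20 heartbeats, so each node that reports in early
could be handed on the order of 20 redundant work lists before the name
node has the complete picture.)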
Doug