On 01/07/2011 07:41, Evert Lammerts wrote:
Keeping the number of disks per node low and the number of nodes high
should keep the impact of dead nodes under control.

It keeps the impact of dead nodes under control, but I don't think that's
long-term cost-efficient.  As prices of 10GbE go down, the "keep the node
small" argument seems less fitting.  And on another note, most servers
manufactured in the last 10 years have dual 1GbE network interfaces.  If one
were to go by these calcs:

With 150 nodes of four 2TB disks each and HDFS 60% full, it takes around 32
minutes to recover

It seems like that assumes a single 1GbE interface; why not leverage the
second?
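
For what it's worth, a rough back-of-envelope model reproduces that figure.
The ~17 MB/s of re-replication bandwidth per surviving node is my own
assumption (re-replication is throttled and competes with job traffic); the
original calc didn't state its numbers:

# Rough model of HDFS re-replication time after losing one node,
# assuming the work spreads evenly over the surviving nodes and the
# per-node re-replication bandwidth is the bottleneck.
# All parameters here are illustrative assumptions, not measurements.
def recovery_minutes(nodes, disks_per_node, disk_tb, fullness, per_node_mb_s):
    data_mb = disks_per_node * disk_tb * 1e6 * fullness  # MB on the dead node
    survivors = nodes - 1
    return data_mb / (survivors * per_node_mb_s) / 60

print(recovery_minutes(150, 4, 2, 0.6, 17))  # ~31.6 min, close to the quoted 32
print(recovery_minutes(150, 4, 2, 0.6, 34))  # ~15.8 min with the second NIC bonded

Doubling the per-node bandwidth by bonding the second interface halves the
recovery time in this model, which is the point above.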

I don't know how others set up their clusters. We have the tradition that 
every node in a cluster has at least three interfaces: one for interconnects, 
one for a management network (only reachable from within our own network and 
the primary interface for our admins, accessible only through a single 
management node) and one for ILOM, DRAC or whatever lights-out manager is 
available. This doesn't leave us room for bonding interfaces on off-the-shelf 
nodes. Plus, you'd need twice as many ports in your switch.

Yes, I didn't get into ILO. That can be 100Mbps.

In the case of Hadoop we're considering adding a fourth NIC for external 
connectivity. We don't want people interacting with HDFS from outside while 
jobs are using the interconnects.

Of course the choice between 1Gb and 10Gb ethernet is a function of price. As 
10Gb ethernet prices approach those of 1Gb ethernet it gets more attractive. 
Recovery times scale inversely with ethernet speed, so 1Gb ethernet compared 
to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just 
saying that since we have other variables to tweak - the number of disks and 
the number of nodes - we can keep recovery times down without faster 
interconnects.

2Gb bonded with 2x ToR removes a single ToR switch as an SPOF for the rack, but it increases install and debugging costs: more wires, and your network topology gets more complex.


Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more 
attractive to stop using HDFS (or at least data locality) and start using an 
external storage cluster. Compute node failure then has no impact, and disk 
failures are hardly noticed by the compute nodes. But this is still far from 
as cheap as many small nodes with relatively few disks - I really like the 
bang for buck you get with Hadoop :-)

Yeah, but you are a computing facility whose backbone pulls data off the LHC Tier 1 sites abroad faster than the disks in your neighbouring building can seek...
