On 01/07/2011 07:41, Evert Lammerts wrote:
Keeping the number of disks per node low and the number of nodes high
should keep the impact of dead nodes under control.
It keeps the impact of dead nodes under control, but I don't think that's
cost-efficient in the long term. As prices of 10GbE go down, the "keep the node
small" argument seems less fitting. And on another note, most servers
manufactured in the last 10 years have dual 1GbE network interfaces. If one
were to go by these calcs:
150 nodes with four 2TB disks each, with HDFS 60% full, it takes around 32
minutes to recover
It seems like that assumes a single 1GbE interface; why not leverage the
second?
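For anyone wanting to sanity-check a figure like that, here is the rough shape of the calculation as I understand it, as a back-of-envelope Python sketch. The function name and the per-node rebuild share are mine, not from the original estimate; re-replication is usually throttled to a fraction of each link so jobs can keep running, and I've picked a fraction that happens to land near the ~32 minute figure quoted above.

# Back-of-envelope HDFS re-replication time after losing one node.
# Assumptions: rebuild traffic is spread evenly over the surviving nodes,
# and each survivor dedicates only a fraction of its link to it.
def recovery_minutes(nodes, disks_per_node, disk_tb, fill_fraction,
                     link_gbps, rebuild_fraction_of_link):
    data_to_copy_tb = disks_per_node * disk_tb * fill_fraction   # data lost with the node
    per_node_rate_gbps = link_gbps * rebuild_fraction_of_link    # rebuild share per survivor
    aggregate_gbps = (nodes - 1) * per_node_rate_gbps            # whole-cluster rebuild rate
    seconds = (data_to_copy_tb * 8e12) / (aggregate_gbps * 1e9)  # TB -> bits, Gb/s -> bits/s
    return seconds / 60

# 150 nodes, four 2TB disks each, HDFS 60% full, single 1GbE link per node.
# With each survivor contributing ~13% of its link this comes out around
# half an hour; with full 1GbE links it would be only a few minutes.
print(recovery_minutes(150, 4, 2, 0.6, 1.0, 0.13))

I don't know what the original calculation assumed; the point is only that the answer is dominated by how much of each link you're willing to give to re-replication.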
I don't know how others set up their clusters. We have the tradition that
every node in a cluster has at least three interfaces: one for interconnects,
one for a management network (only reachable from within our own network and
the primary interface for our admins, accessible only through a single
management node) and one for ILOM, DRAC or whatever lights-out manager is
available. This doesn't leave us room for bonding interfaces on off-the-shelf
nodes. Plus, you'd need twice as many ports in your switch.
Yes, I didn't get into ILO. That can be 100Mbps.
In the case of Hadoop we're considering adding a fourth NIC for external
connectivity. We don't want people interacting with HDFS from outside while
jobs are using the interconnects.
Of course the choice between 1Gb and 10Gb Ethernet is a function of price. As
10Gb Ethernet prices approach those of 1Gb Ethernet it gets more attractive.
Recovery times scale inversely with network speed, so 1Gb Ethernet compared to
2Gb bonded Ethernet or 10Gb Ethernet makes quite a difference. I'm just saying
that since we have other variables to tweak - the number of disks per node and
the number of nodes - we can also use those to limit the impact of a failure
and keep recovery times down.
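To make that scaling concrete - simple proportionality, assuming the network
rather than the disks is the bottleneck, and taking the ~32 minute figure
quoted earlier as the 1GbE baseline:

base_minutes = 32.0   # 1GbE figure from earlier in the thread
for gbps in (1, 2, 10):
    print("%2dGb/s per node -> ~%.1f minutes" % (gbps, base_minutes / gbps))

i.e. bonded 2Gb roughly halves it, and 10GbE would cut it to a few minutes,
provided nothing else becomes the limit first.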
2Gb bonded with 2x ToR removes a single ToR switch as an SPOF for the
rack, but increases install/debugging costs: more wires, and your network
topology and configuration get more complex.
Another thing to consider is that as 10Gb Ethernet gets cheaper, it gets more
attractive to stop using HDFS (or at least data locality) and start using an
external storage cluster. Compute node failure then has no impact, and disk
failure is hardly noticed by the compute nodes. But this is still very far from
as cheap as many small nodes with relatively few disks - I really like the bang
for buck you get with Hadoop :-)
Yeah, but you are a computing facility whose backbone gets data off the
LHC tier 1 sites abroad faster than the seek time of the disks in your
neighbouring building...