On 28/06/11 04:49, Segel, Mike wrote:
> Hmmm. I could have sworn there was a background balancing bandwidth limiter.
There is, for the rebalancer. Node outages are taken more seriously,
though in past 0.20.x releases there was a risk of a cascade failure
after a big switch/rack failure. That risk has been reduced, though we
all await field reports to confirm this :)
You can get 12-24 TB in a server today, which means the loss of a server
generates a lot of re-replication traffic, which argues for 10 GbE.
But
-big increase in switch cost, especially if you (CoI warning) go with
Cisco
-there have been problems with things like BIOS PXE and lights-out
management on 10 GbE, probably because the NICs are off the mainboard
and not what the BIOS was expecting. This should improve.
-I don't know how well Linux works with Ethernet that fast (field
reports useful)
-the big threat is still ToR switch failure, as that will trigger a
re-replication of every block in the rack.
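
To put rough numbers on the server-loss and rack-loss cases above, here
is a back-of-envelope sketch in Python. Everything in it is an
assumption for illustration: the function name, the cluster sizes, and
the usable_fraction of NIC bandwidth left over for re-replication. It
ignores HDFS's own throttling, rack placement rules and uplink
oversubscription, so treat the results as optimistic lower bounds, not
predictions.

```python
def rereplication_hours(lost_tb, surviving_nodes, nic_gbps, usable_fraction=0.5):
    """Optimistic lower bound, in hours, to re-copy lost_tb of block data.

    Assumes (hypothetically) that the copy work spreads evenly over the
    surviving nodes and that each can devote usable_fraction of its NIC
    to re-replication.
    """
    lost_bits = lost_tb * 1e12 * 8                      # TB -> bits
    aggregate_bps = surviving_nodes * nic_gbps * 1e9 * usable_fraction
    return lost_bits / aggregate_bps / 3600             # seconds -> hours


if __name__ == "__main__":
    # Hypothetical 40-node cluster, two racks of 20, 12 or 24 TB per server.
    for nic_gbps in (1, 10):
        one_server = rereplication_hours(24, surviving_nodes=39, nic_gbps=nic_gbps)
        one_rack = rereplication_hours(20 * 12, surviving_nodes=20, nic_gbps=nic_gbps)
        print(f"{nic_gbps:>2} GbE: one 24 TB server {one_server:.2f} h, "
              f"one rack of 20 x 12 TB {one_rack:.2f} h")
```

The rack case is the one to look at: the data to re-copy scales with
the whole rack while the number of nodes left to do the copying shrinks.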
2x1 GbE lets you have redundant switches, albeit at the price of more
wiring, more things to go wrong with the wiring, etc.
The other thing to consider is how well the "enterprise" switches work
in this world. With a Hadoop cluster you can really test the claims
about how well the switches handle every port lighting up at full rate.
Indeed, I recommend that as part of your acceptance tests for the switch.
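
One way to set up such an acceptance test is to have every host send to
exactly one other host, so that every port transmits and receives at the
same time. A minimal sketch, assuming iperf is installed with "iperf -s"
already running on each node and that passwordless ssh works; the host
names and function names are made up for illustration, and the script
only prints the commands rather than running them.

```python
def ring_plan(hosts, offset=1):
    """Pair host i with host (i + offset) mod n.

    Every host appears once as a sender and once as a receiver, so each
    switch port is loaded in both directions.
    """
    n = len(hosts)
    if n < 2:
        raise ValueError("need at least two hosts")
    if offset % n == 0:
        raise ValueError("offset would pair each host with itself")
    return [(hosts[i], hosts[(i + offset) % n]) for i in range(n)]


def iperf_commands(plan, seconds=60):
    """Build one client command per (sender, receiver) pair."""
    return [f"ssh {src} iperf -c {dst} -t {seconds}" for src, dst in plan]


if __name__ == "__main__":
    hosts = [f"node{i:02d}" for i in range(1, 9)]   # hypothetical host names
    for cmd in iperf_commands(ring_plan(hosts)):
        print(cmd)
```

Start the commands in parallel and repeat with different offsets, so the
traffic crosses different port pairs on the switch each time.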