Chris,
I've gone back through the thread and here's Elton's initial question...
> >> On 06/06/11 08:22, elton sky wrote:
> >>
> >>> hello everyone,
> >>>
> >>> As I don't have experience with big scale cluster, I cannot figure out
> why
> >>> the inter-rack communication in a mapreduce job is "significantly"
> slower
> >>> than intra-rack.
> >>> I saw cisco catalyst 4900 series switch can reach upto 320Gbps
> forwarding
> >>> capacity. Connected with 48 nodes with 1Gbps ethernet each, it should
> not
> >>> be
> >>> much contention at the switch, is it?
> >>>
Elton's question is really about why connections between nodes on the same switch are faster than connections that have to traverse a set of switches.
The issue isn't so much the switching fabric within a single switch, but the width of the uplink connecting the two switches.
If a rack is pushing 40 Gb/s (each direction) through its switch and you want it to communicate at full rate with machines on the next switch, you'd have to bond four 10GbE ports together for the uplink.
(Note: there's a bit more to it, but that's the general idea.)
You're going to see a significant slowdown in communication between nodes on different racks because of the bandwidth limits on the ports used to connect the switches, not because of the 'fabric' within the switch itself.
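The back-of-envelope arithmetic behind that bottleneck can be sketched like this (the 48 nodes at 1 Gb/s and the four bonded 10GbE uplinks are just the illustrative figures from this thread, not a real topology):

```python
# Rough oversubscription estimate for a top-of-rack switch.
# Assumed figures from the example above: 48 nodes at 1 Gb/s each,
# uplink of 4 bonded 10GbE ports to the next switch.
nodes = 48
node_gbps = 1          # Gb/s per node NIC
uplink_ports = 4
uplink_gbps = 10       # Gb/s per bonded uplink port

offered_load = nodes * node_gbps          # 48 Gb/s the rack can generate
uplink_capacity = uplink_ports * uplink_gbps  # 40 Gb/s to the next switch

ratio = offered_load / uplink_capacity
print(f"oversubscription ratio: {ratio:.1f}:1")  # 1.2:1
```

Even this mild 1.2:1 ratio means the uplink, not the switch fabric, is the first thing to saturate when many nodes talk across racks at once; real clusters are often oversubscribed far more heavily than that.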
To your point, you can monitor your jobs and see how much of the work is being
done by 'data local' tasks. In one job we had 519 tasks, of which 482 were
'data local'.
So ~93% of the tasks had no network transfer to worry about. For the remaining
7%, you then have to consider what fraction actually pulled data across a
'trunk' between racks. So yes, network latency isn't going to be a huge factor
in terms of improving overall efficiency.
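Working out that percentage from the counters mentioned above is trivial, but worth showing since the non-local remainder is the upper bound on cross-rack traffic:

```python
# Data-locality ratio from the example job counters above.
total_tasks = 519
data_local_tasks = 482

local_fraction = data_local_tasks / total_tasks
remote_fraction = 1 - local_fraction  # upper bound on cross-rack reads

print(f"data local: {local_fraction:.1%}")  # 92.9%
print(f"non-local:  {remote_fraction:.1%}")  # 7.1%
```

Note the 7.1% is only an upper bound: some of those non-local tasks will still read from a node on the same rack and never touch the inter-switch trunk.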
However, that's just for Hadoop. What happens when you run HBase? ;-)
(You can have more network traffic during a m/r job.)
HTH
-Mike