Chris,

I've gone back through the thread and here's Elton's initial question...

> >> On 06/06/11 08:22, elton sky wrote:
> >>
> >>> hello everyone,
> >>>
> >>> As I don't have experience with big scale cluster, I cannot figure out
> why
> >>> the inter-rack communication in a mapreduce job is "significantly"
> slower
> >>> than intra-rack.
> >>> I saw cisco catalyst 4900 series switch can reach upto 320Gbps
> forwarding
> >>> capacity. Connected with 48 nodes with 1Gbps ethernet each, it should
> not
> >>> be
> >>> much contention at the switch, is it?
> >>>
Elton's question is about why connections between machines on the same switch are 
faster than connections that traverse a pair of switches.
The issue isn't so much the fabric within the switch itself, but the width of 
the link between the two switches.

If you have 40 Gb/s of traffic (each direction) on a switch and you want it to 
communicate seamlessly with machines on the next switch, you have to be able to 
bond four 10GbE ports together.
(Note: there's a bit more to it, but that's the general idea.)

You're going to see a significant slowdown in communication between nodes on 
different racks because of the bandwidth limit on the ports used to connect the 
switches, not the 'fabric' within the switch itself.
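To put rough numbers on it (the figures here are illustrative, taken from the 48-node / 1 Gb/s / 40 Gb/s example above, not from any particular deployment):

```python
# Back-of-envelope oversubscription check for an inter-rack uplink.
# Assumed/illustrative numbers: 48 nodes at 1 Gb/s each, sharing a
# 40 Gb/s uplink (e.g. four bonded 10GbE ports) to the next rack.
nodes_per_rack = 48
node_link_gbps = 1
uplink_gbps = 40  # 4 x 10GbE bonded

# Worst case: every node talks across the rack boundary at line rate.
worst_case_demand_gbps = nodes_per_rack * node_link_gbps

oversubscription = worst_case_demand_gbps / uplink_gbps

print(f"Worst-case inter-rack demand: {worst_case_demand_gbps} Gb/s")
print(f"Oversubscription ratio: {oversubscription:.1f}:1")
```

Even at a modest 1.2:1 ratio the uplink becomes the bottleneck during a shuffle-heavy job; with a single 10GbE uplink instead of four, the ratio jumps to 4.8:1.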

To your point, you can monitor your jobs and see how much of the work is being 
done by 'data local' tasks. In one job we had 519 tasks, of which 482 were 
'data local'. 
So for ~93% of the tasks we didn't have any issue with network latency. And of 
the remaining ~7%, you have to consider what fraction would actually pull data 
across a 'trunk' between racks.  So yes, network latency isn't going to be a 
huge factor in terms of improving overall efficiency.
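The arithmetic behind that ~93% figure, using the task counts from the job above:

```python
# Data-locality fraction for the example job: 519 tasks total,
# 482 of them 'data local' (counts taken from the job cited above).
total_tasks = 519
data_local_tasks = 482

non_local_tasks = total_tasks - data_local_tasks  # rack-local or remote
local_pct = 100 * data_local_tasks / total_tasks

print(f"Data-local: {local_pct:.1f}%")        # ~92.9%
print(f"Non-local tasks: {non_local_tasks}")  # 37
```

And only some of those 37 non-local tasks would actually cross the inter-rack trunk; the rest may still be rack-local.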

However, that's just for Hadoop. What happens when you run HBase? ;-)
(You can have more network traffic during an m/r job.)

HTH

-Mike


                                          
