> Date: Tue, 7 Jun 2011 11:24:53 +1000
> Subject: Re: Why inter-rack communication in mapreduce slow?
> From: [email protected]
> To: [email protected]
>
> Michael,
> >Depending on your hardware, that's a fabric of 40 Gb/s, shared.
> So that fabric is shared by all 42 ports. And even if I just used 2 ports
> out of 42, connecting to 2 racks, if there's enough traffic coming, these 2
> ports could use all 40 Gb/s. Is this right?
>
> -Elton
>
Elton,
That 40 Gb/s is shared across all of the ports in the switch, but the maximum
bandwidth each port can handle is only 1 Gb/s in each direction.
So if you've got 40 Gb/s of traffic and only a single 1 GbE port, you can only
push 1 Gb/s through that port.
I realize I'm oversimplifying it, but that's the basic problem. This is why
you will find switches that have 1 GbE ports with 10 GbE uplinks.
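To put rough numbers on it, here's a back-of-the-envelope sketch of the oversubscription math from this thread (the figures are the ones quoted below; the function is just for illustration, not anything in Hadoop):

```python
# Back-of-the-envelope oversubscription math.
# All figures in Gb/s; a 1 GbE port carries 1 Gb/s in each direction.

def oversubscription(nodes_per_rack, node_link_gbps, uplink_gbps):
    """Ratio of worst-case inter-rack demand to the rack's uplink capacity."""
    demand = nodes_per_rack * node_link_gbps  # every node talking off-rack at once
    return demand / uplink_gbps

# 20 nodes at 1 GbE each, trunked through a 2 Gb/s uplink:
print(oversubscription(20, 1.0, 2.0))   # -> 10.0  (10:1 oversubscribed)

# Same rack behind a 10 GbE uplink:
print(oversubscription(20, 1.0, 10.0))  # -> 2.0   (2:1)
```

A 10:1 ratio is why the shuffle crawls when every reducer pulls map output from other racks; the 10 GbE uplink doesn't eliminate the bottleneck, it just shrinks it.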
HTH
-Mike
> On Tue, Jun 7, 2011 at 1:42 AM, Michael Segel
> <[email protected]>wrote:
>
> >
> > Well the problem is pretty basic.
> >
> > Take your typical 1 GbE switch with 42 ports.
> > Each port is capable of doing 1 Gb/s in each direction across the switch's
> > fabric.
> > Depending on your hardware, that's a fabric of roughly 40 Gb/s, shared.
> >
> > Depending on your hardware, you are usually using 1 or maybe 2 ports to
> > 'trunk' to your network's backplane. (To keep this simple, let's just say
> > that it's a 1-2 Gb/s 'trunk' to your next rack.)
> > So you end up with 1 Gb/s of traffic from each node trying to communicate
> > with another node on the next rack. If that's 20 nodes per rack and they
> > all want to communicate... you end up with 20 Gb/s (in each direction)
> > trying to fit through a 1-2 Gb/s pipe.
> >
> > Think of rush hour in Chicago, or worse, rush hour in Atlanta, where
> > people don't know how to drive. :-P
> >
> > The quick fix... spend the $8-10K per switch to get a top-of-rack (ToR)
> > switch that has 10 GbE uplink capabilities (usually 4 ports). Then you
> > have at least 10 Gb/s per rack.
> >
> > JMHO
> >
> > -Mike
> >
> >
> >
> > > To: [email protected]
> > > Subject: Re: Why inter-rack communication in mapreduce slow?
> > > Date: Mon, 6 Jun 2011 11:00:05 -0400
> > > From: [email protected]
> > >
> > >
> > > IMO, that's right, because map/reduce/Hadoop was originally designed for
> > > that kind of text-processing purpose (i.e., few stages, low dependency,
> > > highly parallel).
> > >
> > > It's when one tries to solve general-purpose algorithms of modest
> > > complexity that map/reduce runs into I/O churning problems.
> > >
> > > On Mon, 6 Jun 2011 23:58:53 +1000, elton sky <[email protected]>
> > > wrote:
> > > > Hi John,
> > > >
> > > > Because for map tasks, the job tracker tries to assign them to local
> > > > data nodes, so there's not much n/w traffic.
> > > > Then the only potential issue will be, as you said, reducers, which
> > > > copy data from all maps.
> > > > So in other words, if the application only creates small intermediate
> > > > output, e.g. grep, wordcount, this jam between racks is not likely to
> > > > happen, is it?
> > > >
> > > >
> > > > On Mon, Jun 6, 2011 at 11:40 PM, John Armstrong
> > > > <[email protected]>wrote:
> > > >
> > > >> On Mon, 06 Jun 2011 09:34:56 -0400, <[email protected]> wrote:
> > > >> > Yeah, that's a good point.
> > > >> >
> > > >> > I wonder, though, what the load on the tracker nodes (ports, etc.)
> > > >> > would be if an inter-rack fiber switch at tens of Gb/s is getting
> > > >> > maxed out.
> > > >> >
> > > >> > Seems to me that if there is that much traffic being migrated across
> > > >> > racks, the tracker node (or whatever node it is) would overload
> > > >> > first?
> > > >>
> > > >> It could happen, but I don't think it always would. For example: the
> > > >> tracker is on rack A; it sees that the best place to put reducer R is
> > > >> on rack B; it sees that the reducer still needs a few hellabytes from
> > > >> mapper M on rack C; it tells M to send the data to R; the switches on
> > > >> B and C get throttled, leaving A free to handle other things.
> > > >>
> > > >> In fact, it almost makes me wonder if an ideal setup is not only to
> > > >> have each of the main control daemons on their own nodes, but to put
> > > >> THOSE nodes on their own rack and keep all the data elsewhere.
> > > >>
> >
> >