Michael,

> Depending on your hardware, that's a fabric of 40 Gb, shared.

So that fabric is shared by all 42 ports. And even if I used just 2 of the
42 ports, connecting to 2 racks, then given enough traffic those 2 ports
could use the full 40 Gb. Is this right?
-Elton

On Tue, Jun 7, 2011 at 1:42 AM, Michael Segel <[email protected]> wrote:

> Well the problem is pretty basic.
>
> Take your typical 1 GbE switch with 42 ports.
> Each port is capable of doing 1 GbE in each direction across the switch's
> fabric. Depending on your hardware, that's a fabric of 40 Gb, shared.
>
> Depending on your hardware, you are usually using 1 or maybe 2 ports to
> 'trunk' to your network's backplane. (To keep this simple, let's just say
> it's a 1-2 GbE 'trunk' to your next rack.)
> So you end up with 1 GbE of traffic from each node trying to communicate
> with a node on the next rack. If that's 20 nodes per rack and they all
> want to communicate... you end up with 20 Gb (each direction) trying to
> fit through a 1-2 GbE pipe.
>
> Think of rush hour in Chicago, or worse, rush hour in Atlanta where
> people don't know how to drive. :-P
>
> The quick fix... spend the 8-10K per switch to get a ToR switch that has
> 10+ GbE uplink capability (usually 4 ports). Then you have at least
> 10 GbE per rack.
>
> JMHO
>
> -Mike
>
> > To: [email protected]
> > Subject: Re: Why inter-rack communication in mapreduce slow?
> > Date: Mon, 6 Jun 2011 11:00:05 -0400
> > From: [email protected]
> >
> > IMO, that's right, because map/reduce/Hadoop was originally designed
> > for that kind of text-processing purpose (i.e. few stages, low
> > dependency, highly parallel).
> >
> > It's when one tries to solve general-purpose algorithms of modest
> > complexity that map/reduce gets into I/O churning problems.
> >
> > On Mon, 6 Jun 2011 23:58:53 +1000, elton sky <[email protected]>
> > wrote:
> > > Hi John,
> > >
> > > Because for map tasks, the job tracker tries to assign them to local
> > > data nodes, so there's not much n/w traffic.
> > > Then the only potential issue will be, as you said, reducers, which
> > > copy data from all maps.
> > > So in other words, if the application only creates small intermediate
> > > output, e.g. grep, wordcount, this jam between racks is not likely to
> > > happen, is it?
> > >
> > > On Mon, Jun 6, 2011 at 11:40 PM, John Armstrong
> > > <[email protected]> wrote:
> > >
> > >> On Mon, 06 Jun 2011 09:34:56 -0400, <[email protected]> wrote:
> > >> > Yeah, that's a good point.
> > >> >
> > >> > I wonder, though, what the load on the tracker nodes (ports et
> > >> > al.) would be if an inter-rack fiber switch at tens of Gb/s is
> > >> > getting maxed.
> > >> >
> > >> > Seems to me that if there is that much traffic being mitigated
> > >> > across racks, the tracker node (or whatever node it is) would
> > >> > overload first?
> > >>
> > >> It could happen, but I don't think it always would. For example, the
> > >> tracker is on rack A; sees that the best place to put reducer R is
> > >> on rack B; sees the reducer still needs a few hellabytes from mapper
> > >> M on rack C; tells M to send data to R; switches on B and C get
> > >> throttled, leaving A free to handle other things.
> > >>
> > >> In fact, it almost makes me wonder if an ideal setup is not only to
> > >> have each of the main control daemons on their own nodes, but to put
> > >> THOSE nodes on their own rack and keep all the data elsewhere.
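To put numbers on Mike's back-of-the-envelope argument, here is a minimal
sketch in Python. The figures (20 nodes per rack with 1 GbE NICs, a 2 GbE
trunk, 4 x 10 GbE ToR uplinks) are the illustrative ones from the thread,
not measurements:

    # Oversubscription math for the example above; all rates in Gb/s.
    nodes_per_rack = 20     # nodes on the rack, each with a 1 GbE NIC
    nic_gbps = 1.0          # per-node link speed
    trunk_gbps = 2.0        # uplink ('trunk') to the next rack

    # Worst case: every node talks to a node on another rack at line rate.
    offered_gbps = nodes_per_rack * nic_gbps       # 20 Gb/s offered
    oversub = offered_gbps / trunk_gbps            # 10:1 in this example
    per_node_mbps = trunk_gbps / nodes_per_rack * 1000

    print(f"offered inter-rack load: {offered_gbps:.0f} Gb/s")
    print(f"oversubscription ratio:  {oversub:.0f}:1")
    print(f"effective per-node rate: {per_node_mbps:.0f} Mb/s")

    # Mike's quick fix: a ToR switch with 4 x 10 GbE uplinks.
    tor_trunk_gbps = 40.0
    print(f"with 10 GbE uplinks:     {offered_gbps / tor_trunk_gbps:.1f}:1")

So each node's effective inter-rack bandwidth drops to roughly 100 Mb/s in
the 2 GbE case, which is where the rush-hour effect comes from.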
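Elton's point about intermediate output size can be made concrete the same
way: the shuffle traffic that must cross the trunk scales with the job's
map-output selectivity (map-output bytes / input bytes). The selectivity
values below are rough illustrative guesses, and the sketch assumes the
worst case where all map output crosses the rack boundary:

    input_tb = 1.0     # total job input in TB (illustrative)
    trunk_gbps = 2.0   # inter-rack trunk from the example above

    for job, selectivity in [("grep", 0.001), ("wordcount", 0.05), ("sort", 1.0)]:
        shuffle_gb = input_tb * selectivity * 1000   # GB of map output
        seconds = shuffle_gb * 8 / trunk_gbps        # time to push it through the trunk
        print(f"{job:9s} shuffle ~{shuffle_gb:6.1f} GB -> ~{seconds:5.0f} s on the trunk")

On these made-up numbers a grep-style job moves a few seconds' worth of
shuffle data while a sort-style job saturates the trunk for over an hour,
which matches the intuition that low-selectivity jobs rarely cause the jam.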
