Michael,

> Depending on your hardware, that's a fabric of 40 Gb, shared.

So that fabric is shared by all 42 ports. And even if I used just 2 of the
42 ports, connecting to 2 racks, then given enough traffic those 2 ports
could use the full 40 Gb. Is this right?
-Elton

On Tue, Jun 7, 2011 at 1:42 AM, Michael Segel <[email protected]> wrote:

> Well the problem is pretty basic.
>
> Take your typical 1 GbE switch with 42 ports.
> Each port is capable of doing 1 GbE in each direction across the switch's
> fabric. Depending on your hardware, that's a fabric of 40 Gb, shared.
>
> Depending on your hardware, you are usually using 1 or maybe 2 ports to
> 'trunk' to your network's backplane. (To keep this simple, let's just say
> it's a 1-2 GbE 'trunk' to your next rack.)
> So you end up with 1 GbE of traffic from each node trying to communicate
> with a node on the next rack. If that's 20 nodes per rack and they all
> want to communicate... you end up with 20 Gb (each direction) trying to
> fit through a 1-2 GbE pipe.
>
> Think of rush hour in Chicago, or worse, rush hour in Atlanta where
> people don't know how to drive. :-P
>
> The quick fix... spend the 8-10K per switch to get a ToR switch that has
> 10+ GbE uplink capability (usually 4 ports). Then you have at least
> 10 GbE per rack.
>
> JMHO
>
> -Mike
>
> > To: [email protected]
> > Subject: Re: Why inter-rack communication in mapreduce slow?
> > Date: Mon, 6 Jun 2011 11:00:05 -0400
> > From: [email protected]
> >
> > IMO, that's right, because map/reduce/Hadoop was originally designed
> > for that kind of text-processing purpose (i.e. few stages, low
> > dependency, highly parallel).
> >
> > It's when one tries to solve general-purpose algorithms of modest
> > complexity that map/reduce gets into I/O churning problems.
> >
> > On Mon, 6 Jun 2011 23:58:53 +1000, elton sky <[email protected]>
> > wrote:
> > > Hi John,
> > >
> > > Because for map tasks, the job tracker tries to assign them to local
> > > data nodes, so there's not much n/w traffic.
> > > Then the only potential issue will be, as you said, reducers, which
> > > copy data from all maps.
> > > So in other words, if the application only creates small intermediate
> > > output, e.g. grep, wordcount, this jam between racks is not likely to
> > > happen, is it?
> > >
> > > On Mon, Jun 6, 2011 at 11:40 PM, John Armstrong
> > > <[email protected]> wrote:
> > >
> > >> On Mon, 06 Jun 2011 09:34:56 -0400, <[email protected]> wrote:
> > >> > Yeah, that's a good point.
> > >> >
> > >> > I wonder, though, what the load on the tracker nodes (ports et
> > >> > al.) would be if an inter-rack fiber switch at tens of Gb/s is
> > >> > getting maxed.
> > >> >
> > >> > Seems to me that if there is that much traffic being mitigated
> > >> > across racks, the tracker node (or whatever node it is) would
> > >> > overload first?
> > >>
> > >> It could happen, but I don't think it always would. For example, the
> > >> tracker is on rack A; sees that the best place to put reducer R is
> > >> on rack B; sees the reducer still needs a few hellabytes from mapper
> > >> M on rack C; tells M to send data to R; switches on B and C get
> > >> throttled, leaving A free to handle other things.
> > >>
> > >> In fact, it almost makes me wonder if an ideal setup is not only to
> > >> have each of the main control daemons on their own nodes, but to put
> > >> THOSE nodes on their own rack and keep all the data elsewhere.
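To put numbers on Mike's back-of-the-envelope argument, here is a minimal
sketch in Python. The figures (20 nodes per rack with 1 GbE NICs, a 2 GbE
trunk, 4 x 10 GbE ToR uplinks) are the illustrative ones from the thread,
not measurements:

    # Oversubscription math for the example above; all rates in Gb/s.
    nodes_per_rack = 20     # nodes on the rack, each with a 1 GbE NIC
    nic_gbps = 1.0          # per-node link speed
    trunk_gbps = 2.0        # uplink ('trunk') to the next rack

    # Worst case: every node talks to a node on another rack at line rate.
    offered_gbps = nodes_per_rack * nic_gbps       # 20 Gb/s offered
    oversub = offered_gbps / trunk_gbps            # 10:1 in this example
    per_node_mbps = trunk_gbps / nodes_per_rack * 1000

    print(f"offered inter-rack load: {offered_gbps:.0f} Gb/s")
    print(f"oversubscription ratio:  {oversub:.0f}:1")
    print(f"effective per-node rate: {per_node_mbps:.0f} Mb/s")

    # Mike's quick fix: a ToR switch with 4 x 10 GbE uplinks.
    tor_trunk_gbps = 40.0
    print(f"with 10 GbE uplinks:     {offered_gbps / tor_trunk_gbps:.1f}:1")

So each node's effective inter-rack bandwidth drops to roughly 100 Mb/s in
the 2 GbE case, which is where the rush-hour effect comes from.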
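Elton's point about intermediate output size can be made concrete the same
way: the shuffle traffic that must cross the trunk scales with the job's
map-output selectivity (map-output bytes / input bytes). The selectivity
values below are rough illustrative guesses, and the sketch assumes the
worst case where all map output crosses the rack boundary:

    input_tb = 1.0     # total job input in TB (illustrative)
    trunk_gbps = 2.0   # inter-rack trunk from the example above

    for job, selectivity in [("grep", 0.001), ("wordcount", 0.05), ("sort", 1.0)]:
        shuffle_gb = input_tb * selectivity * 1000   # GB of map output
        seconds = shuffle_gb * 8 / trunk_gbps        # time to push it through the trunk
        print(f"{job:9s} shuffle ~{shuffle_gb:6.1f} GB -> ~{seconds:5.0f} s on the trunk")

On these made-up numbers a grep-style job moves a few seconds' worth of
shuffle data while a sort-style job saturates the trunk for over an hour,
which matches the intuition that low-selectivity jobs rarely cause the jam.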
