I never understood how Hadoop can throttle an inter-rack fiber switch. It's supposed to operate on the principle of moving the code to the data because of the I/O cost of moving the data, right?
On Mon, 6 Jun 2011 09:01:48 -0400, Joey Echeverria <[email protected]> wrote:

> Larger Hadoop installations are space dense, 20-40 nodes per rack.
> When you get to that density with multiple racks, it becomes expensive
> to buy a switch with enough capacity for all of the nodes in all of
> the racks. The typical solution is to install a switch per rack with
> uplinks to a core switch to route between the racks. In this
> arrangement, you'll be limited by the uplink bandwidth to the core
> switch for inter-rack communication. Typically these uplinks are 10-20
> Gbps (bidirectional).
>
> Assuming you have 32 nodes in a rack with 1 Gbps links, then 20 Gbps
> isn't enough bandwidth to push all of those ports at full tilt between
> racks. That's why Hadoop has the ability to take advantage of rack
> locality. It will try to schedule I/O local to a rack, where it's less
> likely to block.
>
> -Joey
>
> On Mon, Jun 6, 2011 at 7:04 AM, elton sky <[email protected]> wrote:
>> Thanks for the reply, Steve.
>>
>> I totally agree a benchmark is a good idea. But the problem is I don't
>> have a switch to play with, only a small cluster.
>> I was curious about this, so I posted the question.
>> Can some experienced people share their knowledge with us?
>>
>> Cheers
>>
>> On Mon, Jun 6, 2011 at 7:28 PM, Steve Loughran <[email protected]> wrote:
>>
>>> On 06/06/11 08:22, elton sky wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> As I don't have experience with a big-scale cluster, I cannot figure
>>>> out why the inter-rack communication in a MapReduce job is
>>>> "significantly" slower than intra-rack.
>>>> I saw the Cisco Catalyst 4900 series switch can reach up to 320 Gbps
>>>> forwarding capacity. Connected to 48 nodes with 1 Gbps Ethernet each,
>>>> there should not be much contention at the switch, should there?
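The arithmetic behind Joey's point can be sketched in a few lines, using the node count and link speeds from his example:

```python
# Oversubscription sketch using the figures from Joey's example:
# 32 nodes per rack, 1 Gbps per node link, a 20 Gbps uplink to the core switch.
nodes_per_rack = 32
node_link_gbps = 1
uplink_gbps = 20

# Worst case: every node in the rack streams off-rack at full line rate.
offered_load_gbps = nodes_per_rack * node_link_gbps  # 32 Gbps of demand

# Ratio of rack-edge bandwidth to uplink bandwidth.
oversubscription = offered_load_gbps / uplink_gbps   # 1.6:1

# Effective per-node share of the uplink when everyone talks off-rack at once.
per_node_share_gbps = uplink_gbps / nodes_per_rack   # 0.625 Gbps per node

print(f"oversubscription {oversubscription}:1, "
      f"{per_node_share_gbps} Gbps per node off-rack")
```

So even in this modest example, each node gets well under its 1 Gbps link for inter-rack traffic, which is exactly why Hadoop prefers rack-local I/O.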
>>>
>>> I don't know enough about these switches; I do hear stories about
>>> buffering and the like, and I also hear that a lot of switches don't
>>> always expect all the ports to light up simultaneously.
>>>
>>> Outside Hadoop, try setting up some simple bandwidth tests to measure
>>> inter-rack bandwidth: have every node on one rack try to talk to one on
>>> another at full rate.
>>>
>>> Set up every node talking to every other node at least once, to make
>>> sure there aren't odd problems between two nodes, which can happen if
>>> one of the NICs is playing up.
>>>
>>> Once you are happy that the basic bandwidth between servers is OK, then
>>> it's time to start worrying about adding Hadoop to the mix.
>>>
>>> -steve
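Steve's "every node talks to every other node at least once" check can be organized as a round-robin schedule, so each round only contains disjoint pairs and the tests in a round can run concurrently. A minimal sketch (the hostnames are placeholders; each pair would then run whatever point-to-point bandwidth tool you prefer, e.g. iperf, which the thread doesn't name but is a common choice):

```python
from itertools import combinations

def round_robin_pairs(nodes):
    """Circle-method tournament schedule: every pair of nodes meets exactly
    once across all rounds, and within a round no node appears in more than
    one pair, so each round's bandwidth tests can run in parallel."""
    nodes = list(nodes)
    if len(nodes) % 2:
        nodes.append(None)  # bye slot for an odd node count
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        pairs = [(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # keep position 0 fixed, rotate the remaining positions by one
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds

# Placeholder hostnames: two racks of four nodes each.
cluster = [f"rack{r}-node{i}" for r in (1, 2) for i in range(4)]
schedule = round_robin_pairs(cluster)

# Sanity check: every unordered pair is tested exactly once.
tested = [frozenset(p) for rnd in schedule for p in rnd]
assert sorted(tested, key=sorted) == \
    sorted((frozenset(p) for p in combinations(cluster, 2)), key=sorted)
```

Running the full all-pairs sweep this way also surfaces the single-NIC problems Steve mentions, since a bad NIC shows up as one node being slow against every peer rather than against one.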
