I never understood how Hadoop can throttle an inter-rack fiber switch. It's supposed to operate on the principle of moving the code to the data because of the I/O cost of moving the data, right?
On Mon, 6 Jun 2011 09:01:48 -0400, Joey Echeverria <[email protected]> wrote:

> Larger Hadoop installations are space dense, 20-40 nodes per rack.
> When you get to that density with multiple racks, it becomes expensive
> to buy a switch with enough capacity for all of the nodes in all of
> the racks. The typical solution is to install a switch per rack with
> uplinks to a core switch to route between the racks. In this
> arrangement, you'll be limited by the uplink bandwidth to the core
> switch for inter-rack communication. Typically these uplinks are 10-20
> Gbps (bidirectional).
>
> Assuming you have 32 nodes in a rack with 1 Gbps links, then 20 Gbps
> isn't enough bandwidth to push all of those ports at full tilt between
> racks. That's why Hadoop has the ability to take advantage of rack
> locality. It will try to schedule I/O local to a rack, where it's less
> likely to block.
>
> -Joey
>
> On Mon, Jun 6, 2011 at 7:04 AM, elton sky <[email protected]> wrote:
>> Thanks for the reply, Steve.
>>
>> I totally agree a benchmark is a good idea. But the problem is I don't
>> have a switch to play with, only a small cluster.
>> I was curious about this, so I posted the question.
>> Can some experienced people share their knowledge with us?
>>
>> Cheers
>>
>> On Mon, Jun 6, 2011 at 7:28 PM, Steve Loughran <[email protected]> wrote:
>>
>>> On 06/06/11 08:22, elton sky wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> As I don't have experience with a big-scale cluster, I cannot figure
>>>> out why the inter-rack communication in a MapReduce job is
>>>> "significantly" slower than intra-rack.
>>>> I saw the Cisco Catalyst 4900 series switch can reach up to 320 Gbps
>>>> forwarding capacity. Connected to 48 nodes with 1 Gbps Ethernet each,
>>>> there should not be much contention at the switch, should there?
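The arithmetic behind Joey's point can be sketched in a few lines, using the node count and link speeds from his example:

```python
# Oversubscription sketch using the figures from Joey's example:
# 32 nodes per rack, 1 Gbps per node link, a 20 Gbps uplink to the core switch.
nodes_per_rack = 32
node_link_gbps = 1
uplink_gbps = 20

# Worst case: every node in the rack streams off-rack at full line rate.
offered_load_gbps = nodes_per_rack * node_link_gbps  # 32 Gbps of demand

# Ratio of rack-edge bandwidth to uplink bandwidth.
oversubscription = offered_load_gbps / uplink_gbps   # 1.6:1

# Effective per-node share of the uplink when everyone talks off-rack at once.
per_node_share_gbps = uplink_gbps / nodes_per_rack   # 0.625 Gbps per node

print(f"oversubscription {oversubscription}:1, "
      f"{per_node_share_gbps} Gbps per node off-rack")
```

So even in this modest example, each node gets well under its 1 Gbps link for inter-rack traffic, which is exactly why Hadoop prefers rack-local I/O.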
>>>
>>> I don't know enough about these switches; I do hear stories about
>>> buffering and the like, and I also hear that a lot of switches don't
>>> always expect all the ports to light up simultaneously.
>>>
>>> Outside Hadoop, try setting up some simple bandwidth tests to measure
>>> inter-rack bandwidth: have every node on one rack try to talk to one on
>>> another at full rate.
>>>
>>> Set up every node talking to every other node at least once, to make
>>> sure there aren't odd problems between two nodes, which can happen if
>>> one of the NICs is playing up.
>>>
>>> Once you are happy that the basic bandwidth between servers is OK, then
>>> it's time to start worrying about adding Hadoop to the mix.
>>>
>>> -steve
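Steve's "every node talks to every other node at least once" check can be organized as a round-robin schedule, so each round only contains disjoint pairs and the tests in a round can run concurrently. A minimal sketch (the hostnames are placeholders; each pair would then run whatever point-to-point bandwidth tool you prefer, e.g. iperf, which the thread doesn't name but is a common choice):

```python
from itertools import combinations

def round_robin_pairs(nodes):
    """Circle-method tournament schedule: every pair of nodes meets exactly
    once across all rounds, and within a round no node appears in more than
    one pair, so each round's bandwidth tests can run in parallel."""
    nodes = list(nodes)
    if len(nodes) % 2:
        nodes.append(None)  # bye slot for an odd node count
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        pairs = [(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # keep position 0 fixed, rotate the remaining positions by one
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds

# Placeholder hostnames: two racks of four nodes each.
cluster = [f"rack{r}-node{i}" for r in (1, 2) for i in range(4)]
schedule = round_robin_pairs(cluster)

# Sanity check: every unordered pair is tested exactly once.
tested = [frozenset(p) for rnd in schedule for p in rnd]
assert sorted(tested, key=sorted) == \
    sorted((frozenset(p) for p in combinations(cluster, 2)), key=sorted)
```

Running the full all-pairs sweep this way also surfaces the single-NIC problems Steve mentions, since a bad NIC shows up as one node being slow against every peer rather than against one.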
