Just to close this thread.
It turns out it all came down to the mapred.reduce.parallel.copies
setting being overridden to 5 on the Hive job submission. With that
cranked back up, everything is happy again.
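
In case anyone else hits this, the override can be undone per-session
in Hive before running the query (a sketch; 20 is simply the value
from our mapred-site.xml):

SET mapred.reduce.parallel.copies=20;
SELECT basis_of_record, count(1) FROM occurrence_record GROUP BY basis_of_record;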

Thanks for the ideas,

Tim




On Thu, Nov 18, 2010 at 11:04 AM, Tim Robertson
<timrobertson...@gmail.com> wrote:
> Thanks again.
>
> We are getting closer to debugging this.  Our reference for all these
> tests was a simple GroupBy using Hive, but when I do a vanilla MR job
> on the tab file input to do the same group by, it flies through -
> almost exactly 2 times quicker.  Investigating further as it is not
> quite a fair test at the moment due to some config differences...
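
For anyone curious, a vanilla MR group-by count over a tab-delimited
file can be as simple as the sketch below (this is not the exact job
we ran; the class name, column index and use of a combiner are
assumptions):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupByCount {

  public static class GroupMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // basis_of_record assumed to be the first tab-separated column
      String[] fields = line.toString().split("\t", -1);
      outKey.set(fields[0]);
      context.write(outKey, ONE);
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) {
        sum += c.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "group-by-count");
    job.setJarByClass(GroupByCount.class);
    job.setMapperClass(GroupMapper.class);
    job.setCombinerClass(SumReducer.class);   // cuts shuffle volume even further
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Like the Hive query, this shuffles only a handful of bytes per map,
so the copy phase should be near-instant.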
>
>
> On Thu, Nov 18, 2010 at 10:19 AM, Friso van Vollenhoven
> <fvanvollenho...@xebia.com> wrote:
>> Do you have IPv6 enabled on the boxes? If DNS gives both IPv4 and IPv6 
>> results for lookups, Java will try v6 first and then fall back to v4, which 
>> is an additional connect attempt. You can force Java to use only v4 by 
>> setting the system property java.net.preferIPv4Stack=true.
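
If you want to try that, one place to put it is the Hadoop JVM options
(a sketch; file locations per a standard CDH3 layout, adjust for your
install):

# hadoop-env.sh (DataNode/TaskTracker daemons)
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true $HADOOP_OPTS"
# For the child task JVMs, append the same flag to mapred.child.java.opts
# in mapred-site.xml, e.g. -Xmx512m -Djava.net.preferIPv4Stack=true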
>>
>> Also, I am not sure whether Java does the same thing as nslookup when doing 
>> name lookups (I believe it has its own cache as well, but correct me if I'm 
>> wrong).
>>
>> You could try running something like strace (with the -T option, which shows 
>> time spent in system calls) to see whether network related system calls take 
>> a long time.
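
For example, to time the network syscalls of a reduce task or the
TaskTracker during the copy phase, something like this could be used
(a sketch; <pid> is whatever process you attach to):

strace -f -T -e trace=network -p <pid> -o /tmp/shuffle-strace.txt
# -T appends the time spent in each call, so slow connect()/recv() calls stand out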
>>
>>
>>
>> Friso
>>
>>
>>
>>
>> On 17 nov 2010, at 22:20, Tim Robertson wrote:
>>
>>> I don't think so, Aaron - but we use names, not IPs, in the config,
>>> and on a node the following is instant:
>>>
>>> [r...@c2n1 ~]# nslookup c1n1.gbif.org
>>> Server:               130.226.238.254
>>> Address:      130.226.238.254#53
>>>
>>> Non-authoritative answer:
>>> Name: c1n1.gbif.org
>>> Address: 130.226.238.171
>>>
>>> If I ssh onto an arbitrary machine in the cluster and pull a file
>>> using curl (e.g.
>>> http://c1n9.gbif.org:50075/streamFile?filename=%2Fuser%2Fhive%2Fwarehouse%2Feol_density2_4%2Fattempt_201011151423_0027_m_000000_0&delegation=null)
>>> it comes down at 110 MB/s with no delay on the DNS lookup.
>>>
>>> Is there a better test I can do? I'm not much of a network guy...
>>> Cheers,
>>> Tim
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 17, 2010 at 10:08 PM, Aaron Kimball <akimbal...@gmail.com> 
>>> wrote:
>>>> Tim,
>>>> Are there issues with DNS caching (or lack thereof), misconfigured
>>>> /etc/hosts, or other network-config gotchas that might be preventing 
>>>> network
>>>> connections between hosts from opening efficiently?
>>>> - Aaron
>>>>
>>>> On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson <timrobertson...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Thanks Friso,
>>>>>
>>>>> We've been trying to diagnose this all day and still haven't found a
>>>>> solution. We're running Cacti and IO wait is down at 0.5%, map and
>>>>> reduce slots are tuned right down to 1 mapper and 1 reducer on each
>>>>> machine, and the machine CPUs are almost idle with no swap.
>>>>> Pulling a file from a DataNode with curl comes down at 110 MB/s.
>>>>>
>>>>> We are now upping things like the epoll limits.
>>>>>
>>>>> Any ideas would be greatly appreciated at this stage!
>>>>> Tim
>>>>>
>>>>>
>>>>> On Wed, Nov 17, 2010 at 10:20 AM, Friso van Vollenhoven
>>>>> <fvanvollenho...@xebia.com> wrote:
>>>>>> Hi Tim,
>>>>>> Getting 28K of map outputs to reducers should not take minutes.
>>>>>> Reducers on a properly set up (1Gb) network should be copying at
>>>>>> multiple MB/s. I think you need to get some more info.
>>>>>> Apart from top, you'll probably also want to look at iostat and
>>>>>> vmstat. The first will tell you something about disk utilization
>>>>>> and the latter can tell you whether the machines are using swap
>>>>>> or not. This is very important. If you are over-utilizing
>>>>>> physical memory on the machines, things will be slow.
>>>>>> It's even better if you put something in place that allows you to
>>>>>> get an overall view of the resource usage across the cluster.
>>>>>> Look at Ganglia (http://ganglia.sourceforge.net/) or Cacti
>>>>>> (http://www.cacti.net/) or something similar.
>>>>>> Basically a job is either CPU bound, IO bound or network bound.
>>>>>> You need to be able to look at all three to see what the
>>>>>> bottleneck is. Also, you can run into churn when you saturate
>>>>>> resources and processes are competing for them (e.g. when you
>>>>>> have two disks and 50 processes/threads reading from them, things
>>>>>> will be slow because the OS needs to switch between them a lot
>>>>>> and overall throughput will be less than what the disks can do;
>>>>>> you can see this when there is a lot of time in iowait but
>>>>>> overall throughput is low, so there's a lot of seeking going on).
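
As a concrete starting point, the following could be run on a few
nodes while the job is in its copy phase (a sketch; sar needs the
sysstat package):

vmstat 5      # si/so columns show swap activity
iostat -x 5   # %util and await show per-disk saturation
sar -n DEV 5  # per-interface network throughput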
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17 nov 2010, at 09:43, Tim Robertson wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We have set up a small cluster (13 nodes) using CDH3.
>>>>>>
>>>>>> We have been tuning it using TeraSort and Hive queries on our data,
>>>>>> and the copy phase is very slow, so I'd like to ask if anyone can look
>>>>>> over our config.
>>>>>>
>>>>>> We have an unbalanced set of machines (all on a single switch):
>>>>>> - 10 x Intel @ 2.83GHz quad-core, 8GB RAM, 2x500GB 7.2K SATA
>>>>>> (3 mappers, 2 reducers)
>>>>>> - 3 x Intel @ 2.53GHz dual quad-core, 24GB RAM, 6x250GB 5.4K SATA
>>>>>> (12 mappers, 12 reducers)
>>>>>>
>>>>>> We monitored the load using top on the machines to settle on a number
>>>>>> of mappers and reducers that stops overloading them, and the map() and
>>>>>> reduce() phases are working very nicely - all our time is going into
>>>>>> the copy phase.
>>>>>>
>>>>>> The config:
>>>>>>
>>>>>> io.sort.mb=400
>>>>>> io.sort.factor=100
>>>>>> mapred.reduce.parallel.copies=20
>>>>>> tasktracker.http.threads=80
>>>>>> mapred.compress.map.output=true/false (no noticeable difference)
>>>>>> mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
>>>>>> mapred.output.compression.type=BLOCK
>>>>>> mapred.inmem.merge.threshold=0
>>>>>> mapred.job.reduce.input.buffer.percent=0.7
>>>>>> mapred.job.reuse.jvm.num.tasks=50
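
For reference, each of these can be pinned cluster-wide in
mapred-site.xml; anything not marked final can be overridden by a job
submission, including Hive's, e.g. (a sketch):

<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value>
  <final>true</final> <!-- optional: stops per-job overrides -->
</property>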
>>>>>>
>>>>>> An example job:
>>>>>> (select basis_of_record,count(1) from occurrence_record group by
>>>>>> basis_of_record)
>>>>>> Map input records: 262,573,931; maps finished in 2mins30 using 833 mappers
>>>>>> Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
>>>>>> Map output records: 1,855
>>>>>> Map output bytes: 28,724
>>>>>> REDUCE COPY PHASE finished after 7mins01secs
>>>>>> Reduce finished after 7mins17secs
>>>>>>
>>>>>> Am I correct that 28,724 bytes emitted from the maps should not take
>>>>>> 4mins30 to copy?
>>>>>>
>>>>>> We're running Puppet, so we can test changes quickly.
>>>>>>
>>>>>> Any pointers on how we can debug / improve this are greatly appreciated!
>>>>>> Tim
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
>
