Just to close this thread: it turns out it all came down to mapred.reduce.parallel.copies being overridden to 5 on the Hive submission. Cranking that back up and everything is happy again.
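For the archives, a minimal sketch of the two places the value can be pinned so a client cannot silently lower it again (the value 20 simply mirrors the mapred-site.xml quoted further down this thread; pick whatever suits your cluster):

    -- per-query override at the top of the Hive script
    SET mapred.reduce.parallel.copies=20;

    <!-- or set it in the cluster mapred-site.xml and mark it final,
         which stops job submissions from overriding it -->
    <property>
      <name>mapred.reduce.parallel.copies</name>
      <value>20</value>
      <final>true</final>
    </property>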
Thanks for the ideas,
Tim

On Thu, Nov 18, 2010 at 11:04 AM, Tim Robertson <timrobertson...@gmail.com> wrote:
> Thanks again.
>
> We are getting closer to debugging this. Our reference for all these
> tests was a simple GroupBy using Hive, but when I do a vanilla MR job
> on the tab file input to do the same group by, it flies through -
> almost exactly 2 times quicker. Investigating further as it is not
> quite a fair test at the moment due to some config differences...
>
> On Thu, Nov 18, 2010 at 10:19 AM, Friso van Vollenhoven
> <fvanvollenho...@xebia.com> wrote:
>> Do you have IPv6 enabled on the boxes? If DNS gives both IPv4 and
>> IPv6 results for lookups, Java will try v6 first and then fall back
>> to v4, which is an additional connect attempt. You can force Java to
>> use only v4 by setting the system property java.net.preferIPv4Stack=true.
>>
>> Also, I am not sure whether Java does the same thing as nslookup when
>> doing name lookups (I believe it has its own cache as well, but
>> correct me if I'm wrong).
>>
>> You could try running something like strace (with the -T option,
>> which shows time spent in system calls) to see whether network
>> related system calls take a long time.
>>
>> Friso
>>
>> On 17 nov 2010, at 22:20, Tim Robertson wrote:
>>
>>> I don't think so Aaron - but we use names not IPs in the config, and
>>> on a node the following is instant:
>>>
>>> [r...@c2n1 ~]# nslookup c1n1.gbif.org
>>> Server:  130.226.238.254
>>> Address: 130.226.238.254#53
>>>
>>> Non-authoritative answer:
>>> Name:    c1n1.gbif.org
>>> Address: 130.226.238.171
>>>
>>> If I ssh onto an arbitrary machine in the cluster and pull a file
>>> using curl (e.g.
>>> http://c1n9.gbif.org:50075/streamFile?filename=%2Fuser%2Fhive%2Fwarehouse%2Feol_density2_4%2Fattempt_201011151423_0027_m_000000_0&delegation=null)
>>> it comes down at 110M/s with no delay on the DNS lookup.
>>>
>>> Is there a better test I can do? I am not so much a network guy...
>>> Cheers,
>>> Tim
>>>
>>> On Wed, Nov 17, 2010 at 10:08 PM, Aaron Kimball <akimbal...@gmail.com> wrote:
>>>> Tim,
>>>> Are there issues with DNS caching (or lack thereof), misconfigured
>>>> /etc/hosts, or other network-config gotchas that might be
>>>> preventing network connections between hosts from opening
>>>> efficiently?
>>>> - Aaron
>>>>
>>>> On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson <timrobertson...@gmail.com> wrote:
>>>>>
>>>>> Thanks Friso,
>>>>>
>>>>> We've been trying to diagnose this all day and still haven't found
>>>>> a solution. We're running Cacti and IO wait is down at 0.5%, M&R
>>>>> are tuned right down to 1M 1R on each machine, and the machine
>>>>> CPUs are almost idle with no swap.
>>>>> Using curl to pull a file from a DN comes down at 110M/s.
>>>>>
>>>>> We are now upping things like the epoll limits.
>>>>>
>>>>> Any ideas really greatly appreciated at this stage!
>>>>> Tim
>>>>>
>>>>> On Wed, Nov 17, 2010 at 10:20 AM, Friso van Vollenhoven
>>>>> <fvanvollenho...@xebia.com> wrote:
>>>>>> Hi Tim,
>>>>>> Getting 28K of map output to the reducers should not take
>>>>>> minutes. Reducers on a properly set up (1Gb) network should be
>>>>>> copying at multiple MB/s. I think you need to get some more info.
>>>>>> Apart from top, you'll probably also want to look at iostat and
>>>>>> vmstat. The first will tell you something about disk utilization
>>>>>> and the latter can tell you whether the machines are using swap
>>>>>> or not. This is very important.
>>>>>> If you are over-utilizing physical memory on the machines,
>>>>>> things will be slow.
>>>>>> It's even better if you put something in place that allows you
>>>>>> to get an overall view of the resource usage across the cluster.
>>>>>> Look at Ganglia (http://ganglia.sourceforge.net/) or Cacti
>>>>>> (http://www.cacti.net/) or something similar.
>>>>>> Basically a job is either CPU bound, IO bound or network bound.
>>>>>> You need to be able to look at all three to see what the
>>>>>> bottleneck is. Also, you can run into churn when you saturate
>>>>>> resources and processes are competing for them (e.g. when you
>>>>>> have two disks and 50 processes / threads reading from them,
>>>>>> things will be slow because the OS needs to switch between them
>>>>>> a lot and overall throughput will be less than what the disks
>>>>>> can do; you can see this when there is a lot of time in iowait
>>>>>> but overall throughput is low, so there's a lot of seeking going
>>>>>> on).
>>>>>>
>>>>>> On 17 nov 2010, at 09:43, Tim Robertson wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We have set up a small cluster (13 nodes) using CDH3.
>>>>>>
>>>>>> We have been tuning it using TeraSort and Hive queries on our
>>>>>> data, and the copy phase is very slow, so I'd like to ask if
>>>>>> anyone can look over our config.
>>>>>>
>>>>>> We have an unbalanced set of machines (all on a single switch):
>>>>>> - 10 of Intel @ 2.83GHz quad core, 8GB, 2x500GB 7.2K SATA
>>>>>>   (3 mappers, 2 reducers)
>>>>>> - 3 of Intel @ 2.53GHz dual quad core, 24GB, 6x250GB 5.4K SATA
>>>>>>   (12 mappers, 12 reducers)
>>>>>>
>>>>>> We monitored the load using top on the machines to settle on the
>>>>>> number of mappers and reducers and stop overloading them, and the
>>>>>> map() and reduce() are working very nicely - all our time is
>>>>>> going into the reduce copy phase.
>>>>>>
>>>>>> The config:
>>>>>>
>>>>>> io.sort.mb=400
>>>>>> io.sort.factor=100
>>>>>> mapred.reduce.parallel.copies=20
>>>>>> tasktracker.http.threads=80
>>>>>> mapred.compress.map.output=true/false (no notable difference)
>>>>>> mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
>>>>>> mapred.output.compression.type=BLOCK
>>>>>> mapred.inmem.merge.threshold=0
>>>>>> mapred.job.reduce.input.buffer.percent=0.7
>>>>>> mapred.job.reuse.jvm.num.tasks=50
>>>>>>
>>>>>> An example job:
>>>>>> (select basis_of_record, count(1) from occurrence_record group by basis_of_record)
>>>>>> Map input records: 262,573,931; maps finished in 2mins30 using 833 mappers
>>>>>> Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
>>>>>> Map output records: 1,855
>>>>>> Map output bytes: 28,724
>>>>>> REDUCE COPY PHASE finished after 7mins01secs
>>>>>> Reduce finished after 7mins17secs
>>>>>>
>>>>>> Am I correct that 28,724 bytes emitted from the maps should not
>>>>>> take 4mins30 to copy?
>>>>>>
>>>>>> We're running Puppet so we can test changes quickly.
>>>>>>
>>>>>> Any pointers on how we can debug / improve this are greatly
>>>>>> appreciated!
>>>>>> Tim
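For anyone finding this thread later, a rough sketch of the quick checks suggested above as one would run them on a node. The PID is a placeholder, and passing the IPv4 flag through mapred.child.java.opts is the usual route rather than something spelled out in the thread:

    # swap and disk utilization, sampled every 5 seconds (vmstat / iostat as Friso suggests)
    vmstat 5
    iostat -x 5

    # time spent in network-related system calls of a running task JVM (attach by PID)
    strace -T -f -e trace=network -p <task-jvm-pid>

    # force the task JVMs onto IPv4 only, e.g. appended to mapred.child.java.opts
    -Djava.net.preferIPv4Stack=true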