Hi everyone,
I am using Hadoop (0.17) to do some large-scale user comparisons, and
although the programs are all written, it's taking incredibly long to run and
it seems like it should be going faster. I would really like some insight
into what I could do to speed this up, aside from just "add more computers".
I would really appreciate some help from all of the sagacious Hadoop
core-users.

The basic idea is that there are a bunch of users, each of whom is in some
set of groups. I would like to know how many users each combination of
groups has in common. I laid out the data using SequenceFiles, which seems
to be working well and quickly: each entry has a Text user name and a
MapWritable containing all of the groups that user is in. The map function
takes in each user and outputs every pairwise combination of the groups the
user belongs to, with a 1 as the instance count (like in wordcount). So
user x, a member of groups 1, 2, 3, and 4, will output 1-2, 1-3, 1-4, 2-3,
2-4, 3-4 as keys. Given that there are a lot of users, I added a combiner,
which reduces the number of records by about 10x. The reducer is really
simple: it just sums up the total for each combination and outputs it to a
file. As an aside, I make sure to use IntWritables just about everywhere,
which I hoped would help since there are inevitably tons of comparisons
going on.
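To make the pair-generation step concrete, here is the core loop in plain Java, pulled out of the MapReduce plumbing (class and method names here are illustrative, not my actual mapper; in the real job each key would be wrapped in a Text and emitted with output.collect(key, one)):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the pair-generation logic described above, outside of Hadoop.
public class GroupPairs {

    // For a user's sorted group ids, build every unordered pair key "a-b".
    // A user in g groups produces g*(g-1)/2 keys.
    static List<String> pairKeys(List<Integer> groups) {
        List<String> keys = new ArrayList<String>();
        for (int i = 0; i < groups.size(); i++) {
            for (int j = i + 1; j < groups.size(); j++) {
                keys.add(groups.get(i) + "-" + groups.get(j));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // User x in groups 1,2,3,4 from the example above.
        System.out.println(pairKeys(Arrays.asList(1, 2, 3, 4)));
    }
}
```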

This is being done on about 4 GB of user data on a 20-Large-instance cluster
on Amazon's EC2. With that much data there are about 240 map tasks, and I
have it set to run 10 map tasks per TaskTracker. With those settings the
slaves are running at about 100% CPU, and memory is just about at capacity
but there is almost no paging. Although the tasks seem to be progressing,
some of the tasks that have just completed ran for 30+ hours. Some of the
tasks have failed with a "Lost task tracker:" error, which I intend to fix
with HADOOP-3403_0_20080516.patch whenever this job finishes.
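For reference, the 10-maps-per-tracker setting above is the per-TaskTracker slot cap in hadoop-site.xml (property name as I understand it for 0.16/0.17-era configs; verify against your release's hadoop-default.xml):

```xml
<!-- hadoop-site.xml: cap on simultaneous map tasks per TaskTracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>10</value>
</property>
```

With CPU pegged and memory near capacity, lowering this value is one of the few knobs that can relieve pressure without changing the job itself.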

It seemed to me that the problem might be calling collect() so many times,
since users can be in thousands of groups and the mapper emits about n^2
records per user. I tried another version that outputs only n records per
user by having each entry output a map, but this did not prove much better
in the test trials I ran, and the extra work in the reducer is really killer.
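To put numbers on the n^2 concern: a user in g groups emits g*(g-1)/2 pair keys, so a single user in a thousand groups contributes roughly half a million intermediate records before the combiner sees them. A quick back-of-envelope check (helper name is just for illustration):

```java
// Back-of-envelope cost of the pairwise emit described above.
public class EmitCost {

    // A user in g groups emits g*(g-1)/2 pair keys.
    static long emitted(long g) {
        return g * (g - 1) / 2;
    }

    public static void main(String[] args) {
        // 1,000 groups for one user -> 499,500 intermediate records.
        System.out.println(emitted(1000));
    }
}
```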

It is not clear to me what is dragging down this job, or what I can do to
increase the rate at which it is computing. Although there is quite a bit of
data, it doesn't seem like it should be taking this long on 20 nodes. Any
help/questions/comments would be greatly appreciated. Thanks for all of your
help.

-- 
View this message in context: 
http://www.nabble.com/Speed-up-a-job-thats-been-running-for-60%2B-hours-%28long%29-tp17465721p17465721.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.