So I have a table with roughly 145,000 records spread across 300
files. The total size is about 7 MB. Right now I'm running one job
tracker and one task tracker, which runs on a High-CPU Amazon box
(1.7 GB of RAM, ~4 cores). I run the following query:
SELECT COUNT(DISTINCT(activities.actor_id)) FROM activities;
It takes about 35 minutes to finish. One of my problems is that I
can't get my task tracker to process more than one map at a time, even
though its maximum number of map tasks is set higher than that. But
even the map phase is relatively fast compared to the reduce, which
takes about 30 minutes by itself. The status of the task is:
reduce > copy (225 of 344 at 0.01 MB/s) >
I really don't understand what is going on during this copy step or
why it is taking so long. The files are small and they're all inside
Amazon's network. Can you guys help me out?
Josh F.