It is not clear to me from your email whether you have the number of map tasks per machine set to > 1, or whether you are attempting to use a multi-threaded mapper.
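In case it is useful: the number of simultaneous map tasks per tasktracker is controlled by mapred.tasktracker.map.tasks.maximum in the tasktracker's hadoop-site.xml. Something along these lines (the value 4 is only an example for a ~4 core box; the tasktracker needs a restart for it to take effect):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>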
How many tasks does the system split your job into, and how many execute at once? My first guess is that you are getting 300 map tasks, each of which runs for only a few seconds, and that most of that time is task setup overhead.

As a first try, you could pack your 300 small files into as many files as you have simultaneous task execution slots, and adjust the input split size (probably not necessary) to ensure there is no further splitting; a rough sketch of one way to do the packing is at the end of this message.

The reduces all essentially stall until all of the map tasks are done, so the reduce copy speed is a misleading value.

On Mon, Jan 26, 2009 at 11:27 PM, Josh Ferguson <[email protected]> wrote:
> So I have a table with roughly 145,000 records spread across 300 files. The
> total size is about 7MB. Right now I'm running one job tracker and one task
> tracker, which is a high-CPU Amazon box (1.7 GB of RAM, ~4 cores). I run
> the following query:
>
> SELECT COUNT(DISTINCT(activities.actor_id)) FROM activities;
>
> And it takes about 35 minutes to finish. One of my problems is that I can't
> get my task tracker to process more than one map at a time even though it
> has a higher number of maximum map tasks. But even that is relatively fast
> compared to the reduce, which takes about 30 minutes by itself. The status
> of the task is:
>
> reduce > copy (225 of 344 at 0.01 MB/s)
>
> I really don't understand what is going on during this copy step or why it
> is taking so long. The files are small and they're all inside of Amazon's
> network. Can you guys help me out?
>
> Josh F.
>
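For the file packing, an untested sketch of one way to do it from the shell follows. The paths are hypothetical; substitute wherever your table's data actually lives, and you would then point a copy of the table (or a LOAD DATA) at the packed file rather than the original 300 files:

  # Pull all the small part files down as one local file, then push it back as a single HDFS file.
  hadoop fs -getmerge /user/hive/warehouse/activities /tmp/activities_packed
  hadoop fs -put /tmp/activities_packed /user/hive/warehouse/activities_packed/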
