I just realized this was a Hive question. I have no experience with Hive, so my advice is probably incorrect.
On Tue, Jan 27, 2009 at 8:13 AM, jason hadoop <[email protected]> wrote:

> It is not clear to me from your email if you have the number of map tasks
> per machine set to > 1, or if you are attempting to use a multi-threaded
> mapper.
>
> How many tasks does the system split your job into? And how many execute
> at once? My first guess is that you are getting 300 map tasks, each running
> for a small number of seconds, with most of that time spent on task setup.
>
> As a first try, you could pack your 300 small files into as many files as
> you have simultaneous task execution slots, and adjust the input split size
> (probably not necessary) to ensure there is no further splitting.
>
> The reduces all essentially stall until all of the map tasks are done, so
> the reduce copy speed is a misleading value.
>
> On Mon, Jan 26, 2009 at 11:27 PM, Josh Ferguson <[email protected]> wrote:
>
>> So I have a table with roughly 145,000 records spread across 300 files.
>> The total size is about 7 MB. Right now I'm running one job tracker and
>> one task tracker, which is a high-CPU Amazon box (1.7 GB of RAM, ~4
>> cores). I run the following query:
>>
>> SELECT COUNT(DISTINCT(activities.actor_id)) FROM activities;
>>
>> And it takes about 35 minutes to finish. One of my problems is that I
>> can't get my task tracker to process more than one map at a time, even
>> though it has a higher number of maximum map tasks. But even that is
>> relatively fast compared to the reduce, which takes about 30 minutes by
>> itself. The status of the task is:
>>
>> reduce > copy (225 of 344 at 0.01 MB/s)
>>
>> I really don't understand what is going on during this copy step or why
>> it is taking so long. The files are small and they're all inside of
>> Amazon's network. Can you guys help me out?
>>
>> Josh F.
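
For reference, the packing step Jason describes can be sketched with plain
HDFS shell commands. Since the whole table is only about 7 MB, even a single
packed file (one map task) should beat 300 tiny ones. The paths and table
names below are illustrative, not from the thread, and this assumes the
table is stored as plain text files:

  # Pull the ~300 small files down, concatenate them, and push back one file.
  hadoop fs -get /user/hive/warehouse/activities ./activities
  cat ./activities/* > packed_activities.txt
  hadoop fs -put packed_activities.txt /tmp/packed/

  # In Hive, a fresh table with the same schema can then be loaded from the
  # packed copy (table name hypothetical):
  #   LOAD DATA INPATH '/tmp/packed/packed_activities.txt'
  #   INTO TABLE activities_packed;
  # At ~7 MB total the file fits well inside one HDFS block, so no further
  # input splitting should occur.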

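On the "only one map at a time" problem: one setting worth double-checking
on a cluster of that vintage is the tasktracker's map slot count, which
lives in hadoop-site.xml on the tasktracker itself and only takes effect
after the tasktracker is restarted. The value of 4 below is just an example
for a ~4-core box:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>

Even with enough slots, though, 300 map tasks of a few seconds each will
spend most of their wall-clock time on task startup, which is the point of
packing the files in the first place.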