Yeah, so I am loading 344 files, each one taking just under 1 second according to the log, which comes to approximately 5 minutes in total. The other 30 minutes are spent in "reduce > copy". I'm not sure why it's so slow, because it's copying about 144,000 small records and the total size is about 16MB after it's mapped.

I think with this particular query the slowness could be caused by the reduce task itself. It's a distinct count, so perhaps the reducer code is running extremely slowly? I will try to write my own tonight and see if it goes any faster.

Josh F.

On Tue, Jan 27, 2009 at 8:34 AM, Joydeep Sen Sarma <[email protected]> wrote:
> Hi Josh,
>
> Copying a large number of small map outputs can take a while. Can't say why
> the tasktracker is not running more than one mapper.
>
> We are working on this. hadoop-4565 tracks a jira to create splits that
> cross files while preserving locality. Hive-74 will use 4565 on the hive
> side to control the number of maps better.
>
> Joydeep
>
> ------------------------------
> *From:* Josh Ferguson [mailto:[email protected]]
> *Sent:* Monday, January 26, 2009 11:28 PM
> *To:* [email protected]
> *Subject:* Job Speed
>
> So I have a table with roughly 145,000 records spread across 300 files. The
> total size is about 7MB. Right now I'm running one job tracker and one task
> tracker, which is a high-CPU amazon box (1.7 GB of RAM, ~4 cores). I run
> the following query:
>
> SELECT COUNT(DISTINCT(activities.actor_id)) FROM activities;
>
> And it takes about 35 minutes to finish. One of my problems is that I can't
> get my task tracker to process more than one map at a time, even though its
> maximum number of map tasks is set higher than that. But even the map phase
> is relatively fast compared to the reduce, which takes about 30 minutes by
> itself. The status of the task is:
>
> reduce > copy (225 of 344 at 0.01 MB/s) >
>
> I really don't understand what is going on during this copy step or why
> it is taking so long. The files are small and they're all inside of amazon's
> network. Can you guys help me out?
>
> Josh F.
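
For what it's worth, the figures quoted in this thread can be sanity-checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `summarize` and the constants are made up for illustration, and it assumes that the whole 30 minutes is spent in the copy phase and that the 16MB mapped size is everything being copied.

```python
# Figures quoted in the thread. Treating all 30 minutes as copy time and
# the 16 MB mapped size as the total data copied are assumptions.
NUM_MAP_OUTPUTS = 344      # one map output per input file
SECONDS_PER_MAP = 1.0      # "just under 1 second" per file, rounded up
MAPPED_SIZE_MB = 16.0      # total size after mapping
COPY_MINUTES = 30.0        # time reported for the reduce > copy step


def summarize(num_outputs, seconds_per_map, size_mb, copy_minutes):
    """Derive rough per-phase numbers from the totals reported in the logs."""
    copy_seconds = copy_minutes * 60
    return {
        # Maps run one at a time, so their durations simply add up.
        "map_minutes": num_outputs * seconds_per_map / 60,
        # Effective throughput of the copy phase as a whole.
        "copy_mb_per_s": size_mb / copy_seconds,
        # Average wall-clock time spent per fetched map output.
        "seconds_per_fetch": copy_seconds / num_outputs,
        # Average size of a single map output.
        "kb_per_output": size_mb * 1024 / num_outputs,
    }


stats = summarize(NUM_MAP_OUTPUTS, SECONDS_PER_MAP, MAPPED_SIZE_MB, COPY_MINUTES)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Under those assumptions the serial maps account for a bit under 6 minutes, and the copy throughput rounds to the 0.01 MB/s shown in the task status. Each map output averages under 50 KB but takes roughly 5 seconds to fetch, which would point at per-fetch overhead rather than network bandwidth, in line with the remark that copying many small map outputs can take a while.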
