Yeah, so I am loading 344 files, each one taking just under 1 second according
to the log, so the map phase takes approximately 5 minutes. The other 30
minutes are spent in "reduce > copy". I'm not sure why it's so slow, because
it's copying only about 144,000 small records, and the total size is about
16MB after it's mapped. I think with this particular query the slowness could
be caused by the reduce task itself? It's a distinct count, so perhaps the
reducer code is running extremely slowly? I will try to write my own tonight
and see if it goes any faster.
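
As a rough sanity check on those figures, here is a small Python sketch of the
arithmetic. It uses only numbers quoted in this thread; the per-fetch time
assumes the 30 minutes of copying is spread evenly over the 344 map outputs,
which is an assumption and not something the logs confirm.

```python
# Back-of-envelope check using figures quoted in this thread.
map_tasks = 344          # one map task per input file
seconds_per_map = 1.0    # "just under 1 second" each, rounded up
copy_minutes = 30        # approximate time spent in "reduce > copy"
mapped_size_mb = 16.0    # approximate total map output size

# Maps run one at a time, so the map phase is the sum of the task times.
map_phase_minutes = map_tasks * seconds_per_map / 60

# Effective throughput of the copy step.
copy_rate_mb_per_s = mapped_size_mb / (copy_minutes * 60)

# Assumption: copy time is spread evenly across all map outputs.
per_output_kb = mapped_size_mb * 1024 / map_tasks
seconds_per_fetch = copy_minutes * 60 / map_tasks

print(f"map phase:        ~{map_phase_minutes:.1f} min")
print(f"copy throughput:  ~{copy_rate_mb_per_s:.3f} MB/s")
print(f"avg map output:   ~{per_output_kb:.1f} KB")
print(f"time per fetch:   ~{seconds_per_fetch:.1f} s")
```

The throughput comes out close to the 0.01 MB/s shown in the task status, and
each map output of under 50 KB would be taking around 5 seconds to fetch. If
that even-spread assumption is roughly right, the time looks dominated by
per-fetch overhead and not by the amount of data or by the reducer's own code.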
Josh F.
On Tue, Jan 27, 2009 at 8:34 AM, Joydeep Sen Sarma <[email protected]> wrote:

>  Hi Josh,
>
>
>
> Copying a large number of small map outputs can take a while. I can't say
> why the tasktracker is not running more than one mapper.
>
>
>
> We are working on this. HADOOP-4565 is the JIRA that tracks creating splits
> that cross files while preserving locality. HIVE-74 will use HADOOP-4565 on
> the Hive side to control the number of maps better.
>
>
>
> Joydeep
>
>
>  ------------------------------
>
> *From:* Josh Ferguson [mailto:[email protected]]
> *Sent:* Monday, January 26, 2009 11:28 PM
> *To:* [email protected]
> *Subject:* Job Speed
>
>
>
> So I have a table with roughly 145,000 records spread across 300 files. The
> total size is about 7MB. Right now I'm running one job tracker and one task
> tracker, which is a high-CPU Amazon box (1.7 GB of RAM, ~4 cores). I run
> the following query:
>
>
>
> SELECT COUNT(DISTINCT(activities.actor_id)) FROM activities;
>
>
>
> And it takes about 35 minutes to finish. One of my problems is that I can't
> get my task tracker to process more than one map at a time, even though its
> maximum number of map tasks is set higher than that. But even the map phase
> is relatively fast compared to the reduce, which takes about 30 minutes by
> itself. The status of the task is:
>
>
>
> reduce > copy (225 of 344 at 0.01 MB/s) >
>
>
>
> I really don't understand what is going on during this copy step or why
> it is taking so long. The files are small and they're all inside Amazon's
> network. Can you guys help me out?
>
>
>
>   Josh F.
>
