The reducer method is a pretty low-cost workaround (in terms of developer time), so I wouldn't make it too high a priority. It seems like a throughput optimization at most, and only for the class of mapper script that actually reduces the input set in some way.

Josh

On Jan 11, 2009, at 10:59 PM, Joydeep Sen Sarma wrote:

We should be able to control this (specify an exact mapper count) once HADOOP-4565 and HIVE-74 are resolved (both are being actively worked on).

From: Zheng Shao [mailto:zsh...@gmail.com]
Sent: Sunday, January 11, 2009 9:16 PM
To: hive-user@hadoop.apache.org
Subject: Re: Number of Mappers

Currently the only way to do it is to use a reducer.

set mapred.reduce.tasks=1;
SELECT TRANSFORM(actor_id) USING '/my/script' AS (actor_id, percentile, count)
FROM (SELECT actor_id FROM activities CLUSTER BY actor_id) a;

On Sun, Jan 11, 2009 at 8:45 PM, Josh Ferguson <j...@besquared.net> wrote:
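(For reference, a minimal sketch of what a script like '/my/script' could look like; this is hypothetical and not from the thread. Hive streams one row per line on stdin, and the percentile is a global statistic, which is why every row has to reach a single instance of the script.)

```python
import sys
from collections import Counter

def percentile_counts(lines):
    """Return (actor_id, percentile, count) rows for the whole input.

    Percentile here is the share of distinct actors whose count is <= this
    actor's count -- a global statistic, so the script must see every row
    in one instance (hence the single-reducer CLUSTER BY trick).
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    all_counts = sorted(counts.values())
    n = len(all_counts)
    rows = []
    for actor_id, c in sorted(counts.items()):
        rank = sum(1 for x in all_counts if x <= c)
        rows.append((actor_id, 100.0 * rank / n, c))
    return rows

if __name__ == "__main__":
    # Hive's TRANSFORM feeds tab-separated rows on stdin and reads
    # tab-separated rows back from stdout.
    for actor_id, pct, c in percentile_counts(sys.stdin):
        print("%s\t%.1f\t%d" % (actor_id, pct, c))
```

Any script that only needs per-row or per-key state could stay a plain mapper; it's the global rank computation that forces everything through one process.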
If I'm running a query like this:

hive> SELECT TRANSFORM(actor_id) USING '/my/script' AS (actor_id, percentile, count) FROM activities;

It creates a map task for each file. I need every row in the table to be run through a single instance of the script, since certain parts require global list information. Do I need to rework this query to use a reducer, or can I change some configuration variable to load all of the data from this table and run it through /my/script at once?

Josh F.



--
Yours,
Zheng
