The reducer method is a pretty low-cost workaround (in terms of developer
time), so I wouldn't make it too high a priority. It seems like a
throughput optimization at most, and only for the class of mapper script
that actually reduces the input set in some way.
Josh
On Jan 11, 2009, at 10:59 PM, Joydeep Sen Sarma wrote:
We should be able to control this (specify an exact mapper count) once
HADOOP-4565 and HIVE-74 are resolved (these are being worked on
actively).
From: Zheng Shao [mailto:zsh...@gmail.com]
Sent: Sunday, January 11, 2009 9:16 PM
To: hive-user@hadoop.apache.org
Subject: Re: Number of Mappers
Currently the only way to do it is to use a reducer:

set mapred.reduce.tasks=1;

SELECT TRANSFORM(actor_id) USING '/my/script' AS (actor_id, percentile, count)
FROM (SELECT actor_id FROM activities CLUSTER BY actor_id) a;
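For reference, a TRANSFORM script like '/my/script' reads tab-separated
rows on stdin and writes tab-separated rows to stdout; with
mapred.reduce.tasks=1, the single reducer feeds it every row, so it can
buffer the whole input before emitting results. The percentile logic
below is only a hypothetical sketch (the actual '/my/script' isn't
shown), assuming the percentile of an actor is the percentage of actors
whose row count is at most that actor's count:

```python
#!/usr/bin/env python
import sys
from collections import Counter

def actor_percentiles(actor_ids):
    """Given every actor_id from the input (one per row), return
    (actor_id, percentile, count) tuples, where percentile is the
    percentage of distinct actors with a row count <= this actor's."""
    counts = Counter(actor_ids)
    all_counts = sorted(counts.values())
    n = len(all_counts)
    results = []
    for actor, c in sorted(counts.items()):
        # rank of this actor's count among all actors' counts
        rank = sum(1 for v in all_counts if v <= c)
        results.append((actor, 100.0 * rank / n, c))
    return results

if __name__ == "__main__":
    # Hive sends one tab-separated row per line; actor_id is column 0.
    ids = [line.rstrip("\n").split("\t")[0] for line in sys.stdin]
    for actor, pct, cnt in actor_percentiles(ids):
        sys.stdout.write("%s\t%s\t%s\n" % (actor, pct, cnt))
```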
On Sun, Jan 11, 2009 at 8:45 PM, Josh Ferguson <j...@besquared.net>
wrote:
If I'm running a query like this:
hive> SELECT TRANSFORM(actor_id) USING '/my/script' AS (actor_id, percentile, count)
      FROM activities;
It creates a map task for each file. I need every row in the table to
be run through a single instance of the script, since certain parts
require global list information. Do I need to rework this query to use
a reducer, or can I change some configuration variable to load all of
the data from this table and run it through /my/script at once?
Josh F.
--
Yours,
Zheng