On Nov 12, 2009, at 10:47 AM, Benjamin Reed wrote:
actually i believe that the same conf object that goes to the
InputFormat is also the one that gets sent to the jobtracker, so if
we update the reducers in that conf object it should also change the
number of reduces that get sent to the jobtracker.
No, it's a copy. Changes made in it don't end up affecting the job.
Alan.
ben
Alan Gates wrote:
I agree that it would be very useful to have a dynamic number of
reducers. However, I'm not sure how to accomplish it. MapReduce
requires that we set the number of reducers up front in JobConf,
when we submit the job. But we don't know the number of maps
until getSplits is called after job submission. I don't think MR
will allow us to set the number of reducers once the job is started.
Others have suggested that we use the file size to specify the
number of reducers. We cannot always assume the inputs are HDFS
files (it could be from HBase or something). Also different
storage formats (text, sequence files, zebra) would need different
ratios of bytes to reducers since they store data at different
compression rates. Maybe this could still work assuming, only in
the HDFS case, with the assumption that the user understands the
compression ratios and thus can set the reducer input
accordingly. But I'm not sure this will be simple enough to be
useful.
Thoughts?
Alan.
On Nov 12, 2009, at 12:12 AM, Jeff Zhang wrote:
Hi all,
Often, I will run one script on different data set. Sometimes
small data set
and sometimes large data set. And different size of data set require
different number of reducers.
I know that the default reduce number is 1, and users can change
the reduce
number in script by keywords parallel.
But I do not want to be bothered to change reduce number in
script each time
I run script.
So I have an idea that could pig provide some API that users can
set the
ratio between map task and reduce task. (and some new keyword in
pig latin
to set the ratio)
e.g. If I set the ratio to be 2:1, then if I have 100 map tasks,
it will
have 50 reduce task accordingly.
I think it will be convenient for pig users.
Jeff Zhang