Alan Gates
Thu, 12 Nov 2009 08:26:53 -0800
Others have suggested that we use the file size to determine the number of reducers. But we cannot always assume the inputs are HDFS files (they could come from HBase or something else). Also, different storage formats (text, sequence files, zebra) would need different ratios of bytes to reducers, since they store data at different compression rates. Maybe this could still work in the HDFS case only, assuming the user understands the compression ratios and can set the reducer input accordingly. But I'm not sure this will be simple enough to be useful.
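To make the size-based suggestion concrete, here is a rough sketch in Python. It is not Pig code. The function name, the per-format expansion factors, and the bytes-per-reducer setting are all made up for illustration, and the factors are placeholders rather than measurements.

```python
import math

# Hypothetical expansion factors: logical bytes per stored byte.
# Placeholder values only; real ratios depend on the data and codec.
EXPANSION = {"text": 1.0, "sequence": 2.0, "zebra": 3.0}

def reducers_from_size(stored_bytes, fmt, bytes_per_reducer):
    """Estimate a reducer count from on-disk input size (HDFS case only)."""
    # Scale the stored size by the format's assumed expansion factor.
    logical_bytes = stored_bytes * EXPANSION[fmt]
    # Round up, and never return fewer than one reducer.
    return max(1, math.ceil(logical_bytes / bytes_per_reducer))
```

The user would still have to supply a sensible factor per format, which is exactly the part that may be too fiddly to be useful.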
Thoughts?

Alan.

On Nov 12, 2009, at 12:12 AM, Jeff Zhang wrote:
Hi all,

Often I will run one script on different data sets, sometimes a small data set and sometimes a large one, and different sizes of data set require different numbers of reducers. I know that the default number of reducers is 1, and that users can change it in the script with the keyword PARALLEL. But I do not want to be bothered to change the number of reducers in the script each time I run it.

So I have an idea: could Pig provide some API that lets users set the ratio between map tasks and reduce tasks (and some new keyword in Pig Latin to set the ratio)?

e.g. If I set the ratio to 2:1 and I have 100 map tasks, there will be 50 reduce tasks accordingly.

I think it will be convenient for Pig users.

Jeff Zhang
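The ratio proposal above amounts to simple arithmetic, sketched here in Python. This is not an existing Pig API. The function name and the `maps_per_reducer` parameter are hypothetical, and rounding up with a floor of one reducer is an assumption the original message does not specify.

```python
import math

def reducers_from_ratio(num_maps, maps_per_reducer=2.0):
    """Derive a reducer count from the map task count and a map:reduce ratio.

    A ratio of 2:1 is expressed as maps_per_reducer=2.0.
    """
    # Round up so a partial group of maps still gets a reducer,
    # and never go below one reducer (assumed behaviour).
    return max(1, math.ceil(num_maps / maps_per_reducer))
```

With the 2:1 ratio from the message, `reducers_from_ratio(100)` gives 50 reduce tasks for 100 map tasks.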