Bae, Jae Hyeon
Tue, 16 Mar 2010 06:16:21 -0700
You can refer to mapreduce documents. According to it, reasonable number of reducers is 0.95 * <maximum number of reducers per node> * <number of nodes>.
Actually, reducers are running in parallel with a number of mappers. They are copying output results of mappers and merging in background. After all mappers are finished, final merge sort is executed and actual reduce() functions are ready to be started. So, the best status is that reducing jobs including copying output results of mappers, merging and sorting those results and executing reduce() are finished in a single round. You don't have to run more reducers exceeding maximum capacity of hadoop clusters. In my opinion, 0.95 means that we should remain a few nodes to prepare machine's malfunctioning. 2010/3/16 Rob Stewart <robstewar...@googlemail.com>: > Hi, quick question: > > What is the origin of the formula for specifying the number of reducers for > a Pig job (using PARALLEL). > > I have it as: > <number of nodes> * <maximum number of reducers per node> * 0.9 > > Is there a theory or model behind this formula? > > Rob Stewart >