Dmitriy Ryaboy
Sun, 24 Jan 2010 04:12:24 -0800
You need to write a custom slicer that enforces your preferred strategy for determining the number of mappers.
Once the load/store redesign goes in, slicers will go away, and you will write custom Hadoop partitioners instead.

-D

On Sun, Jan 24, 2010 at 2:45 AM, prasenjit mukherjee <prasen....@gmail.com> wrote:
> I want to use Pig to parallelize processing of a number of requests. There
> are ~300 requests which need to be processed. Processing each one consists
> of the following:
> 1. Fetch file from s3 to local
> 2. Do some preprocessing
> 3. Put it into hdfs
>
> My input is a small file with 300 lines. The problem is that Pig always
> seems to create a single mapper, so the load is not properly distributed.
> Is there any way I can force small input files to be split as well?
> Below is the Pig output, which seems to indicate that there is only 1
> mapper. Let me know if my understanding is wrong.
>
> 2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> 2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> 2010-01-24 05:31:55,006 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>
> Thanks
> -Prasen.
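
For illustration only, here is a rough sketch of the kind of splitting strategy a custom slicer could enforce for the 300-line input file discussed above. It is plain Python and does not use the Pig slicer API. The function name plan_slices and its parameters are hypothetical, and the number of slices is an assumed value you would choose yourself. It only shows how a small input can be divided into several contiguous line ranges so that each range could be handed to its own mapper.

```python
def plan_slices(num_records, num_slices):
    """Divide record indices [0, num_records) into contiguous half-open ranges.

    Hypothetical helper: this is not part of Pig or Hadoop. It only models
    the bookkeeping a custom slicer might do when deciding how many mappers
    a small input file should be spread across.
    """
    if num_records < 0 or num_slices <= 0:
        raise ValueError("num_records must be >= 0 and num_slices must be > 0")
    if num_records == 0:
        return []

    # Never create more slices than there are records.
    num_slices = min(num_slices, num_records)

    # Spread the remainder over the first `extra` slices so that
    # slice sizes differ by at most one record.
    base, extra = divmod(num_records, num_slices)

    ranges = []
    start = 0
    for i in range(num_slices):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # Assumed example: 300 request lines spread over 10 slices.
    for begin, end in plan_slices(300, 10):
        print(f"lines {begin}..{end - 1}")
```

With 300 records and 10 slices, each range covers 30 lines. A real slicer would still need to map these line ranges to byte offsets in the input file and return them in whatever form the Pig version in use expects.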