prasenjit mukherjee
Sun, 24 Jan 2010 02:46:07 -0800
I want to use Pig to parallelize processing of a number of requests. There are ~300 requests that need to be processed. Processing each request consists of the following steps:
1. Fetch a file from S3 to local disk
2. Do some preprocessing
3. Put it into HDFS
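To make the three steps concrete, here is a rough sketch in Python of what a single request involves. It is only an illustration under assumptions: the function names, the use of the `aws s3 cp` command for the fetch, and the whitespace-stripping placeholder for preprocessing are all hypothetical stand-ins, not the actual job. The `run` parameter exists only so the external commands can be swapped out.

```python
import subprocess
from pathlib import Path


def preprocess(local_path):
    """Hypothetical preprocessing step: strip trailing whitespace from each line."""
    path = Path(local_path)
    lines = path.read_text().splitlines()
    path.write_text("".join(line.rstrip() + "\n" for line in lines))


def process_request(s3_uri, hdfs_dir, work_dir, run=subprocess.check_call):
    """Handle one request: fetch from S3, preprocess locally, put into HDFS."""
    # Name the local copy after the last component of the S3 URI.
    local_path = Path(work_dir) / s3_uri.rsplit("/", 1)[-1]

    # Step 1: fetch the file from S3 to local disk (assumes the AWS CLI is installed).
    run(["aws", "s3", "cp", s3_uri, str(local_path)])

    # Step 2: do some preprocessing on the local copy.
    preprocess(local_path)

    # Step 3: put the result into HDFS.
    run(["hadoop", "fs", "-put", str(local_path), hdfs_dir])
    return local_path
```

Each of the ~300 input lines would map to one such call, which is why the work is only spread across the cluster if those lines end up in more than one mapper.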
My input is a small file with 300 lines. The problem is that Pig always seems to create a single mapper, so the load is not properly distributed. Is there any way to force splitting of small input files as well? Below is the Pig output, which seems to indicate that there is only 1 mapper. Let me know if my understanding is wrong.

2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2010-01-24 05:31:55,006 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

Thanks
-Prasen.