pig-user  

Re: enforcing number of mappers

Mridul Muralidharan
Sun, 24 Jan 2010 19:29:08 -0800


If each line of your file has to be processed by a different mapper, then other than writing a custom slicer, a very dirty hack would be to:
a) Create N files with one line each.
b) Or do something like:
input_lines = LOAD 'my_s3_list_file' AS (location_line:chararray);
-- PARALLEL sets the number of reducers, so MY_S3_UDF runs in the reduce phase.
grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);
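
For option (a), here is a minimal sketch of the splitting step, written in Python rather than Pig. The function name, the part-NNNNN.txt naming and the paths are made up for illustration. It assumes the list file is small enough to read into memory, and that each small input file ends up as its own split (and so its own mapper).

```python
from pathlib import Path


def split_into_single_line_files(list_file, out_dir):
    """Write each non-empty line of list_file to its own file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Read all non-empty lines up front; the list file is assumed to be small.
    with open(list_file, encoding="utf-8") as src:
        lines = [line.strip() for line in src if line.strip()]

    written = []
    for index, line in enumerate(lines):
        # Zero-padded names keep the files in input order when listed.
        part = out / f"part-{index:05d}.txt"
        part.write_text(line + "\n", encoding="utf-8")
        written.append(part)
    return written


# Hypothetical usage: split the list locally, then copy the output
# directory to HDFS and point the Pig LOAD at that directory.
# split_into_single_line_files("my_s3_list_file", "my_s3_list_parts")
```

The output directory would then be loaded in place of the single list file, so that the ~300 one-line files give the job that many inputs to spread across mappers.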


The preferred way, as Dmitriy mentioned, would of course be to use a custom Slicer.

Regards,
Mridul

prasenjit mukherjee wrote:
I want to use Pig to parallelize processing of a number of requests. There
are ~300 requests which need to be processed. Processing each request
consists of the following:
1. Fetch file from s3 to local
2. Do some preprocessing
3. Put it into hdfs

My input is a small file with 300 lines. The problem is that Pig always
seems to create a single mapper, so the load is not properly distributed.
Is there any way I can force splitting of small input files as well?
Below is the Pig output, which seems to indicate that there is only 1
mapper. Let me know if my understanding is wrong.

2010-01-24 05:31:53,148 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 1
2010-01-24 05:31:53,148 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 1
2010-01-24 05:31:55,006 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job

Thanks
-Prasen.