Mridul Muralidharan
Sun, 24 Jan 2010 19:29:08 -0800
If each line of your file has to be processed by a different mapper, then short of writing a custom slicer, a very dirty hack would be to:
a) Create N files with one line each.

b) Or do something like:

    input_lines = load 'my_s3_list_file' as (location_line:chararray);
    grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
    actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);

   (Note that PARALLEL sets the number of reduce tasks, so MY_S3_UDF runs in the reducers, once per distinct input line.)

The preferred way, as Dmitriy mentioned, would of course be to use a custom Slicer!
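Option (a) can be prepared with a small script before the Pig job runs. The following is only a minimal sketch in Python, assuming the list file is available on the local filesystem. The function name `split_into_single_line_files` and the `request-NNNNN.txt` naming scheme are hypothetical and not from the thread. Whether each resulting file actually gets its own mapper depends on the Pig and Hadoop versions and their input-split settings, so verify that on the cluster.

```python
from pathlib import Path


def split_into_single_line_files(list_file, out_dir):
    """Write each non-empty line of list_file to its own file in out_dir.

    Returns the list of paths written, in input order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Read all request lines first, skipping blank lines.
    with open(list_file, encoding="utf-8") as src:
        lines = [line.strip() for line in src if line.strip()]

    written = []
    for index, line in enumerate(lines):
        # Hypothetical naming scheme: request-00000.txt, request-00001.txt, ...
        target = out / f"request-{index:05d}.txt"
        target.write_text(line + "\n", encoding="utf-8")
        written.append(target)
    return written
```

The output directory would then be copied to HDFS and passed to `load` in place of the single list file.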
Regards,
Mridul

prasenjit mukherjee wrote:
I want to use Pig to parallelize processing of a number of requests. There are ~300 requests which need to be processed. Processing each one consists of the following steps:

1. Fetch a file from s3 to local
2. Do some preprocessing
3. Put it into hdfs

My input is a small file with 300 lines. The problem is that Pig always seems to create a single mapper, so the load is not properly distributed. Is there any way I can force small input files to be split as well?

Below is the Pig output, which seems to indicate that there is only 1 mapper. Let me know if my understanding is wrong.

2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2010-01-24 05:31:55,006 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

Thanks
-Prasen.