prasenjit mukherjee
Sun, 24 Jan 2010 05:37:58 -0800
I am thinking of writing a 2-line Pig script to do the job.
r1 = LOAD '/foo/*' USING PigStorage(' ') split by 'file';
stream r1 through `myscript`;
Thinking of using Pig's "split by 'file'" clause. Basically, I could split the single input file into many smaller files (using unix split), write a simple script that does the s3fetch/hdfs-put, and then use the "stream" operator. Will that distribute the load? Is there any way (debug/log etc.) to find out how many nodes were used as mappers?

-Prasen

On Sun, Jan 24, 2010 at 5:41 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> You need to write a custom slicer that will enforce your preferred
> strategy for determining # of mappers.
>
> Once the load/store redesign goes in, slicers will go away, and you
> will write custom hadoop partitioners instead.
> -D
>
> On Sun, Jan 24, 2010 at 2:45 AM, prasenjit mukherjee
> <prasen....@gmail.com> wrote:
> > I want to use Pig to parallelize processing of a number of requests.
> > There are ~300 requests that need to be processed. Processing each
> > request consists of the following:
> > 1. Fetch file from s3 to local
> > 2. Do some preprocessing
> > 3. Put it into hdfs
> >
> > My input is a small file with 300 lines. The problem is that Pig seems
> > to always create a single mapper, so the load is not properly
> > distributed. Is there any way to enforce splitting of smaller input
> > files as well? Below is the Pig output, which seems to indicate that
> > there is only 1 mapper. Let me know if my understanding is wrong.
> >
> > 2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> > 2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> > 2010-01-24 05:31:55,006 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> >
> > Thanks
> > -Prasen.
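
For reference, the "split the single input file into many" step mentioned above could be sketched roughly as follows, in Python rather than unix split. This is only an illustration: the function name split_requests, the part-NNNNN file naming, and the lines_per_file parameter are all made up here, and whether Pig then runs one mapper per file depends on the Pig version and loader settings, which this thread does not confirm.

```python
import os


def split_requests(input_path, out_dir, lines_per_file=1):
    """Split a request list into small part files (hypothetical helper).

    Every `lines_per_file` non-blank lines of `input_path` are written to
    their own file under `out_dir`. Returns the part file paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)

    # The input is small (~300 lines in the thread), so read it all at once.
    with open(input_path) as src:
        lines = [line for line in src if line.strip()]

    paths = []
    for start in range(0, len(lines), lines_per_file):
        # Assumed naming scheme: part-00000, part-00001, ...
        path = os.path.join(out_dir, "part-%05d" % (start // lines_per_file))
        with open(path, "w") as dst:
            dst.writelines(lines[start:start + lines_per_file])
        paths.append(path)
    return paths
```

With 300 requests and lines_per_file=10 this would produce 30 part files, which could then be copied into the directory that the LOAD statement reads (for example with hadoop fs -put) before running the script.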