pig-user  

Re: enforcing number of mappers

prasenjit mukherjee
Sun, 24 Jan 2010 05:37:58 -0800

I am thinking of writing a two-line Pig script to do the job:
r1 = LOAD '/foo/*' USING PigStorage(' ') split by 'file';
r2 = STREAM r1 THROUGH `myscript`;

I am thinking of using Pig's "split by 'file'" clause. Basically, I can split
the single input file into many smaller files (using unix split), write a
simple script that does the s3 fetch and hdfs put, and then run that script
with the "stream" operator.
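
For illustration, here is a rough sketch of the splitting step, written in
Python instead of unix split. The function name, the "part-NNNNN" file names
and the chunk size are all made up for this example, and whether each
resulting file really gets its own mapper still depends on how Pig handles
"split by 'file'".

```python
import os


def split_requests(input_path, output_dir, lines_per_file=10):
    """Write the non-blank lines of input_path into numbered chunk files.

    Returns the list of chunk file paths, in order.
    """
    os.makedirs(output_dir, exist_ok=True)
    with open(input_path) as src:
        lines = [line for line in src if line.strip()]

    paths = []
    for start in range(0, len(lines), lines_per_file):
        # Hypothetical naming scheme: part-00000, part-00001, ...
        path = os.path.join(output_dir, "part-%05d" % (start // lines_per_file))
        with open(path, "w") as dst:
            dst.writelines(lines[start:start + lines_per_file])
        paths.append(path)
    return paths
```

With 300 request lines and 10 lines per file this produces 30 small files,
which would then be copied under the directory that the LOAD statement reads.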


Will that distribute the load? Is there any way (debug output, logs, etc.)
to find out how many nodes were used as mappers?
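
One possible way to check this from the job output itself, independent of
Pig's logs: have the streamed script tag every output line with the hostname
it ran on, then count the distinct hostnames afterwards. The sketch below is
only an assumption about how such a script could look; tag_lines,
count_hosts, main and placeholder_handler are hypothetical names, the real
s3 fetch / preprocess / hdfs put work is left as a stub, and request lines
are assumed to contain no tab characters.

```python
import socket
import sys


def tag_lines(lines, handler, host=None):
    """Run handler on each request line; yield 'request<TAB>host<TAB>status'."""
    host = host or socket.gethostname()
    for line in lines:
        request = line.rstrip("\n")
        if not request:
            continue  # skip blank lines
        try:
            status = handler(request)
        except Exception as exc:
            # Report the failure instead of killing the whole task.
            status = "FAILED: %s" % exc
        yield "%s\t%s\t%s" % (request, host, status)


def count_hosts(tagged_lines):
    """Count the distinct hosts found in lines produced by tag_lines."""
    return len({line.split("\t")[1] for line in tagged_lines if line.strip()})


def placeholder_handler(request):
    # Hypothetical stub: the s3 fetch, preprocessing and hdfs put go here.
    return "OK"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point when used as the command behind the stream operator."""
    for out in tag_lines(stdin, placeholder_handler):
        stdout.write(out + "\n")
```

Calling main() at the bottom of the file turns this into the streamed
command. Running count_hosts over the stored output lines then gives the
number of distinct nodes that processed requests. Note that this counts
nodes, not map tasks, since one node can run several mappers.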

-Prasen

On Sun, Jan 24, 2010 at 5:41 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> you need to write a custom slicer that will enforce your preferred
> strategy for determining # of mappers.
>
> Once the load/store redesign goes in, slicers will go away, and you
> will write custom hadoop partitioners instead.
> -D
>
> On Sun, Jan 24, 2010 at 2:45 AM, prasenjit mukherjee
> <prasen....@gmail.com> wrote:
> > I want to use Pig to parallelize processing of a number of requests.
> > There are ~300 requests which need to be processed. Processing each
> > request consists of the following:
> > 1. Fetch file from s3 to local
> > 2. Do some preprocessing
> > 3. Put it into hdfs
> >
> > My input is a small file with 300 lines. The problem is that Pig always
> > seems to create a single mapper, so the load is not properly
> > distributed. Is there any way to enforce splitting of small input files
> > as well? Below is the Pig output, which seems to indicate that there is
> > only 1 mapper. Let me know if my understanding is wrong.
> >
> > 2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> > 2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> > 2010-01-24 05:31:55,006 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> >
> > Thanks
> > -Prasen.
> >
>