Not sure I understand. Are you saying that pig takes -D<> parameters
directly? Will the following work:

"pig -Dmapred.task.timeout=0 -f myfile.pig"


On Thu, Jan 28, 2010 at 11:08 AM, Amogh Vasekar <[email protected]> wrote:

> Hi,
> You should be able to pass this as a cmd line argument using -D ... If you
> want to change it for all jobs on your own cluster, set it in
> mapred-site.xml.
>
> Amogh
>
>
> On 1/28/10 11:03 AM, "prasenjit mukherjee" <[email protected]>
> wrote:
>
> Thanks Amogh for your quick response. Will changing this property only in
> the master's hadoop-site.xml do, or do I need to change it on all the
> slaves as well?
>
> Any way I can do this from Pig (or I guess I am asking too much here :) )
>
> On Thu, Jan 28, 2010 at 10:57 AM, Amogh Vasekar <[email protected]>
> wrote:
>
> > Yes, the parameter is mapred.task.timeout, specified in milliseconds.
> > You can also update the status / write output to stdout every so often to
> > avoid this :)
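> >
> > (A rough sketch of that idea -- hypothetical, not the exact script from
> > this thread -- is to keep printing a heartbeat line while the long-running
> > command is still executing, so the task keeps producing output and is not
> > killed for inactivity:)
> >
> > import subprocess
> > import sys
> > import time
> >
> > def run_with_heartbeat(cmd, interval=60):
> >     # Launch the long-running command; assumes it prints very little
> >     # output (e.g. "hadoop fs -cp"), so the pipe won't fill up while
> >     # we sleep between polls.
> >     proc = subprocess.Popen(cmd, shell=True,
> >                             stdout=subprocess.PIPE,
> >                             stderr=subprocess.STDOUT)
> >     while proc.poll() is None:
> >         # Emit something periodically so Hadoop sees the task is alive.
> >         sys.stdout.write('heartbeat: still running\n')
> >         sys.stdout.flush()
> >         time.sleep(interval)
> >     # Drain whatever the command printed once it has finished.
> >     for line in proc.stdout:
> >         sys.stdout.write('\t' + line)
> >     return proc.returncode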
> >
> > Amogh
> >
> >
> > On 1/28/10 10:52 AM, "prasenjit mukherjee" <[email protected]> wrote:
> >
> > Now I see. The tasks are failing with the following error message:
> >
> > *Task attempt_201001272359_0001_r_000000_0 failed to report status for
> > 600 seconds. Killing!*
> >
> > Looks like Hadoop kills/restarts tasks that take more than 600 seconds
> > without reporting progress. Is there any way I can increase it to some
> > very high number?
> >
> > -Thanks,
> > Prasenjit
> >
> >
> >
> > On Tue, Jan 26, 2010 at 9:55 PM, Dmitriy Ryaboy <[email protected]>
> > wrote:
> > >
> > > Do you know why the jobs are failing? Take a look at the logs. I
> > > suspect it may be due to S3, not Hadoop.
> > >
> > > -D
> > >
> > > On Tue, Jan 26, 2010 at 7:57 AM, prasenjit mukherjee
> > > <[email protected]> wrote:
> > > > Hi Mridul,
> > > >    Thanks, your approach works fine. This is what my current Pig script
> > > > looks like:
> > > >
> > > > define CMD `s3fetch.py` SHIP('/root/s3fetch.py');
> > > > r1 = LOAD '/ip/s3fetch_input_files' AS (filename:chararray);
> > > > grp_r1 = GROUP r1 BY filename PARALLEL 5;
> > > > r2 = FOREACH grp_r1 GENERATE FLATTEN(r1);
> > > > r3 = STREAM r2 through CMD;
> > > > store r3 INTO '/op/s3fetch_debug_log';
> > > >
> > > > And here is my s3fetch.py:
> > > >
> > > > import os
> > > > import sys
> > > >
> > > > for word in sys.stdin:
> > > >     # each input line is one file name to fetch from S3 into HDFS
> > > >     word = word.rstrip()
> > > >     cmd = ('/usr/local/hadoop-0.20.0/bin/hadoop fs -cp '
> > > >            's3n://<s3-credentials>@bucket/dir-name/' + word + ' /ip/data/.')
> > > >     sys.stdout.write('\n\n' + word + ':\t' + cmd + '\n')
> > > >     # run the copy and echo its output, prefixed with the file name
> > > >     (child_stdin, out_err) = os.popen4(cmd)
> > > >     for line in out_err.readlines():
> > > >         sys.stdout.write('\t' + word + '::\t' + line)
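> > > >
> > > > (A hypothetical variant of the copy step -- the script above never
> > > > checks whether each "hadoop fs -cp" actually succeeded, so recording
> > > > the exit status with subprocess.call may help tell S3 failures apart
> > > > from Hadoop ones; sketch only, names are illustrative:)
> > > >
> > > > import subprocess
> > > > import sys
> > > >
> > > > for word in sys.stdin:
> > > >     word = word.rstrip()
> > > >     cmd = ('/usr/local/hadoop-0.20.0/bin/hadoop fs -cp '
> > > >            's3n://<s3-credentials>@bucket/dir-name/' + word + ' /ip/data/.')
> > > >     # call() blocks until the copy finishes and returns its exit code;
> > > >     # 0 means success, anything else means the copy failed.
> > > >     rc = subprocess.call(cmd, shell=True)
> > > >     sys.stdout.write(word + '\tcopy exit code: ' + str(rc) + '\n')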
> > > >
> > > >
> > > >
> > > > So, the job starts fine and I see that my HDFS directory (/ip/data/.)
> > > > starts filling up with S3 files. But after some time it gets stuck. I
> > > > see lots of failed/restarted tasks in the jobtracker, and the number of
> > > > files in /ip/data stops increasing.
> > > >
> > > > Could this be happening because the parallel HDFS writes (via hadoop fs
> > > > -cp <> <>) turn the primary namenode into a blocking server?
> > > >
> > > > Any help is greatly appreciated.
> > > >
> > > > -Thanks,
> > > > Prasen
> > > >
> > > > On Mon, Jan 25, 2010 at 8:58 AM, Mridul Muralidharan
> > > > <[email protected]> wrote:
> > > >
> > > >>
> > > >> If each line from your file has to be processed by a different mapper
> > > >> - other than by writing a custom slicer, a very dirty hack would be to:
> > > >> a) create N files with one line each, or
> > > >> b) do something like:
> > > >> input_lines = load 'my_s3_list_file' as (location_line:chararray);
> > > >> grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
> > > >> actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);
> > > >>
> > > >>
> > > >> The preferred way, as Dmitriy mentioned, would of course be to use a
> > > >> custom Slicer!
> > > >>
> > > >> Regards,
> > > >> Mridul
> > > >>
> > > >>
> > > >> prasenjit mukherjee wrote:
> > > >>
> > > >>> I want to use Pig to parallelize processing of a number of requests.
> > > >>> There are ~300 requests which need to be processed. Each one consists
> > > >>> of the following:
> > > >>> 1. Fetch a file from S3 to local disk
> > > >>> 2. Do some preprocessing
> > > >>> 3. Put it into HDFS
> > > >>>
> > > >>> My input is a small file with 300 lines. The problem is that Pig seems
> > > >>> to always create a single mapper, because of which the load is not
> > > >>> properly distributed. Is there any way I can enforce splitting of
> > > >>> smaller input files as well? Below is the Pig output, which seems to
> > > >>> indicate that there is only one mapper. Let me know if my
> > > >>> understanding is wrong.
> > > >>>
> > > >>> 2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> > > >>> 2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> > > >>> 2010-01-24 05:31:55,006 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> > > >>>
> > > >>> Thanks
> > > >>> -Prasen.
> > > >>>
> > > >>
> > > >>
> > > >
> >
> >
>
>
