You can set it through Pig as well, as you mentioned. The advantage is that instead of setting it permanently to a high value through hadoop-site.xml (which would then affect all subsequent Hadoop jobs on your cluster), through Pig you can set it on a per-job basis.
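For example, if you drive your runs from a small wrapper script, each invocation carries its own override and nothing changes cluster-wide. This is only a sketch: it assumes the pig launcher is on your PATH and simply reuses the command line you asked about below.

import subprocess

# mapred.task.timeout is in milliseconds; 0 disables the timeout entirely.
# The -D flag applies only to this invocation, not to other jobs on the cluster.
subprocess.call(['pig', '-Dmapred.task.timeout=0', '-f', 'myfile.pig'])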
Ashutosh

On Wed, Jan 27, 2010 at 21:55, prasenjit mukherjee <[email protected]> wrote:
> Not sure I understand. Are you saying that pig takes -D<> parameters
> directly? Will the following work:
>
> "pig -Dmapred.task.timeout=0 -f myfile.pig"
>
>
> On Thu, Jan 28, 2010 at 11:08 AM, Amogh Vasekar <[email protected]> wrote:
>
>> Hi,
>> You should be able to pass this as a cmd line argument using -D ... If you
>> want to change it for all jobs on your own cluster, it would be in
>> mapred-site.
>>
>> Amogh
>>
>>
>> On 1/28/10 11:03 AM, "prasenjit mukherjee" <[email protected]> wrote:
>>
>> Thanks Amogh for your quick response. Will changing this property only on
>> the master's hadoop-site.xml do, or do I need to do it on all the slaves
>> as well?
>>
>> Is there any way I can do this from PIG (or I guess I am asking too much
>> here :) )
>>
>> On Thu, Jan 28, 2010 at 10:57 AM, Amogh Vasekar <[email protected]> wrote:
>>
>> > Yes, the parameter is mapred.task.timeout, in ms.
>> > You can also update status / output to stdout after some time chunks to
>> > avoid this :)
>> >
>> > Amogh
>> >
>> >
>> > On 1/28/10 10:52 AM, "prasenjit mukherjee" <[email protected]> wrote:
>> >
>> > Now I see. The tasks are failing with the following error message:
>> >
>> > *Task attempt_201001272359_0001_r_000000_0 failed to report status for
>> > 600 seconds. Killing!*
>> >
>> > Looks like hadoop kills/restarts tasks which go more than 600 seconds
>> > without reporting status. Is there any way I can increase it to some
>> > very high number?
>> >
>> > -Thanks,
>> > Prasenjit
>> >
>> >
>> > On Tue, Jan 26, 2010 at 9:55 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> > >
>> > > Do you know why the jobs are failing? Take a look at the logs. I
>> > > suspect it may be due to s3, not hadoop.
>> > >
>> > > -D
>> > >
>> > > On Tue, Jan 26, 2010 at 7:57 AM, prasenjit mukherjee
>> > > <[email protected]> wrote:
>> > > > Hi Mridul,
>> > > >     Thanks, your approach works fine. This is what my current pig
>> > > > script looks like:
>> > > >
>> > > > define CMD `s3fetch.py` SHIP('/root/s3fetch.py');
>> > > > r1 = LOAD '/ip/s3fetch_input_files' AS (filename:chararray);
>> > > > grp_r1 = GROUP r1 BY filename PARALLEL 5;
>> > > > r2 = FOREACH grp_r1 GENERATE FLATTEN(r1);
>> > > > r3 = STREAM r2 through CMD;
>> > > > store r3 INTO '/op/s3fetch_debug_log';
>> > > >
>> > > > And here is my s3fetch.py:
>> > > > import sys, os
>> > > > for word in sys.stdin:
>> > > >     word = word.rstrip()
>> > > >     str = '/usr/local/hadoop-0.20.0/bin/hadoop fs -cp s3n://<s3-credentials>@bucket/dir-name/' + word + ' /ip/data/.'
>> > > >     sys.stdout.write('\n\n' + word + ':\t' + str + '\n')
>> > > >     (input_str, out_err) = os.popen4(str)
>> > > >     for line in out_err.readlines():
>> > > >         sys.stdout.write('\t' + word + '::\t' + line)
>> > > >
>> > > > So, the job starts fine and I see that my hadoop directory
>> > > > (/ip/data/.) starts filling up with s3 files. But after some time it
>> > > > gets stuck. I see lots of failed/restarted tasks in the jobtracker,
>> > > > and the number of files doesn't increase in /ip/data.
>> > > >
>> > > > Could this be happening because of parallel hdfs writes (via hadoop
>> > > > fs -cp <> <>) making the primary name-node a blocking server?
>> > > >
>> > > > Any help is greatly appreciated.
>> > > >
>> > > > -Thanks,
>> > > > Prasen
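Regarding the 600-second kills above: besides raising mapred.task.timeout, Amogh's suggestion of emitting some output periodically can be done inside s3fetch.py itself. Below is a rough, untested sketch of that idea. It assumes that producing an output record every couple of minutes is enough to keep the task reporting progress, and that the extra heartbeat lines are acceptable since r3 is only stored as a debug log.

import subprocess
import sys
import time

HADOOP = '/usr/local/hadoop-0.20.0/bin/hadoop'

for word in sys.stdin:
    word = word.rstrip()
    cmd = [HADOOP, 'fs', '-cp',
           's3n://<s3-credentials>@bucket/dir-name/' + word, '/ip/data/.']
    sys.stdout.write('\n' + word + ':\t' + ' '.join(cmd) + '\n')
    sys.stdout.flush()
    # hadoop fs -cp prints very little, so leaving the pipe unread until the
    # command exits will not fill the pipe buffer.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    waited = 0
    while proc.poll() is None:        # copy still running
        time.sleep(10)
        waited += 10
        if waited % 120 == 0:         # heartbeat roughly every two minutes
            sys.stdout.write(word + '::\tstill copying...\n')
            sys.stdout.flush()
    for line in proc.stdout.readlines():
        sys.stdout.write('\t' + word + '::\t' + line)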
>> > > >
>> > > > On Mon, Jan 25, 2010 at 8:58 AM, Mridul Muralidharan
>> > > > <[email protected]> wrote:
>> > > >
>> > > >>
>> > > >> If each line from your file has to be processed by a different
>> > > >> mapper - other than by writing a custom slicer, a very dirty hack
>> > > >> would be to:
>> > > >> a) create N number of files with one line each.
>> > > >> b) Or, do something like:
>> > > >> input_lines = load 'my_s3_list_file' as (location_line:chararray);
>> > > >> grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
>> > > >> actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);
>> > > >>
>> > > >>
>> > > >> The preferred way, as Dmitriy mentioned, would be to use a custom
>> > > >> Slicer of course!
>> > > >>
>> > > >> Regards,
>> > > >> Mridul
>> > > >>
>> > > >>
>> > > >> prasenjit mukherjee wrote:
>> > > >>
>> > > >>> I want to use Pig to parallelize processing on a number of
>> > > >>> requests. There are ~300 requests which need to be processed. Each
>> > > >>> processing consists of the following:
>> > > >>> 1. Fetch file from s3 to local
>> > > >>> 2. Do some preprocessing
>> > > >>> 3. Put it into hdfs
>> > > >>>
>> > > >>> My input is a small file with 300 lines. The problem is that pig
>> > > >>> seems to be always creating a single mapper, because of which the
>> > > >>> load is not properly distributed. Is there any way I can enforce
>> > > >>> splitting of smaller input files as well? Below is the pig output,
>> > > >>> which tends to indicate that there is only 1 mapper. Let me know
>> > > >>> if my understanding is wrong.
>> > > >>>
>> > > >>> 2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>> > > >>> 2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>> > > >>> 2010-01-24 05:31:55,006 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>> > > >>>
>> > > >>> Thanks
>> > > >>> -Prasen.
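One more note on Mridul's option (a) from the bottom of the thread: if you would rather materialize one file per line than go through the GROUP ... PARALLEL trick, a tiny local helper like the sketch below works, since with one line per file each file becomes its own map task (assuming small files are not being combined into a single split). The file and directory names here are just examples.

import os
import sys

def split_one_per_line(list_file, out_dir):
    # Write each non-empty line of list_file into its own small file.
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    f = open(list_file)
    for i, line in enumerate(f):
        line = line.strip()
        if not line:
            continue
        out = open(os.path.join(out_dir, 'part-%05d' % i), 'w')
        out.write(line + '\n')
        out.close()
    f.close()

if __name__ == '__main__':
    # e.g.: python split_list.py my_s3_list_file s3fetch_inputs
    # then:  hadoop fs -put s3fetch_inputs /ip/s3fetch_input_files
    split_one_per_line(sys.argv[1], sys.argv[2])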
