Do you know why the jobs are failing? Take a look at the logs. I suspect it may be due to s3, not hadoop.
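
If it is s3, one quick way to make that visible in the task logs is to have the streaming script report the exit status of each `hadoop fs -cp` instead of just echoing its output. A rough, untested sketch along the lines of your s3fetch.py below (I kept your hadoop path and s3n prefix from the thread; the fetch_one helper and the exact output format are just illustrative):

import subprocess
import sys

# Paths and prefix copied from the script in the thread; adjust as needed.
HADOOP = '/usr/local/hadoop-0.20.0/bin/hadoop'
SRC_PREFIX = 's3n://<s3-credentials>@bucket/dir-name/'
DEST_DIR = '/ip/data/.'

def fetch_one(name):
    # Copy one S3 object into HDFS, capturing stdout+stderr and the exit code.
    cmd = [HADOOP, 'fs', '-cp', SRC_PREFIX + name, DEST_DIR]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out

for line in sys.stdin:
    name = line.rstrip()
    if not name:
        continue
    rc, out = fetch_one(name)
    # A non-zero exit code here points at the s3n side (credentials, missing
    # key, throttling) rather than at Pig/Hadoop itself.
    sys.stdout.write('%s\texit=%d\n' % (name, rc))
    for msg in out.splitlines():
        sys.stdout.write('%s::\t%s\n' % (name, msg))
    sys.stdout.flush()

That should put an explicit exit code next to every file name in the part files under /op/s3fetch_debug_log, which makes it easier to tell an s3 failure from a task that is merely stuck.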
-D

On Tue, Jan 26, 2010 at 7:57 AM, prasenjit mukherjee <[email protected]> wrote:

> Hi Mridul,
> Thanks, your approach works fine. This is how my current pig script
> looks:
>
> define CMD `s3fetch.py` SHIP('/root/s3fetch.py');
> r1 = LOAD '/ip/s3fetch_input_files' AS (filename:chararray);
> grp_r1 = GROUP r1 BY filename PARALLEL 5;
> r2 = FOREACH grp_r1 GENERATE FLATTEN(r1);
> r3 = STREAM r2 through CMD;
> store r3 INTO '/op/s3fetch_debug_log';
>
> And here is my s3fetch.py:
>
> import sys, os
>
> for word in sys.stdin:
>     word = word.rstrip()
>     str = '/usr/local/hadoop-0.20.0/bin/hadoop fs -cp s3n://<s3-credentials>@bucket/dir-name/' + word + ' /ip/data/.'
>     sys.stdout.write('\n\n' + word + ':\t' + str + '\n')
>     (input_str, out_err) = os.popen4(str)
>     for line in out_err.readlines():
>         sys.stdout.write('\t' + word + '::\t' + line)
>
> The job starts fine and I see that my hadoop directory (/ip/data/.) starts
> filling up with s3 files. But after some time it gets stuck: I see lots of
> failed/restarted jobs in the jobtracker, and the number of files doesn't
> increase in /ip/data.
>
> Could this be happening because the parallel hdfs writes (via hadoop fs -cp
> <> <>) turn the primary name node into a blocking server?
>
> Any help is greatly appreciated.
>
> -Thanks,
> Prasen
>
> On Mon, Jan 25, 2010 at 8:58 AM, Mridul Muralidharan <[email protected]> wrote:
>
>> If each line from your file has to be processed by a different mapper -
>> other than by writing a custom slicer, a very dirty hack would be to:
>> a) create N files with one line each, or
>> b) do something like:
>>
>> input_lines = load 'my_s3_list_file' as (location_line:chararray);
>> grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
>> actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);
>>
>> The preferred way, as Dmitriy mentioned, would of course be to use a custom
>> Slicer!
>>
>> Regards,
>> Mridul
>>
>> prasenjit mukherjee wrote:
>>
>>> I want to use Pig to parallelize processing of a number of requests. There
>>> are ~300 requests which need to be processed. Each one consists of the
>>> following:
>>> 1. Fetch the file from s3 to local
>>> 2. Do some preprocessing
>>> 3. Put it into hdfs
>>>
>>> My input is a small file with 300 lines. The problem is that pig always
>>> seems to create a single mapper, because of which the load is not properly
>>> distributed. Is there any way I can enforce splitting of small input files
>>> as well? Below is the pig output, which seems to indicate that there is
>>> only 1 mapper. Let me know if my understanding is wrong.
>>>
>>> 2010-01-24 05:31:53,148 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>> - MR plan size before optimization: 1
>>> 2010-01-24 05:31:53,148 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>> - MR plan size after optimization: 1
>>> 2010-01-24 05:31:55,006 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>>> - Setting up single store job
>>>
>>> Thanks
>>> -Prasen.
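P.S. On the single-mapper question further down the thread: if the GROUP ... PARALLEL workaround keeps being awkward, Mridul's option (a) is easy to script. Something like the untested snippet below (the file names, output directory, and usage line are just placeholders) writes each line of the list to its own one-line file; loading that directory instead of the single 300-line file should then give Pig one split, and hence one map task, per file:

import os
import sys

def split_one_per_line(list_file, out_dir):
    # Write each non-empty line of list_file to its own one-line file.
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    src = open(list_file)
    try:
        for i, line in enumerate(src):
            line = line.strip()
            if not line:
                continue
            dst = open(os.path.join(out_dir, 'part-%05d' % i), 'w')
            dst.write(line + '\n')
            dst.close()
    finally:
        src.close()

if __name__ == '__main__':
    # usage: python split_list.py my_s3_list_file split_input
    split_one_per_line(sys.argv[1], sys.argv[2])

After a `hadoop fs -put split_input <some hdfs dir>` you can point the LOAD at that directory rather than at the original list file.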
