Hi Mridul,
    Thanks your approach  works fine. This is how my current pig script
looks like :

define CMD `s3fetch.py` SHIP('/root/s3fetch.py');
r1 = LOAD '/ip/s3fetch_input_files' AS (filename:chararray);
grp_r1 = GROUP r1 BY filename PARALLEL 5;
r2 = FOREACH grp_r1 GENERATE FLATTEN(r1);
r3 = STREAM r2 through CMD;
store r3 INTO '/op/s3fetch_debug_log';

And here is my s3fetch.py :
for word in sys.stdin:
  word=word.rstrip()
  str='/usr/local/hadoop-0.20.0/bin/hadoop fs -cp
s3n://<s3-credentials>@bucket/dir-name/'+word+' /ip/data/.';
  sys.stdout.write('\n\n'+word+ ':\t'+str+'\n')
  (input_str,out_err) = os.popen4(str);
  for line in out_err.readlines():
    sys.stdout.write('\t'+word+'::\t'+line)



So, the job starts fine and I see that my hadoop directory ( /ip/data/.)
starts filling up with s3 files. But after sometime it gets stuck. I see
lots of failed/restarted jobs  in the jobtracker. And the number of files
dont increase in /ip/data.

Could this be happening because of parallel hdfs writes ( via hadoop fs -cp
<> <> ) making primary-name-node a blocking server ?

Any help is greatly appreciated.

-Thanks,
Prasen

On Mon, Jan 25, 2010 at 8:58 AM, Mridul Muralidharan
<[email protected]>wrote:

>
> If each line from your file has to be processed by a different mapper -
> other than by writing a custom slicer, a very dirty hack would be to :
> a) create N number of files with one line each.
> b) Or, do something like :
> input_lines = load 'my_s3_list_file' as (location_line:chararray);
> grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
> actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);
>
>
> The preferred way, as Dmitriy mentioned, would be to use a custom Slicer
> ofcourse !
>
> Regards,
> Mridul
>
>
> prasenjit mukherjee wrote:
>
>> I want to use Pig to paralelize processing on a number of  requests. There
>> are ~ 300 request which needs to be  processed. Each processing consist of
>> following :
>> 1. Fetch file from s3 to local
>> 2. Do some preprocessing
>> 3. Put it into hdfs
>>
>> My input is a small file with 300 lines. The problem is that pig seems to
>> be
>> always creating a single mapper, because of which the load is not properly
>> distributed. Any way I can enforce splitting of smaller input files as
>> well
>> ? Below is the pig output which tends to indicate that there is only 1
>> mapper. Let me know if my understanding is wrong.
>>
>> 2010-01-24 05:31:53,148 [main] INFO
>>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>> - MR plan size before optimization: 1
>> 2010-01-24 05:31:53,148 [main] INFO
>>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>> - MR plan size after optimization: 1
>> 2010-01-24 05:31:55,006 [main] INFO
>>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>> - Setting up single store job
>>
>> Thanks
>> -Prasen.
>>
>
>

Reply via email to