Do you know why the jobs are failing? Take a look at the logs. I suspect it may be due to s3, not hadoop.
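
If it is s3, one quick way to make that visible in the task logs is to have the streaming script report the exit status of each `hadoop fs -cp` instead of just echoing its output. A rough, untested sketch along the lines of your s3fetch.py below (I kept your hadoop path and s3n prefix from the thread; the fetch_one helper and the exact output format are just illustrative):

import subprocess
import sys

# Paths and prefix copied from the script in the thread; adjust as needed.
HADOOP = '/usr/local/hadoop-0.20.0/bin/hadoop'
SRC_PREFIX = 's3n://<s3-credentials>@bucket/dir-name/'
DEST_DIR = '/ip/data/.'

def fetch_one(name):
    # Copy one S3 object into HDFS, capturing stdout+stderr and the exit code.
    cmd = [HADOOP, 'fs', '-cp', SRC_PREFIX + name, DEST_DIR]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out

for line in sys.stdin:
    name = line.rstrip()
    if not name:
        continue
    rc, out = fetch_one(name)
    # A non-zero exit code here points at the s3n side (credentials, missing
    # key, throttling) rather than at Pig/Hadoop itself.
    sys.stdout.write('%s\texit=%d\n' % (name, rc))
    for msg in out.splitlines():
        sys.stdout.write('%s::\t%s\n' % (name, msg))
    sys.stdout.flush()

That should put an explicit exit code next to every file name in the part files under /op/s3fetch_debug_log, which makes it easier to tell an s3 failure from a task that is merely stuck.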
-D

On Tue, Jan 26, 2010 at 7:57 AM, prasenjit mukherjee <[email protected]> wrote:

> Hi Mridul,
> Thanks, your approach works fine. This is how my current pig script
> looks:
>
> define CMD `s3fetch.py` SHIP('/root/s3fetch.py');
> r1 = LOAD '/ip/s3fetch_input_files' AS (filename:chararray);
> grp_r1 = GROUP r1 BY filename PARALLEL 5;
> r2 = FOREACH grp_r1 GENERATE FLATTEN(r1);
> r3 = STREAM r2 through CMD;
> store r3 INTO '/op/s3fetch_debug_log';
>
> And here is my s3fetch.py:
>
> import sys, os
>
> for word in sys.stdin:
>     word = word.rstrip()
>     str = '/usr/local/hadoop-0.20.0/bin/hadoop fs -cp s3n://<s3-credentials>@bucket/dir-name/' + word + ' /ip/data/.'
>     sys.stdout.write('\n\n' + word + ':\t' + str + '\n')
>     (input_str, out_err) = os.popen4(str)
>     for line in out_err.readlines():
>         sys.stdout.write('\t' + word + '::\t' + line)
>
> The job starts fine and I see that my hadoop directory (/ip/data/.) starts
> filling up with s3 files. But after some time it gets stuck: I see lots of
> failed/restarted jobs in the jobtracker, and the number of files doesn't
> increase in /ip/data.
>
> Could this be happening because the parallel hdfs writes (via hadoop fs -cp
> <> <>) turn the primary name node into a blocking server?
>
> Any help is greatly appreciated.
>
> -Thanks,
> Prasen
>
> On Mon, Jan 25, 2010 at 8:58 AM, Mridul Muralidharan <[email protected]> wrote:
>
>> If each line from your file has to be processed by a different mapper -
>> other than by writing a custom slicer, a very dirty hack would be to:
>> a) create N files with one line each, or
>> b) do something like:
>>
>> input_lines = load 'my_s3_list_file' as (location_line:chararray);
>> grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
>> actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);
>>
>> The preferred way, as Dmitriy mentioned, would of course be to use a custom
>> Slicer!
>>
>> Regards,
>> Mridul
>>
>> prasenjit mukherjee wrote:
>>
>>> I want to use Pig to parallelize processing of a number of requests. There
>>> are ~300 requests which need to be processed. Each one consists of the
>>> following:
>>> 1. Fetch the file from s3 to local
>>> 2. Do some preprocessing
>>> 3. Put it into hdfs
>>>
>>> My input is a small file with 300 lines. The problem is that pig always
>>> seems to create a single mapper, because of which the load is not properly
>>> distributed. Is there any way I can enforce splitting of small input files
>>> as well? Below is the pig output, which seems to indicate that there is
>>> only 1 mapper. Let me know if my understanding is wrong.
>>>
>>> 2010-01-24 05:31:53,148 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>> - MR plan size before optimization: 1
>>> 2010-01-24 05:31:53,148 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>> - MR plan size after optimization: 1
>>> 2010-01-24 05:31:55,006 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>>> - Setting up single store job
>>>
>>> Thanks
>>> -Prasen.
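P.S. On the single-mapper question further down the thread: if the GROUP ... PARALLEL workaround keeps being awkward, Mridul's option (a) is easy to script. Something like the untested snippet below (the file names, output directory, and usage line are just placeholders) writes each line of the list to its own one-line file; loading that directory instead of the single 300-line file should then give Pig one split, and hence one map task, per file:

import os
import sys

def split_one_per_line(list_file, out_dir):
    # Write each non-empty line of list_file to its own one-line file.
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    src = open(list_file)
    try:
        for i, line in enumerate(src):
            line = line.strip()
            if not line:
                continue
            dst = open(os.path.join(out_dir, 'part-%05d' % i), 'w')
            dst.write(line + '\n')
            dst.close()
    finally:
        src.close()

if __name__ == '__main__':
    # usage: python split_list.py my_s3_list_file split_input
    split_one_per_line(sys.argv[1], sys.argv[2])

After a `hadoop fs -put split_input <some hdfs dir>` you can point the LOAD at that directory rather than at the original list file.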
