I have a pig script which does the following (a rough sketch of its shape is below):

1) Loads the data into a relation
2) Splits the relation data into n sets based on some conditions
3) Runs some processing on each of these n splits and creates 2 outputs from each split
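For concreteness, a minimal sketch of the shape of such a script -- the relation names, the split conditions, and the output paths are illustrative placeholders, not the actual script:

raw = LOAD 'input.data' AS (id:int, category:chararray, value:double);

-- split the relation into n sets based on some conditions (n = 2 shown)
SPLIT raw INTO set1 IF category == 'a', set2 IF category == 'b';

-- some processing on each split, producing 2 outputs per split
set1_hi = FILTER set1 BY value > 100.0;
set1_lo = FILTER set1 BY value <= 100.0;
STORE set1_hi INTO 'out/set1_hi';
STORE set1_lo INTO 'out/set1_lo';

set2_hi = FILTER set2 BY value > 100.0;
set2_lo = FILTER set2 BY value <= 100.0;
STORE set2_hi INTO 'out/set2_hi';
STORE set2_lo INTO 'out/set2_lo';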
And then from another node I upload the result (the 2*n relations) to s3. Since n could be in the tens, I was hoping to use pig's DEFINE command to distribute the uploading of these 2*n relations. If only I could figure out a way to wait until processing of each of the n splits is done, I could kick off the uploading for that split. I am currently trying to do the synchronization with a separate set of pig scripts, using custom python code that waits until it sees the file in HDFS.

Thanks,
-Prasen

On Tue, Feb 16, 2010 at 3:00 PM, Dmitriy Ryaboy <[email protected]> wrote:
> Not that I am aware of, perhaps one of the others can chime in.
>
> What are you trying to unblock though? The only action that I see happening
> after your exec statement immediately requires out1 to have been generated.
>
> Do you also have some D that you generate independently of out1, but
> deriving it from B? If yes, put B = load 'data2'; D = FILTER B (or whatever)
> before the exec.
>
> Better yet -- avoid such dependencies :-).
>
> -D
>
> On Tue, Feb 16, 2010 at 1:21 AM, prasenjit mukherjee <
> [email protected]> wrote:
>
>> Still confusing. So the entire execution (from B = LOAD ..... onwards)
>> will be blocked till 'out1' is stored?
>>
>> > STORE A INTO 'out1';
>> > EXEC;
>> > B = LOAD 'data2';
>> > C = FOREACH B GENERATE MYUDF($0,'out1');
>> > STORE C INTO 'out2';
>>
>> Any way to restrict it to a particular block?
>>
>> -Prasen
>>
>> On Tue, Feb 16, 2010 at 2:35 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> > It's been in the docs since 0.3
>> >
>> > http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html
>> > Implicit Dependencies
>> >
>> > If a script has dependencies on the execution order outside of what Pig
>> > knows about, execution may fail. For instance, in this script MYUDF might
>> > try to read from out1, a file that A was just stored into. However, Pig
>> > does not know that MYUDF depends on the out1 file and might submit the
>> > jobs producing the out2 and out1 files at the same time.
>> >
>> > ...
>> > STORE A INTO 'out1';
>> > B = LOAD 'data2';
>> > C = FOREACH B GENERATE MYUDF($0,'out1');
>> > STORE C INTO 'out2';
>> >
>> > To make the script work (to ensure that the right execution order is
>> > enforced) add the exec statement. The exec statement will trigger the
>> > execution of the statements that produce the out1 file.
>> >
>> > ...
>> > STORE A INTO 'out1';
>> > EXEC;
>> > B = LOAD 'data2';
>> > C = FOREACH B GENERATE MYUDF($0,'out1');
>> > STORE C INTO 'out2';
>> >
>> > On Tue, Feb 16, 2010 at 12:46 AM, Mridul Muralidharan <
>> > [email protected]> wrote:
>> >>
>> >> Is this documented behavior or a current impl detail?
>> >> A lot of scripts broke when multi-query optimization was committed to
>> >> trunk because of the implicit ordering assumption (based on STORE) in
>> >> earlier pig - which was, iirc, documented.
>> >>
>> >> Regards,
>> >> Mridul
>> >>
>> >> On Thursday 11 February 2010 10:52 PM, Dmitriy Ryaboy wrote:
>> >>>
>> >>> EXEC will trigger execution of the code that precedes it.
>> >>>
>> >>> On Thu, Feb 11, 2010 at 9:12 AM, prasenjit mukherjee
>> >>> <[email protected]> wrote:
>> >>>>
>> >>>> Is there any way I can have a pig statement wait for a condition?
>> >>>> This is what I am trying to do: I am first creating and storing a
>> >>>> relation in pig, and then I want to upload that relation via the
>> >>>> STREAM/DEFINE command. Here is the pig script I am trying to write:
>> >>>>
>> >>>> .........
>> >>>> STORE r1 INTO 'myoutput.data';
>> >>>> STREAM 'myfile_containing_output_dat.txt' THROUGH `upload.py`;
>> >>>>
>> >>>> Any way I can achieve this?
>> >>>>
>> >>>> -Prasen
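Putting the answers above together with the original question: EXEC forces the statements before it to run, so the store of r1 is guaranteed to be on HDFS before the streaming upload starts. A minimal sketch, assuming upload.py reads records on stdin and copies them to s3 -- the DEFINE alias, the SHIP clause, and the log path are illustrative, not from the original script:

STORE r1 INTO 'myoutput.data';
EXEC;  -- blocks here until everything above (including the STORE) has run

DEFINE upload `upload.py` SHIP('upload.py');
r2 = LOAD 'myoutput.data';          -- STREAM takes a relation, not a file name
uploaded = STREAM r2 THROUGH upload;
STORE uploaded INTO 'upload.log';   -- forces the streaming job to execute

With the EXEC barrier in place, the separate set of pig scripts and the python poller that watches HDFS are no longer needed; the ordering is enforced inside the one script. The caveat from the thread still applies, though: everything before EXEC runs first, so the barrier cannot be restricted to a narrower block -- move any independent statements before the EXEC, or avoid such dependencies altogether.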
