It's been in the docs since 0.3: http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html

Implicit Dependencies
If a script has dependencies on the execution order outside of what Pig knows
about, execution may fail. For instance, in this script MYUDF might try to read
from out1, a file that A was just stored into. However, Pig does not know that
MYUDF depends on the out1 file and might submit the jobs producing the out2 and
out1 files at the same time.

...
STORE A INTO 'out1';
B = LOAD 'data2';
C = FOREACH B GENERATE MYUDF($0,'out1');
STORE C INTO 'out2';

To make the script work (to ensure that the right execution order is enforced),
add the exec statement. The exec statement will trigger the execution of the
statements that produce the out1 file.

...
STORE A INTO 'out1';
EXEC;
B = LOAD 'data2';
C = FOREACH B GENERATE MYUDF($0,'out1');
STORE C INTO 'out2';

On Tue, Feb 16, 2010 at 12:46 AM, Mridul Muralidharan <[email protected]> wrote:
>
> Is this documented behavior or a current impl detail?
> A lot of scripts broke when multi-query optimization was committed to trunk
> because of the implicit ordering assumption (based on STORE) in earlier Pig
> - which was, iirc, documented.
>
>
> Regards,
> Mridul
>
> On Thursday 11 February 2010 10:52 PM, Dmitriy Ryaboy wrote:
>>
>> EXEC will trigger execution of the code that precedes it.
>>
>>
>>
>> On Thu, Feb 11, 2010 at 9:12 AM, prasenjit mukherjee
>> <[email protected]> wrote:
>>>
>>> Is there any way I can have a pig statement wait for a condition? This
>>> is what I am trying to do: I am first creating and storing a
>>> relation in pig, and then I want to upload that relation via a
>>> STREAM/DEFINE command. Here is the pig script I am trying to write:
>>>
>>> .........
>>> STORE r1 INTO 'myoutput.data';
>>> STREAM 'myfile_containing_output_dat.txt' THROUGH `upload.py`;
>>>
>>> Any way I can achieve this?
>>>
>>> -Prasen
>>>
>
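For the original STORE-then-STREAM question, a rough sketch along the same lines
(schemas and file names here are placeholders, and upload.py is assumed to read
records on stdin). Note that STREAM operates on a relation rather than a file
path, so the stored output is loaded back before being streamed:

r1 = LOAD 'data' AS (f1, f2);              -- placeholder input
STORE r1 INTO 'myoutput.data';
EXEC;                                      -- forces the STORE above to run before what follows
r2 = LOAD 'myoutput.data' AS (f1, f2);     -- read back what was just stored
r3 = STREAM r2 THROUGH `upload.py`;        -- pipe each record to upload.py
STORE r3 INTO 'upload_result';             -- something has to consume r3 to trigger the job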
