I have a Pig script which does the following:
1) Loads the data into a relation
2) Splits the relation data into n sets based on some conditions
3) Runs some processing on each of these n splits and creates 2
outputs from each split.

Then, from another node, I upload the results (the 2*n relations) to
S3. Since n could be in the tens, I was hoping to use Pig's DEFINE
command to distribute the uploading of these 2*n relations. If only I
could figure out a way to wait until processing of each of the n
splits is done, I could kick off the uploading for that split.
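
Roughly, this is the shape I have in mind (the relation names, the
split conditions, and upload.py are all made up; this is just a
sketch, not my real script):

raw = LOAD 'input.data' AS (key:chararray, val:int);
SPLIT raw INTO s1 IF val < 10, s2 IF val >= 10;  -- n = 2 here

-- each split gets some processing and produces 2 outputs
s1_a = FILTER s1 BY key IS NOT NULL;
g1 = GROUP s1 BY key;
s1_b = FOREACH g1 GENERATE group, COUNT(s1);
STORE s1_a INTO 'split1/outA';
STORE s1_b INTO 'split1/outB';

-- what I would like: once split1 is stored, push it to s3
-- without waiting for the other splits to finish
DEFINE upload `upload.py` SHIP('upload.py');
pushed1 = STREAM s1_a THROUGH upload;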

I am currently trying to do the synchronization with a separate set
of Pig scripts plus some custom Python code that waits until it sees
the file in HDFS.
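
The waiting part looks roughly like this (the path, the polling
interval, and the _SUCCESS marker are assumptions on my side; the
marker is only written by newer Hadoop versions):

#!/usr/bin/env python
# Block until a path shows up in HDFS, then let the upload proceed.
import subprocess
import time

def wait_for_hdfs_path(path, poll_secs=30):
    # 'hadoop fs -test -e' exits 0 once the path exists
    while subprocess.call(['hadoop', 'fs', '-test', '-e', path]) != 0:
        time.sleep(poll_secs)

if __name__ == '__main__':
    # Waiting on a completion marker rather than the output dir
    # avoids picking up a job that has started but not finished.
    wait_for_hdfs_path('myoutput.data/_SUCCESS')
    # ...kick off the s3 upload for this split here...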

Thanks,
-Prasen

On Tue, Feb 16, 2010 at 3:00 PM, Dmitriy Ryaboy <[email protected]> wrote:
> Not that I am aware of, perhaps one of the others can chime in.
>
> What are you trying to unblock though? The only action that I see happening
> after your exec statement immediately requires out1 to have been generated.
>
> Do you also have some D that you generate independently of out1, but
> derive from B? If yes, put B = load 'data2'; D = FILTER B ... (or
> whatever) before the exec.
>
> Better yet -- avoid such dependencies :-).
>
> -D
>
> On Tue, Feb 16, 2010 at 1:21 AM, prasenjit mukherjee <
> [email protected]> wrote:
>
>> Still confusing. So the entire execution (from B = LOAD ..... onwards)
>> will be blocked till 'out1' is stored?
>>
>> > STORE A INTO 'out1';
>> > EXEC;
>> > B = LOAD 'data2';
>> > C = FOREACH B GENERATE MYUDF($0,'out1');
>> > STORE C INTO 'out2';
>>
>> Any way to restrict it to a particular block?
>>
>> -Prasen
>>
>>
>> On Tue, Feb 16, 2010 at 2:35 PM, Dmitriy Ryaboy <[email protected]>
>> wrote:
>> > It's been in the docs since 0.3
>> >
>> > http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html
>> > Implicit Dependencies
>> >
>> > If a script has dependencies on the execution order outside of what Pig
>> > knows about, execution may fail. For instance, in this script MYUDF might
>> > try to read from out1, a file that A was just stored into. However, Pig
>> > does not know that MYUDF depends on the out1 file and might submit the
>> > jobs producing the out2 and out1 files at the same time.
>> >
>> > ...
>> > STORE A INTO 'out1';
>> > B = LOAD 'data2';
>> > C = FOREACH B GENERATE MYUDF($0,'out1');
>> > STORE C INTO 'out2';
>> >
>> > To make the script work (to ensure that the right execution order is
>> > enforced) add the exec statement. The exec statement will trigger the
>> > execution of the statements that produce the out1 file.
>> >
>> > ...
>> > STORE A INTO 'out1';
>> > EXEC;
>> > B = LOAD 'data2';
>> > C = FOREACH B GENERATE MYUDF($0,'out1');
>> > STORE C INTO 'out2';
>> >
>> >
>> >
>> > On Tue, Feb 16, 2010 at 12:46 AM, Mridul Muralidharan
>> > <[email protected]> wrote:
>> >>
>> >> Is this documented behavior or a current impl detail?
>> >> A lot of scripts broke when multi-query optimization was committed to
>> >> trunk because of the implicit ordering assumption (based on STORE) in
>> >> earlier Pig - which was, iirc, documented.
>> >>
>> >>
>> >> Regards,
>> >> Mridul
>> >>
>> >> On Thursday 11 February 2010 10:52 PM, Dmitriy Ryaboy wrote:
>> >>>
>> >>> EXEC will trigger execution of the code that precedes it.
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Feb 11, 2010 at 9:12 AM, prasenjit mukherjee
>> >>> <[email protected]>  wrote:
>> >>>>
>> >>>> Is there any way I can have a pig statement wait for a condition?
>> >>>> This is what I am trying to do: I am first creating and storing a
>> >>>> relation in pig, and then I want to upload that relation via a
>> >>>> STREAM/DEFINE command. Here is the pig script I am trying to write:
>> >>>>
>> >>>> .........
>> >>>> STORE r1 INTO 'myoutput.data';
>> >>>> STREAM 'myfile_containing_output_dat.txt' THROUGH `upload.py`;
>> >>>>
>> >>>> Any way I can achieve this?
>> >>>>
>> >>>> -Prasen
>> >>>>
>> >>
>> >>
>> >
>>
>
