Richard Ding commented on PIG-978:

In the Pig Latin Manual, this is covered under "Implicit Dependencies" in the Multi-Query section:

*Implicit Dependencies*

If a script has dependencies on the execution order outside of what Pig knows 
about, execution may fail. For instance, in this script MYUDF might try to read 
from out1, a file that A was just stored into. However, Pig does not know that 
MYUDF depends on the out1 file and might submit the jobs producing the out2 and 
out1 files at the same time. To make the script work (to ensure that the right 
execution order is enforced) add the exec statement. The exec statement will 
trigger the execution of the statements that produce the out1 file.
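
The script the manual passage refers to is not reproduced in this comment. A minimal sketch of that kind of script might look as follows; out1, out2 and MYUDF come from the passage above, while 'data1', 'data2' and the relation names are hypothetical placeholders:

{code}
-- Hypothetical input; any relation stored into out1 will do.
A = LOAD 'data1';
STORE A INTO 'out1';
-- Without this EXEC, Pig may submit the out1 and out2 jobs together,
-- because it cannot see that MYUDF reads the out1 file.
EXEC;
B = LOAD 'data2';
C = FOREACH B GENERATE MYUDF($0, 'out1');
STORE C INTO 'out2';
{code}

The EXEC statement splits the script into two batches, so everything needed to produce out1 finishes before the statements after it are planned.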

The Pig script in this Jira shows another form of those "implicit dependencies" 
in multi-query scripts. Namely, the store/load operators have different file 
paths, but the load operator actually depends on the store operator. An exec 
statement should be inserted between the store and load statements to ensure 
the right execution order is enforced.
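
Applied to the script from this Jira, the change would look roughly like the sketch below. The elided statements ("...") are left exactly as they were reported; the only addition is the exec line between the two intermediate stores and the loads that read them back:

{code}
A = load '/user/viraj/firstinput' using PigStorage();
...
store C into '/user/viraj/firstinputtempresult/days1';
Atab = load '/user/viraj/secondinput' using PigStorage();
...
store Ctab into '/user/viraj/secondinputtempresult/days1';
-- Force both intermediate stores to finish before the loads below run.
exec;
E = load '/user/viraj/firstinputtempresult/' using PigStorage();
...
store G into '/user/viraj/finalresult1';
Etab = load '/user/viraj/secondinputtempresult/' using PigStorage();
...
store Gtab into '/user/viraj/finalresult2';
{code}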

> ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) 
> and ERROR 2999: (Unexpected internal error. null) when using Multi-Query 
> optimization
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: PIG-978
>                 URL: https://issues.apache.org/jira/browse/PIG-978
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.6.0
> I have a Pig script of the following form, which I execute using Multi-query 
> optimization.
> {code}
> A = load '/user/viraj/firstinput' using PigStorage();
> B = group ....
> C = ..aggregation function
> store C into '/user/viraj/firstinputtempresult/days1';
> ..
> Atab = load '/user/viraj/secondinput' using PigStorage();
> Btab = group ....
> Ctab = ..aggregation function
> store Ctab into '/user/viraj/secondinputtempresult/days1';
> ..
> E = load '/user/viraj/firstinputtempresult/' using PigStorage();
> F = group 
> G = aggregation function
> store G into '/user/viraj/finalresult1';
> Etab = load '/user/viraj/secondinputtempresult/' using PigStorage();
> Ftab = group 
> Gtab = aggregation function
> store Gtab into '/user/viraj/finalresult2';
> {code}
> 2009-07-20 22:05:44,507 [main] ERROR org.apache.pig.tools.grunt.GruntParser - 
> ERROR 2100: hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist. 
> Details at logfile: /homes/viraj/pigscripts/pig_1248127173601.log
> This error is due to a mismatch between the store and load commands. The 
> script first stores files into the 'days1' directory (store C into 
> '/user/viraj/firstinputtempresult/days1' using PigStorage();), but it later 
> loads from the top-level directory (E = load 
> '/user/viraj/firstinputtempresult/' using PigStorage()) instead of the 
> original directory (/user/viraj/firstinputtempresult/days1).
> The current multi-query optimizer cannot resolve the dependency between these 
> two commands because the store and load use different file paths. As a result, 
> the jobs run concurrently, which produces the errors.
> The solution is to add an 'exec' or 'run' command after the first two stores. 
> This forces the first two store commands to run before the remaining commands.
> It would be nice to see this fixed as part of an enhancement to Multi-query. 
> Pig should either disable Multi-query in this case or emit a warning/error 
> message, so that the user can correct the load/store statements.
> Viraj
