ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) 
and ERROR 2999: (Unexpected internal error. null) when using Multi-Query 
optimization
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: PIG-978
                 URL: https://issues.apache.org/jira/browse/PIG-978
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.6.0
            Reporter: Viraj Bhat
             Fix For: 0.6.0


I have a Pig script of the following form, which I run with multi-query 
optimization enabled.

{code}
A = load '/user/viraj/firstinput' using PigStorage();
B = group ....
C = ..aggregation function
store C into '/user/viraj/firstinputtempresult/days1';
..
Atab = load '/user/viraj/secondinput' using PigStorage();
Btab = group ....
Ctab = ..aggregation function
store Ctab into '/user/viraj/secondinputtempresult/days1';
..
E = load '/user/viraj/firstinputtempresult/' using PigStorage();
F = group 
G = aggregation function
store G into '/user/viraj/finalresult1';

Etab = load '/user/viraj/secondinputtempresult/' using PigStorage();
Ftab = group 
Gtab = aggregation function
store Gtab into '/user/viraj/finalresult2';
{code}


The script fails with the following error:

2009-07-20 22:05:44,507 [main] ERROR org.apache.pig.tools.grunt.GruntParser - 
ERROR 2100: hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist. 
Details at logfile: /homes/viraj/pigscripts/pig_1248127173601.log

This error is due to a mismatch between the store and load paths. The script 
first stores files into the 'days1' directory (store C into 
'/user/viraj/firstinputtempresult/days1' using PigStorage();), but it later 
loads from the top-level directory (E = load 
'/user/viraj/firstinputtempresult/' using PigStorage()) instead of the 
directory it wrote to (/user/viraj/firstinputtempresult/days1).

The current multi-query optimizer cannot detect the dependency between these 
two commands because the store path and the load path differ. As a result, the 
jobs run concurrently, which produces the errors.

The workaround is to add an 'exec' or 'run' command after the first two stores. 
This forces the first two store commands to run before the remaining commands.
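
As a rough sketch of that workaround, the script outline from above could be 
split as shown here. The placement of the 'exec' line is my assumption, and it 
relies on a bare 'exec' (no script argument) triggering execution of the 
statements queued so far, as the Pig multi-query documentation describes. The 
elided statements are left exactly as in the original outline, so this is an 
illustration of where the command goes rather than a runnable script.

{code}
A = load '/user/viraj/firstinput' using PigStorage();
B = group ....
C = ..aggregation function
store C into '/user/viraj/firstinputtempresult/days1';
..
Atab = load '/user/viraj/secondinput' using PigStorage();
Btab = group ....
Ctab = ..aggregation function
store Ctab into '/user/viraj/secondinputtempresult/days1';
..
-- Assumed placement: run everything queued so far, so that both 'days1'
-- directories exist before the loads below are planned.
exec;

E = load '/user/viraj/firstinputtempresult/' using PigStorage();
F = group 
G = aggregation function
store G into '/user/viraj/finalresult1';

Etab = load '/user/viraj/secondinputtempresult/' using PigStorage();
Ftab = group 
Gtab = aggregation function
store Gtab into '/user/viraj/finalresult2';
{code}

The two halves still benefit from multi-query optimization individually; only 
the ordering between them is made explicit.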

It would be nice to see this fixed as part of an enhancement to multi-query. 
We could either disable multi-query in this case or emit a warning/error 
message, so that the user can correct the load/store statements.

Viraj
