[ 
https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860614#action_12860614
 ] 

Ashutosh Chauhan commented on PIG-1211:
---------------------------------------

Oh, I got confused. From your earlier comment, I understood you to be saying 
that we should add a -checkscript command line option. From your previous 
comment, are you suggesting that we should add a syntax checker which will 
always run (i.e., without needing any command line directive) before the query 
starts to execute, thereby catching as many user errors as possible? I think 
this is a reasonable ask and will be useful to users. This might be the first 
step towards making the distinction between Pig compile time and run time 
explicit to the user. If we go the full length here, we might as well do what 
Milind suggested earlier (and in a recent mail thread). We can add a 
"compilation" phase which first runs a syntax checker, then generates "object 
code" (essentially the job jar) from the Pig script. This compiled object can 
then be handed over to the run-time (Hadoop cluster). Wow, Pig Latin is 
evolving towards a "true language" :)
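
To make the check-then-run idea concrete, here is a rough sketch. It is a toy 
model, not Pig code: check_script, run_script and the one-rule "grammar" are 
hypothetical names made up for illustration, and a real implementation would 
call Pig's own parser instead of a regular expression. The only point is the 
ordering: every statement is checked before any statement is executed.

```python
import re

# Toy rule standing in for a real Pig Latin parser (assumption for illustration):
# a store statement must read: store <alias> into '<path>' ...
STORE_RE = re.compile(r"^store\s+\w+\s+into\s+'[^']*'", re.IGNORECASE)


def check_script(statements):
    """Return (statement_number, message) for each statement failing the toy rule."""
    errors = []
    for number, stmt in enumerate(statements, start=1):
        text = stmt.strip()
        if text.lower().startswith("store") and not STORE_RE.match(text):
            errors.append((number, "expected: store <alias> into '<path>'"))
    return errors


def run_script(statements, execute):
    """Check the whole script first; call execute() only if every statement passes."""
    errors = check_script(statements)
    if errors:
        return errors  # nothing has been executed at this point
    for stmt in statements:
        execute(stmt)
    return []


script = [
    "rmf out1",
    "store a into 'out1' using PigStorage()",
    "rmf out2",
    "store b to 'out2' using PigStorage()",  # same mistake as in the report
]
executed = []
errors = run_script(script, executed.append)
print(errors)
print(executed)
```

With this ordering the bad store statement is reported up front and the first 
store never starts, instead of the error surfacing only after the first job 
has already run.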

> Pig script runs half way after which it reports syntax error
> ------------------------------------------------------------
>
>                 Key: PIG-1211
>                 URL: https://issues.apache.org/jira/browse/PIG-1211
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>             Fix For: 0.8.0
>
>
> I have a Pig script which is structured in the following way:
> {code}
> register cp.jar
> dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, 
> col3, col4, col5);
> filtered_dataset = filter dataset by (col1 == 1);
> proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
> rmf $output1;
> store proj_filtered_dataset into '$output1' using PigStorage();
> second_stream = foreach filtered_dataset  generate col2, col4, col5;
> group_second_stream = group second_stream by col4;
> output2 = foreach group_second_stream {
>  a =  second_stream.col2
>  b =   distinct second_stream.col5;
>  c = order b by $0;
>  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
> }
> rmf  $output2;
> --syntax error here
> store output2 to '$output2' using PigStorage();
> {code}
> I run this script using the multi-query option. It runs successfully until the 
> first store but later fails with a syntax error. 
> The use of the HDFS command "rmf" causes the first store to execute. 
> The only option that I have is to run an explain before running this script 
> grunt> explain -script myscript.pig -out explain.out
> or moving the rmf statements to the top of the script
> Here are some questions:
> a) Can we have an option to do something like "checkscript" instead of 
> explain to get the same syntax error? That way I can ensure that I do not 
> run for 3-4 hours before encountering a syntax error.
> b) Can Pig not figure out a way to re-order the rmf statements, since all the 
> store directories are variables?
> Thanks
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
