[jira] Created: (PIG-1660) Consider passing result of COUNT/COUNT_STAR to LIMIT

2010-09-30 Thread Viraj Bhat (JIRA)
Consider passing result of COUNT/COUNT_STAR to LIMIT 
-

 Key: PIG-1660
 URL: https://issues.apache.org/jira/browse/PIG-1660
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Viraj Bhat
 Fix For: 0.9.0


In realistic scenarios we need to split a dataset into segments by using LIMIT, 
and we would like to achieve that within the same Pig script. Here is a case:

{code}
A = load '$DATA' using PigStorage(',') as (id, pvs);
B = group A all;
C = foreach B generate COUNT_STAR(A) as row_cnt;
-- get the low 20% segment
D = order A by pvs;
E = limit D (C.row_cnt * 0.2);
store E into '$Eoutput';
-- get the high 20% segment
F = order A by pvs DESC;
G = limit F (C.row_cnt * 0.2);
store G into '$Goutput';
{code}

Since LIMIT only accepts constants, we have to split the operation into two 
steps in order to pass constants to the LIMIT statements. Please consider 
adding this feature so the processing can be more efficient.
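For reference, this is roughly the two-step workaround available today: a first 
script materializes the count, and a driver reads it back and passes a computed 
constant to the second script. The script names, '$CNTDIR' and '$LOWCNT' are 
illustrative, not from an actual deployment.

{code}
-- step 1 (count.pig): materialize the row count
A = load '$DATA' using PigStorage(',') as (id, pvs);
B = group A all;
C = foreach B generate COUNT_STAR(A) as row_cnt;
store C into '$CNTDIR';

-- step 2 (segment.pig): the driver reads the stored count, computes 20% of
-- it, and passes the result in as a constant via -param LOWCNT=<value>
A = load '$DATA' using PigStorage(',') as (id, pvs);
D = order A by pvs;
E = limit D $LOWCNT;
store E into '$Eoutput';
{code}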

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1634) Multiple names for the "group" field

2010-09-20 Thread Viraj Bhat (JIRA)
Multiple names for the "group" field


 Key: PIG-1634
 URL: https://issues.apache.org/jira/browse/PIG-1634
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0
Reporter: Viraj Bhat


I am hoping that in Pig, if I type

{quote} c = cogroup a by foo, b by bar {quote}

the fields c.group, c.foo and c.bar should all map to c.$0.

This would improve the readability of the Pig script.

Here's a real usecase:
{code}
pages = LOAD 'pages.dat'  AS (url, pagerank);

visits = LOAD 'user_log.dat'  AS (user_id, url);

page_visits = COGROUP pages BY url, visits BY url;

frequent_visits = FILTER page_visits BY COUNT(visits) >= 2;

answer = FOREACH frequent_visits  GENERATE url, FLATTEN(pages.pagerank);
{code}

(The important part is the final GENERATE statement, which references the 
field "url", the grouping field in the earlier COGROUP.) To get it to work 
today I have to write it in a less intuitive way, shown below.
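For comparison, the rewrite that works today refers to the grouping key by the 
built-in name "group" instead of the original field name:

{code}
answer = FOREACH frequent_visits GENERATE group AS url, FLATTEN(pages.pagerank);
{code}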

Maybe with the new parser changes in Pig 0.9 it would be easier to specify that.
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1633) Using an alias within Nested Foreach causes indeterminate behaviour

2010-09-20 Thread Viraj Bhat (JIRA)
Using an alias within Nested Foreach causes indeterminate behaviour


 Key: PIG-1633
 URL: https://issues.apache.org/jira/browse/PIG-1633
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0
Reporter: Viraj Bhat


I have created a RANDOMINT function which generates random integers between 0 
and a specified value. For example, RANDOMINT(4) gives random numbers between 0 
and 3 (inclusive).

{code}
$hadoop fs -cat rand.dat
f
g
h
i
j
k
l
m
{code}

The pig script is as follows:
{code}
register math.jar;
A = load 'rand.dat' using PigStorage() as (data);

B = foreach A {
r = math.RANDOMINT(4);
generate
data,
r as random,
((r == 3)?1:0) as quarter;
};

dump B;
{code}

The results are as follows:
{code}
(f,0,0)
(g,3,0)
(h,0,0)
(i,2,0)
(j,3,0)
(k,2,0)
(l,0,1)
(m,1,0)
{code}

If you observe, rows such as (j,3,0) and (l,0,1) are inconsistent: the alias r 
is re-evaluated for each reference in the GENERATE clause, so the projected 
value and the bincond see different random values.

Modifying the above script as below solves the issue. The M/R jobs from both 
scripts are the same; it is just a matter of convenience. 
{code}
register math.jar;
A = load 'rand.dat' using PigStorage() as (data);

B = foreach A generate
data,
math.RANDOMINT(4) as r;

C = foreach B generate
data,
r,
((r == 3)?1:0) as quarter;

dump C;
{code}

Is this issue related to PIG-747?
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1631) Support to 2 level nested foreach

2010-09-20 Thread Viraj Bhat (JIRA)
Support to 2 level nested foreach
-

 Key: PIG-1631
 URL: https://issues.apache.org/jira/browse/PIG-1631
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Viraj Bhat


What I would like to do is generate certain metrics for every listing 
impression in the context of a page, such as clicks on the page. So, I first 
group to get clicks and impressions together. Now, I would want to iterate 
through the mini-table (one per serve-id) and compute metrics. Since a nested 
foreach within a foreach is not supported, I ended up writing a UDF that took 
both bags and computed the metric. It would have been more elegant to keep the 
logic of iterating over the records in the Pig script itself. 

Here is some pseudocode of how I would have liked to write it:

{code}
-- Let us say in our page context there was click on rank 2 for which there 
were 3 ads 
A1 = LOAD '...' AS (page_id, rank); -- clicks. 
A2 = Load '...' AS (page_id, rank); -- impressions

B = COGROUP A1 by (page_id), A2 by (page_id); 

-- Let us say B contains the following schema 
-- (group, {(A1...)} {(A2...)})  
-- Each record in B would be:
-- page_id_1, {(page_id_1, 2)} {(page_id_1, 1) (page_id_1, 2) (page_id_1, 3)}

C = FOREACH B {
D = FLATTEN(A1), FLATTEN(A2); -- this won't work in current Pig either;
-- basically, I would like a mini-table which represents an entire serve
FOREACH D GENERATE
page_id_1,
A2::rank,
SOMEUDF(A1::rank, A2::rank); -- this UDF returns a value
-- (like v1, v2, v3 depending on A1::rank and A2::rank)
};
-- output:
-- page_id, 1, v1
-- page_id, 2, v2
-- page_id, 3, v3

DUMP C;
{code}

P.S.: I understand that I could alternatively have flattened the fields of B, 
done a GROUP on page_id, and then iterated through the records calling SOMEUDF 
appropriately, but that would be 2 map-reduce operations AFAIK; see the sketch 
below. 
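A sketch of that alternative, using the relations from above; SOMEUDF is 
assumed here to be a variant that takes the regrouped bag, and the second 
GROUP is what costs the extra M/R pass:

{code}
-- flatten the cogrouped bags into one record per (click, impression) pair
B_flat = FOREACH B GENERATE group AS page_id,
                            FLATTEN(A1.rank) AS click_rank,
                            FLATTEN(A2.rank) AS imp_rank;
-- regroup to rebuild the per-serve mini-table (the second M/R job)
B_regrp = GROUP B_flat BY page_id;
C = FOREACH B_regrp GENERATE group AS page_id, FLATTEN(SOMEUDF(B_flat));
{code}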

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1630) Support param_files to be loaded into HDFS

2010-09-20 Thread Viraj Bhat (JIRA)
Support param_files to be loaded into HDFS
--

 Key: PIG-1630
 URL: https://issues.apache.org/jira/browse/PIG-1630
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I want to place the parameters of a Pig script in a param_file. 

But instead of this file being on the local file system where I run my java 
command, I want it to be on HDFS:

{code}
$ java -cp pig.jar org.apache.pig.Main -param_file hdfs://namenode/paramfile 
myscript.pig
{code}
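
Until this is supported, the workaround is to stage the file on the local file 
system first; a sketch (paths illustrative):

{code}
$ hadoop fs -get hdfs://namenode/paramfile /tmp/paramfile
$ java -cp pig.jar org.apache.pig.Main -param_file /tmp/paramfile myscript.pig
{code}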

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag

2010-09-16 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910414#action_12910414
 ] 

Viraj Bhat commented on PIG-1615:
-

I tested this on Pig 0.8, but with a downloaded version which was a little old. 

I re-downloaded the latest source, and the issue seems to be fixed.

Viraj

> Return code from Pig is 0 even if the job fails when using -M flag
> --
>
> Key: PIG-1615
> URL: https://issues.apache.org/jira/browse/PIG-1615
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Viraj Bhat
> Fix For: 0.8.0
>
>
> I have a Pig script of this form, which I used inside a workflow system such 
> as Oozie.
> {code}
> A = load  '$INPUT' using PigStorage();
> store A into '$OUTPUT';
> {code}
> I run this with Multi-query optimization turned off:
> {quote}
> $java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
> INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
> {quote}
> The directory "/user/viraj/junk1" is not present.
> I get the following results:
> {quote}
> Input(s):
> Failed to read data from "/user/viraj/junk1"
> Output(s):
> Failed to produce result in "/user/viraj/junk2"
> {quote}
> This is expected, but the return code is still 0:
> {code}
> $ echo $?
> 0
> {code}
> If I run this script with Multi-query optimization turned on, it gives a 
> return code of 2, which is correct.
> {code}
> $ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
> INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
> ...
> $ echo $?
> 2
> {code}
> I believe the wrong return code from Pig is causing Oozie to believe that the 
> Pig script succeeded.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag

2010-09-16 Thread Viraj Bhat (JIRA)
Return code from Pig is 0 even if the job fails when using -M flag
--

 Key: PIG-1615
 URL: https://issues.apache.org/jira/browse/PIG-1615
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


I have a Pig script of this form, which I used inside a workflow system such as 
Oozie.
{code}
A = load  '$INPUT' using PigStorage();
store A into '$OUTPUT';
{code}

I run this with Multi-query optimization turned off:
{quote}
$java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
{quote}

The directory "/user/viraj/junk1" is not present.

I get the following results:
{quote}
Input(s):
Failed to read data from "/user/viraj/junk1"
Output(s):
Failed to produce result in "/user/viraj/junk2"
{quote}

This is expected, but the return code is still 0:
{code}
$ echo $?
0
{code}

If I run this script with Multi-query optimization turned on, it gives a 
return code of 2, which is correct.

{code}
$ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
...
$ echo $?
2
{code}

I believe the wrong return code from Pig is causing Oozie to believe that the 
Pig script succeeded.

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-282) Custom Partitioner

2010-09-15 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-282:
---

Release Note: 
This feature allows specifying a custom Hadoop Partitioner for the following 
operations: GROUP/COGROUP, CROSS, DISTINCT, JOIN (except 'skewed' join). The 
partitioner controls the partitioning of the keys of the intermediate 
map-outputs. See 
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html
 for more details.

To use this feature, add a PARTITION BY clause to the appropriate operator:
A = load 'input_data';
B = group A by $0 PARTITION BY 
org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
.
Here is the code for SimpleCustomPartitioner

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        if (key.getValueAsPigType() instanceof Integer) {
            int ret = ((Integer) key.getValueAsPigType()).intValue() % numPartitions;
            return ret;
        }
        else {
            return key.hashCode() % numPartitions;
        }
    }
}

  was:
This feature allows to specify Hadoop Partitioner for the following operations: 
GROUP/COGROUP, CROSS, DISTINCT, JOIN (except 'skewed'  join). Partitioner 
controls the partitioning of the keys of the intermediate map-outputs. See 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Partitioner.html
 for more details.

To use this feature you can add PARTITION BY clause to the appropriate operator:
A = load 'input_data';
B = group A by $0 PARTITION BY 
org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
.
Here is the code for SimpleCustomPartitioner

public class SimpleCustomPartitioner extends Partitioner {
 //@Override
public int getPartition(PigNullableWritable key, Writable value, int 
numPartitions) {
if(key.getValueAsPigType() instanceof Integer) {
int ret = (((Integer)key.getValueAsPigType()).intValue() % 
numPartitions);
return ret;
   }
   else {
return (key.hashCode()) % numPartitions;
}
}
}


> Custom Partitioner
> --
>
> Key: PIG-282
> URL: https://issues.apache.org/jira/browse/PIG-282
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Amir Youssefi
>Assignee: Aniket Mokashi
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, 
> CustomPartitionerTest.patch
>
>
> By adding a custom partitioner we can give control over which output partition 
> a key (/value) goes to. We can add keywords to the language, e.g. 
> PARTITION BY UDF(...)
> or a similar syntax. The UDF returns a number between 0 and n-1, where n is 
> the number of output partitions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1586) Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1586:


Description: 
I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do parameter substitution using the following shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
 -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
{code}

After substitution, the script that Pig actually receives is:

{code}
register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes $0 before passing the argument to java. 
a) Is there a workaround for this?  
b) Is this a Pig param problem?


Viraj

  was:
I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do Parameter substitutions using the following:

Using Shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
 -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
{code}

register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes the $0 before passing it to java. 
a) Is there a workaround for this?  
b) Is this is Pig param problem?


Viraj




> Parameter substitution using -param option runs into problems when substituting 
> entire pig statements in a shell script (maybe this is a bash problem)
> 
>
> Key: PIG-1586
> URL: https://issues.apache.org/jira/browse/PIG-1586
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Viraj Bhat
>
> I have a Pig script as a template:
> {code}
> register Countwords.jar;
> A = $INPUT;
> B = FOREACH A GENERATE
> examples.udf.SubString($0,0,1),
> $1 as num;
> C = GROUP B BY $0;
> D = FOREACH C GENERATE group, SUM(B.num);
> STORE D INTO $OUTPUT;
> {code}
> I attempt to do parameter substitution using the following shell script:
> {code}
> #!/bin/bash
> java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r 
> -file sub.pig \
>  -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' 
> USING PigStorage() AS (word:chararray,num:int)) by (word),(load 
> '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
> (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
>  -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
> {code}
> {code}
> register Countwords.jar;
> A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
> (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
> PigStorage() AS (word:chararray,num:int)) by (word)) generate 
> flatten(examples.udf.CountWords(runsub.sh,,)));
> B = FOREACH A GENERATE
> examples.udf.SubString($0,0,1),
> $1 as num;
> C = GROUP B BY $0;
> D = FOREACH C GENERATE group, SUM(B.num);
> STORE D INTO /user/viraj/output;
> {code}
> The shell substitutes $0 before passing the argument to java. 
> a) Is there a workaround for this?  
> b) Is this a Pig param problem?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (PIG-1586) Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Viraj Bhat (JIRA)
Parameter substitution using -param option runs into problems when substituting 
entire pig statements in a shell script (maybe this is a bash problem)


 Key: PIG-1586
 URL: https://issues.apache.org/jira/browse/PIG-1586
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Viraj Bhat


I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do parameter substitution using the following shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
 -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
{code}

After substitution, the script that Pig actually receives is:

{code}
register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes $0 before passing the argument to java. 
a) Is there a workaround for this?  
b) Is this a Pig param problem?
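
One workaround sketch: single-quote the value so bash never expands the 
positional parameters. This rests on the untested assumption that Pig's own 
preprocessor leaves $0, $1, $2 alone (its parameter names cannot start with a 
digit); the COGROUP expression is elided here for brevity.

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main \
    -r -file sub.pig \
    -param INPUT='(foreach (COGROUP ...) generate flatten(examples.udf.CountWords($0,$1,$2)))' \
    -param OUTPUT="'/user/viraj/output' USING PigStorage()"
{code}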


Viraj



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1576) Difference in Semantics between Load statement in Pig and HDFS client on Command line

2010-08-27 Thread Viraj Bhat (JIRA)
Difference in Semantics between Load statement in Pig and HDFS client on 
Command line
-

 Key: PIG-1576
 URL: https://issues.apache.org/jira/browse/PIG-1576
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0, 0.6.0
Reporter: Viraj Bhat


Here is my directory structure on HDFS which I want to access using Pig. 
This is a sample, but in real use case I have more than 100 of these 
directories.
{code}
$ hadoop fs -ls /user/viraj/recursive/
Found 3 items
drwxr-xr-x   - viraj supergroup  0 2010-08-26 11:25 
/user/viraj/recursive/20080615
drwxr-xr-x   - viraj supergroup  0 2010-08-26 11:25 
/user/viraj/recursive/20080616
drwxr-xr-x   - viraj supergroup  0 2010-08-26 11:25 
/user/viraj/recursive/20080617
{code}
Using the command line I can access them using a variety of options:
{code}
$ hadoop fs -ls /user/viraj/recursive/{200806}{15..17}/
-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080615/kv2.txt
-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080616/kv2.txt
-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080617/kv2.txt

$ hadoop fs -ls /user/viraj/recursive/{20080615..20080617}/
-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080615/kv2.txt
-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080616/kv2.txt
-rw-r--r--   1 viraj supergroup   5791 2010-08-26 11:25 
/user/viraj/recursive/20080617/kv2.txt
{code}

I have written a Pig script; neither of the following load statements works 
(one is commented out):
{code}
--A = load '/user/viraj/recursive/{200806}{15..17}/' using PigStorage('\u0001') 
as (k:int, v:chararray);
A = load '/user/viraj/recursive/{20080615..20080617}/' using 
PigStorage('\u0001') as (k:int, v:chararray);
AL = limit A 10;
dump AL;
{code}

I get the following error in Pig 0.8:
{noformat}
2010-08-27 16:34:27,704 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil 
- 1 map reduce job(s) failed!
2010-08-27 16:34:27,711 [main] INFO  org.apache.pig.tools.pigstats.PigStats - 
Script Statistics: 
HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
0.20.2  0.8.0-SNAPSHOT  viraj   2010-08-27 16:34:24 2010-08-27 16:34:27 LIMIT
Failed!
Failed Jobs:
JobId   Alias   Feature Message Outputs
N/A     A,AL    Message: 
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
create input splits for: /user/viraj/recursive/{20080615..20080617}/
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
Pattern hdfs://localhost:9000/user/viraj/recursive/{20080615..20080617} matches 
0 files
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268)
... 7 more
hdfs://localhost:9000/tmp/temp241388470/tmp987803889,
{noformat}

The following works:
{code}
A = load '/user/viraj/recursive/{200806}{15,16,17}/' using PigStorage('\u0001') 
as (k:int, v:chararray);
AL = limit A 10;
dump AL;
{code}

Why is there an inconsistency between HDFS client and Pig?
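
A likely explanation, worth verifying: in the command-line cases it is bash, 
not HDFS, that expands the {15..17} range before hadoop ever runs, whereas Pig 
hands the literal pattern to Hadoop's glob matcher, which supports {a,b,c} 
alternation but not {a..b} ranges. You can watch bash do the expansion:

{code}
$ echo /user/viraj/recursive/{200806}{15..17}/
/user/viraj/recursive/{200806}15/ /user/viraj/recursive/{200806}16/ /user/viraj/recursive/{200806}17/
{code}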

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1561) XMLLoader in Piggybank does not support bz2 or gzip compressed XML files

2010-08-23 Thread Viraj Bhat (JIRA)
XMLLoader in Piggybank does not support bz2 or gzip compressed XML files


 Key: PIG-1561
 URL: https://issues.apache.org/jira/browse/PIG-1561
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I have a simple Pig script which uses the XMLLoader after building the 
Piggybank.

{code}
register piggybank.jar;
A = load '/user/viraj/capacity-scheduler.xml.gz' using 
org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray);
B = limit A 1;
dump B;
--store B into '/user/viraj/handlegz' using PigStorage();
{code}


This returns an empty tuple:
{code}
()
{code}

If you supply the uncompressed XML file, you get:
{code}
(<property>
    <name>mapred.capacity-scheduler.queue.my.capacity</name>
    <value>10</value>
    <description>Percentage of the number of slots in the cluster that are
      guaranteed to be available for jobs in this queue.
    </description>
  </property>)
{code}
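
Until XMLLoader handles compressed input, one workaround sketch is to stage an 
uncompressed copy first ("hadoop fs -put -" reads from stdin; paths 
illustrative):

{code}
$ hadoop fs -cat /user/viraj/capacity-scheduler.xml.gz | gunzip | \
      hadoop fs -put - /user/viraj/capacity-scheduler.xml
{code}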


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1547) Piggybank MultiStorage does not scale when processing around 7k records per bucket

2010-08-17 Thread Viraj Bhat (JIRA)
Piggybank MultiStorage does not scale when processing around 7k records per 
bucket
--

 Key: PIG-1547
 URL: https://issues.apache.org/jira/browse/PIG-1547
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I am trying to use the MultiStorage Piggybank storage function:
{code}
register pig-svn/trunk/contrib/piggybank/java/piggybank.jar;
A = load '/user/viraj/largebucketinput.txt' using PigStorage('\u0001') as 
(a,b,c);
STORE A INTO '/user/viraj/multistore' USING 
org.apache.pig.piggybank.storage.MultiStorage('/user/viraj/multistore', '1', 
'none', '\u0001');
{code}
The file "largebucketinput.txt" is around 85MB in size and for each "b" we have 
512 values starting from 0-511 and each value of b or a bucket contains 7k 
records

a) On a multi-node Hadoop installation:
The above Pig script, which spawns a single map-only job, does not succeed and 
is killed by the TaskTracker for running above the memory limit:

== Message == 
TaskTree [pid=24584,tipID=attempt_201008110143_101976_m_00_0] is running 
beyond memory-limits. Current usage : 1661034496bytes. Limit : 1610612736bytes.
== Message == 

We tried increasing the number of map slots, but it did not help.

b) On a single-node Hadoop installation:
The Pig script fails with the following message in the mappers:

{noformat}
2010-08-17 16:37:24,597 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:24,597 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7687609983190239805_126509
2010-08-17 16:37:30,601 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:30,601 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2734778934507357565_126509
2010-08-17 16:37:36,606 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:36,606 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1293917224803067377_126509
2010-08-17 16:37:42,611 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:42,611 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2272713260404734116_126509
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2781)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2046)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2232)
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-2272713260404734116_126509 bad datanode[0] nodes == null
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/user/viraj/multistore/_temporary/_attempt_201005141440_0178_m_01_0/444/444-1" - Aborting...
2010-08-17 16:37:48,619 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:250)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
        at org.apache.hadoop.io.Text.readString(Text.java:400)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2837)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2762)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2046)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2232)
2010-08-17 16:37:48,622 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
{noformat}


Need to investigate more.
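
One mitigation sketch to try in the meantime, assuming the memory pressure 
comes from each task holding one open DFS writer per distinct value of "b": 
spread the buckets across reducers so each task only sees a fraction of the 
512 values. The PARALLEL value is illustrative.

{code}
A = load '/user/viraj/largebucketinput.txt' using PigStorage('\u0001') as 
(a,b,c);
B = order A by b parallel 16; -- range-partitions b, ~32 buckets per reducer
STORE B INTO '/user/viraj/multistore' USING 
org.apache.pig.piggybank.storage.MultiStorage('/user/viraj/multistore', '1', 
'none', '\u0001');
{code}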

Viraj






-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-05 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895858#action_12895858
 ] 

Viraj Bhat commented on PIG-1537:
-

Hi Olga, I have given the specific script with UDFs to Daniel to test. Thanks, 
Daniel, for your help.
The script gives correct results when the ColumnPruner optimization is not 
used, or when it is disabled using -t PruneColumns.
Viraj

> Column pruner causes wrong results when using both Custom Store UDF and 
> PigStorage
> --
>
> Key: PIG-1537
> URL: https://issues.apache.org/jira/browse/PIG-1537
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> I have a script of this pattern that uses 2 StoreFuncs:
> {code}
> register loader.jar
> register piggy-bank/java/build/storage.jar;
> %DEFAULT OUTPUTDIR /user/viraj/prunecol/
> ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
> ss_sc_filtered_0 = FILTER ss_sc_0 BY
> a#'id' matches '1.*' OR
> a#'id' matches '2.*' OR
> a#'id' matches '3.*' OR
> a#'id' matches '4.*';
> ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);
> ss_sc_filtered_1 = FILTER ss_sc_1 BY
> a#'id' matches '65.*' OR
> a#'id' matches '466.*' OR
> a#'id' matches '043.*' OR
> a#'id' matches '044.*' OR
> a#'id' matches '0650.*' OR
> a#'id' matches '001.*';
> ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;
> ss_sc_all_proj = FOREACH ss_sc_all GENERATE
> a#'query' as query,
> a#'testid' as testid,
> a#'timestamp' as timestamp,
> a,
> b,
> c;
> ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;
> ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;
> STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();
> ss_sc_all_map_count = group ss_sc_all_map all;
> count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
> record_count,COUNT($1);
> STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
> {code}
> I run this script using:
> a) java -cp pig0.7.jar org.apache.pig.Main script.pig
> b) java -cp pig0.7.jar org.apache.pig.Main -t PruneColumns script.pig
> What I observe is that the alias "count" produces the same number of records, 
> but "ss_sc_all_map" has different sizes when run with the above 2 options.
> Is this due to the fact that there are 2 StoreFuncs used?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-04 Thread Viraj Bhat (JIRA)
Column pruner causes wrong results when using both Custom Store UDF and 
PigStorage
--

 Key: PIG-1537
 URL: https://issues.apache.org/jira/browse/PIG-1537
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I have a script of this pattern that uses 2 StoreFuncs:
{code}

register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
a#'id' matches '1.*' OR
a#'id' matches '2.*' OR
a#'id' matches '3.*' OR
a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
a#'id' matches '65.*' OR
a#'id' matches '466.*' OR
a#'id' matches '043.*' OR
a#'id' matches '044.*' OR
a#'id' matches '0650.*' OR
a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
a#'query' as query,
a#'testid' as testid,
a#'timestamp' as timestamp,
a,
b,
c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
{code}

I run this script using:

a) java -cp pig0.7.jar org.apache.pig.Main script.pig
b) java -cp pig0.7.jar org.apache.pig.Main -t PruneColumns script.pig

What I observe is that the alias "count" produces the same number of records, 
but "ss_sc_all_map" has different sizes when run with the above 2 options.

Is this due to the fact that there are 2 StoreFuncs used?

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-04 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1537:


Description: 
I have a script of this pattern that uses 2 StoreFuncs:

{code}
register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
a#'id' matches '1.*' OR
a#'id' matches '2.*' OR
a#'id' matches '3.*' OR
a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
a#'id' matches '65.*' OR
a#'id' matches '466.*' OR
a#'id' matches '043.*' OR
a#'id' matches '044.*' OR
a#'id' matches '0650.*' OR
a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
a#'query' as query,
a#'testid' as testid,
a#'timestamp' as timestamp,
a,
b,
c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
{code}

I run this script using:

a) java -cp pig0.7.jar org.apache.pig.Main script.pig
b) java -cp pig0.7.jar org.apache.pig.Main -t PruneColumns script.pig

What I observe is that the alias "count" produces the same number of records, 
but "ss_sc_all_map" has different sizes when run with the above 2 options.

Is this due to the fact that there are 2 StoreFuncs used?

Viraj

  was:
I have script which is of this pattern and it uses 2 StoreFunc's:
{code}

register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
a#'id' matches '1.*' OR
a#'id' matches '2.*' OR
a#'id' matches '3.*' OR
a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
a#'id' matches '65.*' OR
a#'id' matches '466.*' OR
a#'id' matches '043.*' OR
a#'id' matches '044.*' OR
a#'id' matches '0650.*' OR
a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
a#'query' as query,
a#'testid' as testid,
a#'timestamp' as timestamp,
a,
b,
c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');


I run this script using:

a) java -cp pig0.7.jar script.pig
b) java -cp pig0.7.jar -t PruneColumns script.pig

What I observe is that the alias "count" produces the same number of records 
but "ss_sc_all_map" have different sizes when run with above 2 options.

Is due to the fact that there are 2 store func's used?

Viraj


> Column pruner causes wrong results when using both Custom Store UDF and 
> PigStorage
> --
>
> Key: PIG-1537
> URL: https://issues.apache.org/jira/browse/PIG-1537
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>
> I have a script of this pattern that uses 2 StoreFuncs:
> {code}
> register loader.jar
> register piggy-bank/java/build/storage.jar;
> %DEFAULT OUTPUTDIR /user/viraj/prunecol/
> ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
> ss_sc_filtered_0 = FILTER ss_sc_0 BY
> a#'id' matches '1.*' OR
> a#'id' matches '2.*' OR
> a#'id' matches '3.*' OR
> a#'id' matches '

[jira] Created: (PIG-1529) Equating aliases does not work (B = A)

2010-07-30 Thread Viraj Bhat (JIRA)
Equating aliases does not work (B = A)
--

 Key: PIG-1529
 URL: https://issues.apache.org/jira/browse/PIG-1529
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I wanted to do a self-join on this tab-separated input:

{code}
1   one
1   uno
2   two
2   dos
3   three
3   tres
{code}

vi...@machine~/pigscripts> pig -x local script.pig

script.pig:
-- since the below does not work (it hits ERROR 1108, see PIG-1528):
{code}
A = load 'Adataset.txt' as (key:int, value:chararray);
C = join A by key, A by key;
dump C;
{code} 

-- I tried the below; it fails with:
{code}
A = load 'Adataset.txt' as (key:int, value:chararray);
B = A;
C = join A by key, B by key;
dump C;
{code}

2010-07-30 23:19:32,789 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Currently PIG does not support assigning an 
existing relation (B) to another alias (A)
Details at logfile: /homes/viraj/pigscripts/pig_1280531249235.log

There is a workaround currently:
{code}
A = load 'Adataset.txt' as (key:int, value:chararray);
B = foreach A generate *;
C = join A by key, B by key;
dump C;
{code}

Viraj


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1528) Enable use of similar aliases when doing a join (ERROR 1108: Duplicate schema alias:)

2010-07-30 Thread Viraj Bhat (JIRA)
Enable use of similar aliases when doing a join (ERROR 1108: Duplicate schema 
alias:)
--

 Key: PIG-1528
 URL: https://issues.apache.org/jira/browse/PIG-1528
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I am doing a self join:

Input file is tab separated:
{code}
1   one
1   uno
2   two
2   dos
3   three
3   tres
{code}

vi...@machine~/pigscripts> pig -x local script.pig

{code}
A = load 'Adataset.txt' as (key:int, value:chararray);
C = join A by key, A by key;
dump C;
{code} 


2010-07-30 23:09:05,422 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1108: Duplicate schema alias: A::key in "C"
Details at logfile: /homes/viraj/pigscripts/pig_1280531249235.log
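
A workaround that works today is to materialize a second alias with renamed 
fields before the join, so the joined schema has no duplicate names:

{code}
A = load 'Adataset.txt' as (key:int, value:chararray);
B = foreach A generate key as key2, value as value2;
C = join A by key, B by key2;
dump C;
{code}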






-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1345) Link casting errors in POCast to actual line numbers in Pig script

2010-05-06 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864963#action_12864963
 ] 

Viraj Bhat commented on PIG-1345:
-

Richard, thanks for suggesting a workaround. The error message is definitely 
more verbose than the original one. 

At least this way the user can know where the cast is an issue, for example in 
some addition taking place in the script. 

This Jira was originally created as a task to correlate exactly on which line 
"int is implicitly cast to float", which I believe is hard to do in the current 
parser as we do not keep track of line numbers.

Viraj

> Link casting errors in POCast to actual line numbers in Pig script
> ---
>
> Key: PIG-1345
> URL: https://issues.apache.org/jira/browse/PIG-1345
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>
> For the purpose of easy debugging, it would be nice to find out where in the 
> Pig script my warnings are coming from. 
> The only known process is to comment out lines in the Pig script and see if 
> these warnings go away.
> 2010-01-13 21:34:13,697 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 
> 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
> 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26
> I think this may need us to keep track of the line numbers of the Pig script 
> (via our javacc parser) and maintain them in the logical and physical plans.
> It would help users in debugging simple errors/warning related to casting.
> Is this enhancement listed in http://wiki.apache.org/pig/PigJournal?
> Do we need to change the parser to something other than javacc to make this 
> task simpler?
> "Standardize on Parser and Scanner Technology"
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (PIG-1378) har url not usable in Pig scripts

2010-05-03 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat reopened PIG-1378:
-


Pradeep, 
 after rerunning with the patch applied, against the following revision:

Apache Pig version 0.8.0-dev (r940560) 
compiled May 03 2010, 12:22:35

{code}
grunt> a = load 
'har:///user/viraj/project/dev/subproject/5m/data/201003042355/0/0_1/part-0'
 using PigStorage('\u0001');
grunt> alimit = limit a 10;
grunt> dump alimit;
{code}

{noformat}
2010-05-04 02:17:22,196 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2118: Unable to create input splits for: 
har:///user/viraj/project/dev/subproject/5m/data/201003042355/0/0_1/part-0
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: No FileSystem for scheme: myhdfs
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
... 7 more
{noformat}

Is this a problem with Hadoop/Pig?

> har url not usable in Pig scripts
> -
>
> Key: PIG-1378
> URL: https://issues.apache.org/jira/browse/PIG-1378
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>Assignee: Pradeep Kamath
> Fix For: 0.8.0
>
> Attachments: PIG-1378-2.patch, PIG-1378-3.patch, PIG-1378-4.patch, 
> PIG-1378.patch
>
>
> I am trying to use har (Hadoop Archives) in my Pig script.
> I can use them through the HDFS shell
> {noformat}
> $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
> Found 1 items
> -rw---   5 viraj users1537234 2010-04-14 09:49 
> user/viraj/project/subproject/files/size/data/part-1
> {noformat}
> Using similar URL's in grunt yields
> {noformat}
> grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
> grunt> dump a;
> {noformat}
> {noformat}
> 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2998: Unhandled internal error. 
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible 
> file URI scheme: har : hdfs
> 2010-04-14 22:08:48,814 [main] WARN  org.apache.pig.tools.grunt.Grunt - There 
> is no log file to write to.
> 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
> java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
> Incompatible file URI scheme: har : hdfs
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptPars

[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-26 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861134#action_12861134
 ] 

Viraj Bhat commented on PIG-798:


Ashutosh, thanks for clarifying; we will wait till that bug is fixed in 
BinStorage.

Viraj

> Schema errors when using PigStorage and none when using BinStorage in 
> FOREACH??
> ---
>
> Key: PIG-798
> URL: https://issues.apache.org/jira/browse/PIG-798
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
>Reporter: Viraj Bhat
> Attachments: binstoragecreateop, schemaerr.pig, visits.txt
>
>
> In the following script I have a tab separated text file, which I load using 
> PigStorage() and store using BinStorage()
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
> url:chararray, time:chararray);
> B = group A by name;
> store B into '/user/viraj/binstoragecreateop' using BinStorage();
> dump B;
> {code}
> I later load file 'binstoragecreateop' in the following way.
> {code}
> A = load '/user/viraj/binstoragecreateop' using BinStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> Result
> ===
> (Amy)
> (Fred)
> ===
> The above code works properly and returns the right results. If I use 
> PigStorage() to achieve the same, I get the following error.
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> ===
> {code}
> 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
> Field Schema: name: chararray
> Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
> {code}
> ===
> So why should the semantics of BinStorage() be different from PigStorage(), 
> where it is OK not to specify a schema? Should it not be consistent across 
> both?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

2010-04-26 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861106#action_12861106
 ] 

Viraj Bhat commented on PIG-1211:
-

Ashutosh, yes: as more and more people adopt Pig, they expect certain 
guarantees, since Pig is designed to help people with no experience in writing 
M/R programs.

If I am a novice user and I have a small typo, do I wait for 3-4 hours to 
discover that there is a syntax error? I have not only wasted the CPU cycles 
but also the user's productivity.

The problem here is that dump and hadoop shell commands are treated differently 
in Pig scripts, and Multi-query optimizations are ignored.

I have listed below what Milind and Dmitry are suggesting. Maybe this is the 
way a future Pig language will compile, giving you hadoop jar files to run in 
sequence or as a DAG.

Pigcc -L myScript.pig -> parses the Pig script, generates the logical plan, 
and stores it in myScript.pig.l

Pigcc -P myScript.pig.l -> produces the physical plan from the logical plan, 
and stores it in myScript.pig.p

Pigcc -M myScript.pig.p -> produces the map-reduce plan, myScript.pig.m

Pig myScript.pig.m -> interprets the MR plan. This can be split into multiple 
sequential MR job plans too, myScript.pig.m.{1,2,3..}, so that a way to 
execute the pig script is to run:

hadoop jar pigRT.jar myScript.pig.m.1
hadoop jar pigRT.jar myScript.pig.m.2
hadoop jar pigRT.jar myScript.pig.m.3
hadoop jar pigRT.jar myScript.pig.m.4

Thanks Viraj


> Pig script runs half way after which it reports syntax error
> 
>
> Key: PIG-1211
> URL: https://issues.apache.org/jira/browse/PIG-1211
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.8.0
>
>
> I have a Pig script which is structured in the following way
> {code}
> register cp.jar
> dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, 
> col3, col4, col5);
> filtered_dataset = filter dataset by (col1 == 1);
> proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
> rmf $output1;
> store proj_filtered_dataset into '$output1' using PigStorage();
> second_stream = foreach filtered_dataset  generate col2, col4, col5;
> group_second_stream = group second_stream by col4;
> output2 = foreach group_second_stream {
>  a = second_stream.col2;
>  b = distinct second_stream.col5;
>  c = order b by $0;
>  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
> }
> rmf  $output2;
> --syntax error here
> store output2 to '$output2' using PigStorage();
> {code}
> I run this script using the Multi-query option; it runs successfully till the 
> first store but later fails with a syntax error. 
> The usage of the HDFS command "rmf" causes the first store to execute. 
> The only option that I have is to run an explain before running the script 
> grunt> explain -script myscript.pig -out explain.out
> or to move the rmf statements to the top of the script.
> Here are some questions:
> a) Can we have an option to do something like "checkscript" instead of 
> explain to get the same syntax error? In this way I can ensure that I do not 
> run for 3-4 hours before encountering a syntax error.
> b) Can Pig not figure out a way to re-order the rmf statements, since all the 
> store directories are variables?
> Thanks
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-26 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861097#action_12861097
 ] 

Viraj Bhat commented on PIG-798:


Hi Ashutosh,
 Yes, that is possible; I know that we can do that in BinStorage(), but why can 
we not do this in PigStorage()? Do I need to cast $0 as (chararray)?
{code}
A = load 'somedata' using PigStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}
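
(For the record, the explicit-cast form does appear to work with PigStorage(); 
a sketch:)

{code}
A = load 'somedata' using PigStorage();
B = foreach A generate (chararray)$0 as name;
dump B;
{code}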

But this is possible in BinStorage(), why is this not consistent?

Is it that BinStorage() has schemas embedded while PigStorage() does not? 

Should this not be fixed to make it consistent across storage formats?

Viraj

> Schema errors when using PigStorage and none when using BinStorage in 
> FOREACH??
> ---
>
> Key: PIG-798
> URL: https://issues.apache.org/jira/browse/PIG-798
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
>Reporter: Viraj Bhat
> Attachments: binstoragecreateop, schemaerr.pig, visits.txt
>
>
> In the following script I have a tab separated text file, which I load using 
> PigStorage() and store using BinStorage()
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
> url:chararray, time:chararray);
> B = group A by name;
> store B into '/user/viraj/binstoragecreateop' using BinStorage();
> dump B;
> {code}
> I later load file 'binstoragecreateop' in the following way.
> {code}
> A = load '/user/viraj/binstoragecreateop' using BinStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> Result
> ===
> (Amy)
> (Fred)
> ===
> The above code works properly and returns the right results. If I use 
> PigStorage() to achieve the same, I get the following error.
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> ===
> {code}
> 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
> Field Schema: name: chararray
> Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
> {code}
> ===
> So why should the semantics of BinStorage() be different from PigStorage(), 
> where it is OK not to specify a schema? Should it not be consistent across 
> both?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-23 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-798:
---

Affects Version/s: 0.6.0
                   0.5.0
                   0.4.0
                   0.3.0
                   0.7.0
                   0.8.0

> Schema errors when using PigStorage and none when using BinStorage in 
> FOREACH??
> ---
>
> Key: PIG-798
> URL: https://issues.apache.org/jira/browse/PIG-798
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
>Reporter: Viraj Bhat
> Attachments: binstoragecreateop, schemaerr.pig, visits.txt
>
>
> In the following script I have a tab separated text file, which I load using 
> PigStorage() and store using BinStorage()
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
> url:chararray, time:chararray);
> B = group A by name;
> store B into '/user/viraj/binstoragecreateop' using BinStorage();
> dump B;
> {code}
> I later load file 'binstoragecreateop' in the following way.
> {code}
> A = load '/user/viraj/binstoragecreateop' using BinStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> Result
> ===
> (Amy)
> (Fred)
> ===
> The above code works properly and returns the right results. If I use 
> PigStorage() to achieve the same, I get the following error.
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> ===
> {code}
> 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
> Field Schema: name: chararray
> Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
> {code}
> ===
> So why should the semantics of BinStorage() be different from those of 
> PigStorage(), where it is ok not to specify a schema? Should it not be 
> consistent across both?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-23 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860452#action_12860452
 ] 

Viraj Bhat commented on PIG-798:


Hi Ashutosh,
 The problem here is not about using the data interchangeably between 
BinStorage() and PigStorage(); it is about consistency in schema handling. 
Sorry if the description was unclear.

I can see that it is possible to write statements such as this using 
BinStorage() 

{code}
A = load 'somedata' using BinStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

but not using PigStorage().

Should we not support the following statement? As a user I am interested in 
projecting the first column and casting it to a chararray; I am not interested 
in knowing the schemas of the other columns.

It fails when I do the following:
{code}
A = load 'somedata' using PigStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

Can you tell me why the schema specification in FOREACH GENERATE works with 
BinStorage and not in PigStorage? 

Viraj

> Schema errors when using PigStorage and none when using BinStorage in 
> FOREACH??
> ---
>
> Key: PIG-798
> URL: https://issues.apache.org/jira/browse/PIG-798
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Viraj Bhat
> Attachments: binstoragecreateop, schemaerr.pig, visits.txt
>
>
> In the following script I have a tab separated text file, which I load using 
> PigStorage() and store using BinStorage()
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
> url:chararray, time:chararray);
> B = group A by name;
> store B into '/user/viraj/binstoragecreateop' using BinStorage();
> dump B;
> {code}
> I later load file 'binstoragecreateop' in the following way.
> {code}
> A = load '/user/viraj/binstoragecreateop' using BinStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> Result
> ===
> (Amy)
> (Fred)
> ===
> The above code works properly and returns the right results. If I use 
> PigStorage() to achieve the same, I get the following error.
> {code}
> A = load '/user/viraj/visits.txt' using PigStorage();
> B = foreach A generate $0 as name:chararray;
> dump B;
> {code}
> ===
> {code}
> 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
> Field Schema: name: chararray
> Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
> {code}
> ===
> So why should the semantics of BinStorage() be different from those of 
> PigStorage(), where it is ok not to specify a schema? Should it not be 
> consistent across both?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1339) International characters in column names not supported

2010-04-23 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1339:


Affects Version/s: 0.7.0
   0.8.0

> International characters in column names not supported
> --
>
> Key: PIG-1339
> URL: https://issues.apache.org/jira/browse/PIG-1339
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0, 0.7.0, 0.8.0
>Reporter: Viraj Bhat
>
> There is a particular use-case in which someone specifies a column name 
> in international characters.
> {code}
> inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
> describe inputdata;
> dump inputdata;
> {code}
> ==
> Pig Stack Trace
> ---
> ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
> Encountered: "\u3042" (12354), after : ""
> org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 
> 1, column 64.  Encountered: "\u3042" (12354), after : ""
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:391)
> ==
> Thanks Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1339) International characters in column names not supported

2010-04-23 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860445#action_12860445
 ] 

Viraj Bhat commented on PIG-1339:
-

Hi Ashutosh, this does not work in trunk. I am using the latest build:

{code}
$java -cp  ~/pig-svn/trunk/pig.jar org.apache.pig.Main -version

Apache Pig version 0.8.0-dev (r937554) 
compiled Apr 23 2010, 16:57:32

{code}

2010-04-23 17:31:41,448 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Lexical error at line 1, column 71.  Encountered: 
"\u3042" (12354), after : ""


This is a valid bug.

Viraj

> International characters in column names not supported
> --
>
> Key: PIG-1339
> URL: https://issues.apache.org/jira/browse/PIG-1339
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0, 0.7.0, 0.8.0
>Reporter: Viraj Bhat
>
> There is a particular use-case in which someone specifies a column name 
> in international characters.
> {code}
> inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
> describe inputdata;
> dump inputdata;
> {code}
> ==
> Pig Stack Trace
> ---
> ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
> Encountered: "\u3042" (12354), after : ""
> org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 
> 1, column 64.  Encountered: "\u3042" (12354), after : ""
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:391)
> ==
> Thanks Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

2010-04-23 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860419#action_12860419
 ] 

Viraj Bhat commented on PIG-1211:
-

Ashutosh, I feel that the user may not be interested in first running the 
script through explain to find a syntax error and then running it again to get 
the results. They expect Pig to report all the errors upfront, before 
submitting a M/R job.

Explain was not designed for checking syntax errors in scripts.

I believe that if you have a dump statement, explain -script will cause the 
script to run.
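
That is, today the closest thing to a syntax check is the explain workaround 
from the description:

{code}
grunt> explain -script myscript.pig -out explain.out
{code}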

Is it not possible for Pig to find out that there is an error with "store" 
syntax? 

Viraj

> Pig script runs half way after which it reports syntax error
> 
>
> Key: PIG-1211
> URL: https://issues.apache.org/jira/browse/PIG-1211
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.8.0
>
>
> I have a Pig script which is structured in the following way
> {code}
> register cp.jar
> dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, 
> col3, col4, col5);
> filtered_dataset = filter dataset by (col1 == 1);
> proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
> rmf $output1;
> store proj_filtered_dataset into '$output1' using PigStorage();
> second_stream = foreach filtered_dataset  generate col2, col4, col5;
> group_second_stream = group second_stream by col4;
> output2 = foreach group_second_stream {
>  a =  second_stream.col2
>  b =   distinct second_stream.col5;
>  c = order b by $0;
>  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
> }
> rmf  $output2;
> --syntax error here
> store output2 to '$output2' using PigStorage();
> {code}
> I run this script using the Multi-query option, it runs successfully till the 
> first store but later fails with a syntax error. 
> The usage of HDFS option, "rmf" causes the first store to execute. 
> The only option the I have is to run an explain before running his script 
> grunt> explain -script myscript.pig -out explain.out
> or moving the rmf statements to the top of the script
> Here are some questions:
> a) Can we have an option to do something like "checkscript" instead of 
> explain to get the same syntax error?  In this way I can ensure that I do not 
> run for 3-4 hours before encountering a syntax error
> b) Can pig not figure out a way to re-order the rmf statements since all the 
> store directories are variables
> Thanks
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script

2010-04-23 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860397#action_12860397
 ] 

Viraj Bhat commented on PIG-1345:
-

In which release will PIG-908 be fixed?

Is it guaranteed that fixing PIG-908 will also solve this issue?

> Link casting errors in POCast to actual lines numbers in Pig script
> ---
>
> Key: PIG-1345
> URL: https://issues.apache.org/jira/browse/PIG-1345
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>
> For the purpose of easy debugging, it would be nice to find out where in the 
> Pig script my warnings are coming from. 
> The only known workaround is to comment out lines in the Pig script and see if 
> the warnings go away.
> 2010-01-13 21:34:13,697 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 
> 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
> 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
> Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26
> I think this may need us to keep track of the line numbers of the Pig script 
> (via our javacc parser) and maintain them in the logical and physical plans.
> It would help users in debugging simple errors/warning related to casting.
> Is this enhancement listed in the  http://wiki.apache.org/pig/PigJournal?
> Do we need to change the parser to something other than javacc to make this 
> task simpler?
> "Standardize on Parser and Scanner Technology"
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1378) har url not usable in Pig scripts

2010-04-21 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859384#action_12859384
 ] 

Viraj Bhat commented on PIG-1378:
-

har:// currently works in Pig 0.7 when the hdfs location is specified.
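
Presumably that means spelling out the underlying filesystem in the URI 
authority, along these lines (a sketch; the host and port are placeholders):

{noformat}
grunt> a = load 'har://hdfs-namenode.example.com:8020/user/viraj/project/subproject/files/size/data';
grunt> dump a;
{noformat}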

> har url not usable in Pig scripts
> -
>
> Key: PIG-1378
> URL: https://issues.apache.org/jira/browse/PIG-1378
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
> Fix For: 0.8.0
>
>
> I am trying to use har (Hadoop Archives) in my Pig script.
> I can use them through the HDFS shell
> {noformat}
> $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
> Found 1 items
> -rw---   5 viraj users1537234 2010-04-14 09:49 
> user/viraj/project/subproject/files/size/data/part-1
> {noformat}
> Using similar URLs in grunt yields
> {noformat}
> grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
> grunt> dump a;
> {noformat}
> {noformat}
> 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2998: Unhandled internal error. 
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible 
> file URI scheme: har : hdfs
> 2010-04-14 22:08:48,814 [main] WARN  org.apache.pig.tools.grunt.Grunt - There 
> is no log file to write to.
> 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
> java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
> Incompatible file URI scheme: har : hdfs
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
> at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
> at org.apache.pig.Main.main(Main.java:357)
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
> Incompatible file URI scheme: har : hdfs
> at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249)
> at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472)
> ... 13 more
> {noformat}
> According to Jira http://issues.apache.org/jira/browse/PIG-1234, I tried the 
> following as stated in the original description
> {noformat}
> grunt> a = load 
> 'har://namenode-location/user/viraj/project/subproject/files/size/data'; 
> grunt> dump a;
> {noformat}
> {noformat}
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: 
> Unable to create input splits for: 
> har://namenode-location/user/viraj/project/subproject/files/size/data'; 
> ... 8 more
> Caused by: java.io.IOException: No FileSystem for scheme: namenode-location
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
> at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245)
> {noformat}
> Viraj

-- 
This message is automatically generated by JIRA.

[jira] Resolved: (PIG-829) DECLARE statement stop processing after special characters such as dot "." , "+" "%" etc..

2010-04-14 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat resolved PIG-829.


Fix Version/s: 0.7.0
   Resolution: Fixed

Pig 0.7 yields the correct result.
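
For reference, running the original script from the description in dry-run 
mode:

{code}
%DECLARE OUT foo.bar
x = LOAD 'something' as (a:chararray, b:chararray);
y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
STORE y INTO '$OUT';
{code}

now produces a substituted file that preserves the full declared value: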
{code}
x = LOAD 'something' as (a:chararray, b:chararray);
y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
STORE y INTO 'foo.bar';
{code}

> DECLARE statement stop processing after special characters such as dot "." , 
> "+" "%" etc..
> --
>
> Key: PIG-829
> URL: https://issues.apache.org/jira/browse/PIG-829
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.3.0
>Reporter: Viraj Bhat
> Fix For: 0.7.0
>
>
> The below Pig script does not work well, when special characters are used in 
> the DECLARE statement.
> {code}
> %DECLARE OUT foo.bar
> x = LOAD 'something' as (a:chararray, b:chararray);
> y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
> STORE y INTO '$OUT';
> {code}
> When the above script is run in the dry run mode; the substituted file does 
> not contain the special character.
> {code}
> java -cp pig.jar:/homes/viraj/hadoop-0.18.0-dev/conf -Dhod.server='' 
> org.apache.pig.Main -r declaresp.pig
> {code}
> Resulting file: "declaresp.pig.substituted"
> {code}
> x = LOAD 'something' as (a:chararray, b:chararray);
> y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
> STORE y INTO 'foo';
> {code}

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Resolved: (PIG-518) LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag

2010-04-14 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat resolved PIG-518.


Fix Version/s: 0.7.0
   Resolution: Fixed

> LOBinCond  exception in LogicalPlanValidationExecutor when providing default 
> values for bag
> ---
>
> Key: PIG-518
> URL: https://issues.apache.org/jira/browse/PIG-518
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Viraj Bhat
> Fix For: 0.7.0
>
> Attachments: queries.txt, sports_views.txt
>
>
> The following piece of Pig script, which provides default values for bags 
> {('','')}  when the COUNT returns 0 fails with the following error. (Note: 
> Files used in this script are enclosed on this Jira.)
> 
> a = load 'sports_views.txt' as (col1, col2, col3);
> b = load 'queries.txt' as (colb1,colb2,colb3);
> mycogroup = cogroup a by col1 inner, b by colb1;
> mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? 
> b.(colb2,colb3) : {('','')}));
> dump mynewalias;
> 
> java.io.IOException: Unable to open iterator for alias: mynewalias [Unable to 
> store for alias: mynewalias [Can't overwrite cause]]
>  at java.lang.Throwable.initCause(Throwable.java:320)
>  at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1494)
>  at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:85)
>  at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:28)
>  at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
>  at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
>  at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2345)
>  at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2252)
>  at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:121)
>  at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:40)
>  at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
>  at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
>  at 
> org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
>  at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
>  at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
>  at 
> org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:
> 79)
>  at org.apache.pig.PigServer.compileLp(PigServer.java:684)
>  at org.apache.pig.PigServer.compileLp(PigServer.java:655)
>  at org.apache.pig.PigServer.store(PigServer.java:433)
>  at org.apache.pig.PigServer.store(PigServer.java:421)
>  at org.apache.pig.PigServer.openIterator(PigServer.java:384)
>  at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269)
>  at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178)
>  at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
>  at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
>  at org.apache.pig.Main.main(Main.java:306)
> Caused by: java.io.IOException: Unable to store for alias: mynewalias [Can't 
> overwrite cause]
>  ... 26 more
> Caused by: java.lang.IllegalStateException: Can't overwrite cause
>  ... 26 more
> 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (PIG-518) LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag

2010-04-14 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857157#action_12857157
 ] 

Viraj Bhat commented on PIG-518:


The above script generates the following error in Pig 0.7:

2010-04-14 17:10:49,807 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1048: Two inputs of BinCond must have compatible schemas. left hand side: b: 
bag({colb2: bytearray,colb3: bytearray}) right hand side: 
bag({(chararray,chararray)})


Declaring the columns with the right types at load time solves the problem:

{code}
a = load 'sports_views.txt' as (col1:chararray, col2:chararray, 
col3:chararray); 
b = load 'queries.txt' as (colb1:chararray,colb2:chararray,colb3:chararray); 
mycogroup = cogroup a by col1 inner, b by colb1; 
mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? 
b.(colb2,colb3) : {('','')}));
dump mynewalias; 
{code}

(alice,lakers,3,ipod,3)
(alice,warriors,7,ipod,3)
(peter,sun,7,sun,4)
(peter,nets,7,sun,4)

Closing this bug, as Pig now yields a clear error message which the user can 
use to fix the script.



> LOBinCond  exception in LogicalPlanValidationExecutor when providing default 
> values for bag
> ---
>
> Key: PIG-518
> URL: https://issues.apache.org/jira/browse/PIG-518
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Viraj Bhat
> Attachments: queries.txt, sports_views.txt
>
>
> The following piece of Pig script, which provides default values for bags 
> {('','')}  when the COUNT returns 0 fails with the following error. (Note: 
> Files used in this script are enclosed on this Jira.)
> 
> a = load 'sports_views.txt' as (col1, col2, col3);
> b = load 'queries.txt' as (colb1,colb2,colb3);
> mycogroup = cogroup a by col1 inner, b by colb1;
> mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? 
> b.(colb2,colb3) : {('','')}));
> dump mynewalias;
> 
> java.io.IOException: Unable to open iterator for alias: mynewalias [Unable to 
> store for alias: mynewalias [Can't overwrite cause]]
>  at java.lang.Throwable.initCause(Throwable.java:320)
>  at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1494)
>  at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:85)
>  at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:28)
>  at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
>  at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
>  at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2345)
>  at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2252)
>  at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:121)
>  at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:40)
>  at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
>  at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
>  at 
> org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
>  at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
>  at 
> org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
>  at 
> org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:
> 79)
>  at org.apache.pig.PigServer.compileLp(PigServer.java:684)
>  at org.apache.pig.PigServer.compileLp(PigServer.java:655)
>  at org.apache.pig.PigServer.store(PigServer.java:433)
>  at org.apache.pig.PigServer.store(PigServer.java:421)
>  at org.apache.pig.PigServer.openIterator(PigServer.java:384)
>  at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269)
>  at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178)
>  at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
>  at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
>  at org.apache.pig.Main.main(Main.java:306)
> Caused by: java.io.IOException: Unable to store for alias: mynewalias [Can't 
> overwrite cause]
>  ... 26 more
> Caused by: java.lang.IllegalStateException: Can't overwrite cause
>  ... 26 more
> ==

[jira] Updated: (PIG-1378) har url not usable in Pig scripts

2010-04-14 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1378:


Description: 
I am trying to use har (Hadoop Archives) in my Pig script.

I can use them through the HDFS shell
{noformat}
$hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
Found 1 items
-rw---   5 viraj users1537234 2010-04-14 09:49 
user/viraj/project/subproject/files/size/data/part-1
{noformat}

Using similar URLs in grunt yields
{noformat}
grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}


{noformat}
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. 
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file 
URI scheme: har : hdfs
2010-04-14 22:08:48,814 [main] WARN  org.apache.pig.tools.grunt.Grunt - There 
is no log file to write to.
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
Incompatible file URI scheme: har : hdfs
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
Incompatible file URI scheme: har : hdfs
at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249)
at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472)
... 13 more
{noformat}

According to Jira http://issues.apache.org/jira/browse/PIG-1234, I tried the 
following as stated in the original description

{noformat}
grunt> a = load 
'har://namenode-location/user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}

{noformat}
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: 
Unable to create input splits for: 
har://namenode-location/user/viraj/project/subproject/files/size/data'; 
... 8 more
Caused by: java.io.IOException: No FileSystem for scheme: namenode-location
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245)
{noformat}

Viraj

  was:
I am trying to use har (Hadoop Archives) in my Pig script.

I can use them through the HDFS shell
{noformat}
$hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
Found 1 items
-rw---   5 viraj users1537234 2010-04-14 09:49 
user/viraj/project/subproject/files/size/data/part-1
{noformat}

Using similar URLs in grunt yields
{noformat}
grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}


{noformat}
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. 
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: I

[jira] Created: (PIG-1378) har url not usable in Pig scripts

2010-04-14 Thread Viraj Bhat (JIRA)
har url not usable in Pig scripts
-

 Key: PIG-1378
 URL: https://issues.apache.org/jira/browse/PIG-1378
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I am trying to use har (Hadoop Archives) in my Pig script.

I can use them through the HDFS shell
{noformat}
$hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
Found 1 items
-rw---   5 viraj users1537234 2010-04-14 09:49 
user/viraj/project/subproject/files/size/data/part-1
{noformat}

Using similar URLs in grunt yields
{noformat}
grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}


{noformat}
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. 
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file 
URI scheme: har : hdfs
2010-04-14 22:08:48,814 [main] WARN  org.apache.pig.tools.grunt.Grunt - There 
is no log file to write to.
2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
Incompatible file URI scheme: har : hdfs
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
Incompatible file URI scheme: har : hdfs
at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249)
at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472)
... 13 more
{noformat}

According to Jira http://issues.apache.org/jira/browse/PIG-1234, I tried the 
following as stated in the original description

{noformat}
grunt> a = load 
'har://namenode-location/user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}

{noformat}
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: 
Unable to create input splits for: 
har://namenode-location/user/viraj/project/subproject/files/size/data'; 
... 8 more
Caused by: java.io.IOException: No FileSystem for scheme: mithrilgold
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245)
{noformat}

Viraj

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (PIG-1377) Pig/Zebra fails without proper error message when the mapred.jobtracker.maxtasks.per.job exceeds threshold

2010-04-13 Thread Viraj Bhat (JIRA)
Pig/Zebra fails without proper error message when the 
mapred.jobtracker.maxtasks.per.job exceeds threshold
--

 Key: PIG-1377
 URL: https://issues.apache.org/jira/browse/PIG-1377
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0, 0.7.0
Reporter: Viraj Bhat


I have a Zebra script which generates a huge number of mappers, around 400K. 
mapred.jobtracker.maxtasks.per.job is currently set at 200K, so the job fails 
in the initialization phase, and it is very hard to find out the cause.

We need a way to report the right error message to users. Unfortunately, for 
Pig to get this error from the backend, Map Reduce Jira 
https://issues.apache.org/jira/browse/MAPREDUCE-1049 needs to be fixed first.
{code}

-- Sorted format
%set default_parallel 100;
raw = load '/user/viraj/generated/raw/zebra-sorted/20100203'
USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted')
as (id,
timestamp,
code,
ip,
host,
reference,
type,
flag,
params : map[]
);
describe raw;
user_events = filter raw by id == 'viraj';
describe user_events;
dump user_events;
sorted_events = order user_events by id, timestamp;
dump sorted_events;
store sorted_events into 'finalresult';
{code}

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (PIG-1374) Order by fails with java.lang.String cannot be cast to org.apache.pig.data.DataBag

2010-04-12 Thread Viraj Bhat (JIRA)
Order by fails with java.lang.String cannot be cast to 
org.apache.pig.data.DataBag
--

 Key: PIG-1374
 URL: https://issues.apache.org/jira/browse/PIG-1374
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0, 0.7.0
Reporter: Viraj Bhat


The script loads data using BinStorage(), flattens the columns, and then sorts 
on the second column in descending order. The order by fails with a 
ClassCastException.

{code}
register loader.jar;
a = load 'c2' using BinStorage();
b = foreach a generate org.apache.pig.CCMLoader(*);
describe b;
c = foreach b generate flatten($0);
describe c;
d = order c by $1 desc;
dump d;
{code}

The sampling job fails with the following error:
===
java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.pig.data.DataBag
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:407)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:188)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:329)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:232)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:227)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:52)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:159)
===

The schemas for b, c, and d are as follows:

b: {bag_of_tuples: {tuple: (uuid: chararray,velocity: double)}}

c: {bag_of_tuples::uuid: chararray,bag_of_tuples::velocity: double}

d: {bag_of_tuples::uuid: chararray,bag_of_tuples::velocity: double}

If we modify this script to order on the first column, it seems to work:

{code}
register loader.jar;
a = load 'c2' using BinStorage();
b = foreach a generate org.apache.pig.CCMLoader(*);
describe b;
c = foreach b generate flatten($0);
describe c;
d = order c by $0 desc;
dump d;
{code}

(gc639c60-4267-11df-9879-0800200c9a66,2.4227339503478493)
(ec639c60-4267-11df-9879-0800200c9a66,1.140175425099138)


There is a workaround: do a projection before the ORDER.

{code}
register loader.jar;
a = load 'c2' using BinStorage();
b = foreach a generate org.apache.pig.CCMLoader(*);
describe b;
c = foreach b generate flatten($0);
describe c;
newc = foreach c generate $0 as uuid, $1 as velocity;
newd = order newc by velocity desc;
dump newd;
{code}

(gc639c60-4267-11df-9879-0800200c9a66,2.4227339503478493)
(ec639c60-4267-11df-9879-0800200c9a66,1.140175425099138)


The schema for the Loader is as follows:

{code}
  public Schema outputSchema(Schema input) {
      try {
          // field schemas for the inner tuple
          List<Schema.FieldSchema> list = new ArrayList<Schema.FieldSchema>();
          list.add(new Schema.FieldSchema("uuid", DataType.CHARARRAY));
          list.add(new Schema.FieldSchema("velocity", DataType.DOUBLE));
          Schema tupleSchema = new Schema(list);
          Schema.FieldSchema tupleFs =
              new Schema.FieldSchema("tuple", tupleSchema, DataType.TUPLE);
          // wrap the tuple schema in a bag schema
          Schema bagSchema = new Schema(tupleFs);
          bagSchema.setTwoLevelAccessRequired(true);
          Schema.FieldSchema bagFs =
              new Schema.FieldSchema("bag_of_tuples", bagSchema, DataType.BAG);
          return new Schema(bagFs);
      } catch (Exception e) {
          return null;
      }
  }
{code}

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Resolved: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

2010-04-07 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat resolved PIG-756.


   Resolution: Fixed
Fix Version/s: 0.7.0

https://issues.apache.org/jira/browse/PIG-1053 fixes this issue.

> UDFs should have API for transparently opening and reading files from HDFS or 
> from local file system with only relative path
> 
>
> Key: PIG-756
> URL: https://issues.apache.org/jira/browse/PIG-756
> Project: Pig
>  Issue Type: Bug
>Reporter: David Ciemiewicz
> Fix For: 0.7.0
>
>
> I have a utility function util.INSETFROMFILE() that I pass a file name during 
> initialization.
> {code}
> define inQuerySet util.INSETFROMFILE('analysis/queries');
> A = load 'logs' using PigStorage() as ( date int, query chararray );
> B = filter A by inQuerySet(query);
> {code}
> This provides a computationally inexpensive way to effect map-side joins for 
> small sets; functions of this style also provide the ability to encapsulate 
> more complex matching rules.
> For rapid development and debugging purposes, I want this code to run without 
> modification on both my local file system when I do pig -exectype local and 
> on HDFS.
> Pig needs to provide an API for UDFs which allow them to either:
> 1) "know"  when they are in local or HDFS mode and let them open and read 
> from files as appropriate
> 2) just provide a file name and read statements and have pig transparently 
> manage local or HDFS opens and reads for the UDF
> UDFs need to read configuration information off the filesystem and it 
> simplifies the process if one can just flip the switch of -exectype local.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

2010-04-07 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854762#action_12854762
 ] 

Viraj Bhat commented on PIG-756:


In Pig 0.7 we have moved Pig's local mode onto Hadoop's local mode:
https://issues.apache.org/jira/browse/PIG-1053

Closing this issue.

> UDFs should have API for transparently opening and reading files from HDFS or 
> from local file system with only relative path
> 
>
> Key: PIG-756
> URL: https://issues.apache.org/jira/browse/PIG-756
> Project: Pig
>  Issue Type: Bug
>Reporter: David Ciemiewicz
>
> I have a utility function util.INSETFROMFILE() that I pass a file name during 
> initialization.
> {code}
> define inQuerySet util.INSETFROMFILE('analysis/queries');
> A = load 'logs' using PigStorage() as ( date int, query chararray );
> B = filter A by inQuerySet(query);
> {code}
> This provides a computationally inexpensive way to effect map-side joins for 
> small sets; functions of this style also provide the ability to encapsulate 
> more complex matching rules.
> For rapid development and debugging purposes, I want this code to run without 
> modification on both my local file system when I do pig -exectype local and 
> on HDFS.
> Pig needs to provide an API for UDFs which allow them to either:
> 1) "know"  when they are in local or HDFS mode and let them open and read 
> from files as appropriate
> 2) just provide a file name and read statements and have pig transparently 
> manage local or HDFS opens and reads for the UDF
> UDFs need to read configuration information off the filesystem and it 
> simplifies the process if one can just flip the switch of -exectype local.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script

2010-03-31 Thread Viraj Bhat (JIRA)
Link casting errors in POCast to actual lines numbers in Pig script
---

 Key: PIG-1345
 URL: https://issues.apache.org/jira/browse/PIG-1345
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat


For the purpose of easy debugging, it would be nice to find out where in the 
Pig script my warnings are coming from.

The only known workaround is to comment out lines in the Pig script and see if 
the warnings go away.

2010-01-13 21:34:13,697 [main] WARN  org.apache.pig.PigServer - Encountered 
Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 
2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26

I think this may need us to keep track of the line numbers of the Pig script 
(via our javacc parser) and maintain them in the logical and physical plans.

It would help users in debugging simple errors/warning related to casting.

Is this enhancement listed in the  http://wiki.apache.org/pig/PigJournal?

Do we need to change the parser to something other than javacc to make this 
task simpler?

"Standardize on Parser and Scanner Technology"

Viraj


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-03-30 Thread Viraj Bhat (JIRA)
pig_log file missing even though Main tells it is creating one and an M/R job 
fails 


 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat


I hit a particular case while running the latest trunk of Pig.

{code}
$java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig

[main] INFO  org.apache.pig.Main - Logging error messages to: 
/homes/viraj/pig_1263420012601.log

$ls -l pig_1263420012601.log
ls: pig_1263420012601.log: No such file or directory
{code}

The job failed and the log file was never created; the only way to debug was 
to look into the JobTracker logs.

Here are some reasons which could have caused this behavior:
1) The underlying filer/NFS had some issues. In that case, should we not 
report an error on stdout?
2) There are some errors from the backend which are not being captured.
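
As a stopgap for reason 1, it may help to point the client-side log somewhere 
other than the NFS home directory; a sketch, assuming this build supports the 
-l (client-side log path) option:

{code}
$java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main -l /tmp testcase.pig
{code}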

Viraj


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-03-30 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1341:


Component/s: impl
Summary: Cannot convert DataByeArray to Chararray and results in 
FIELD_DISCARDED_TYPE_CONVERSION_FAILED  (was: Cannot convert DataByeArray to 
Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20)

> Cannot convert DataByeArray to Chararray and results in 
> FIELD_DISCARDED_TYPE_CONVERSION_FAILED
> --
>
> Key: PIG-1341
> URL: https://issues.apache.org/jira/browse/PIG-1341
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>
> The script reads in BinStorage data and tries to convert a column which is a 
> DataByteArray to chararray.
> {code}
> raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
> --filter out null columns
> A = filter raw by col1#'bcookie' is not null;
> B = foreach A generate col1#'bcookie'  as reqcolumn;
> describe B;
> --B: {regcolumn: bytearray}
> X = limit B 5;
> dump X;
> B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
> describe B;
> --B: {convertedcol: chararray}
> X = limit B 5;
> dump X;
> {code}
> The first dump produces:
> (36co9b55onr8s)
> (36co9b55onr8s)
> (36hilul5oo1q1)
> (36hilul5oo1q1)
> (36l4cj15ooa8a)
> The second dump produces:
> ()
> ()
> ()
> ()
> ()
> It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
> time(s).
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20

2010-03-30 Thread Viraj Bhat (JIRA)
Cannot convert DataByeArray to Chararray and results in 
FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20
-

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat


The script reads in BinStorage data and tries to convert a column which is a 
DataByteArray to chararray.

{code}
raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
--filter out null columns
A = filter raw by col1#'bcookie' is not null;

B = foreach A generate col1#'bcookie'  as reqcolumn;
describe B;
--B: {regcolumn: bytearray}
X = limit B 5;
dump X;

B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
describe B;
--B: {convertedcol: chararray}
X = limit B 5;
dump X;

{code}

The first dump produces:

(36co9b55onr8s)
(36co9b55onr8s)
(36hilul5oo1q1)
(36hilul5oo1q1)
(36l4cj15ooa8a)

The second dump produces:
()
()
()
()
()

It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
time(s).
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1339) International characters in column names not supported

2010-03-30 Thread Viraj Bhat (JIRA)
International characters in column names not supported
--

 Key: PIG-1339
 URL: https://issues.apache.org/jira/browse/PIG-1339
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat


There is a particular use-case in which someone specifies a column name 
in international characters.

{code}
inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
describe inputdata;
dump inputdata;
{code}
==
Pig Stack Trace
---
ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
Encountered: "\u3042" (12354), after : ""

org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, 
column 64.  Encountered: "\u3042" (12354), after : ""

at 
org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:391)
==
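
Until the parser accepts such identifiers, one workaround sketch (assuming the 
data layout is known) is to skip the non-ASCII alias and address the column 
positionally, giving it an ASCII alias:

{code}
-- hypothetical workaround: load without a schema and project by position
inputdata = load '/user/viraj/inputdata.txt' using PigStorage();
renamed = foreach inputdata generate $0 as col_aiueo;
describe renamed;
dump renamed;
{code}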

Thanks Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1308) Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]

2010-03-18 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1308:


Description: 
A simple script fails to read files from BinStorage() and fails to submit jobs 
to the JobTracker. This occurs with trunk but not with the Pig 0.6 branch.

{code}
data = load 'binstoragesample' using BinStorage() as (s, m, l);
A = foreach data generate s#'key' as value;
X = limit A 20;
dump X;
{code}

When this script is submitted to the JobTracker, we see the following INFO messages repeat indefinitely:
2010-03-18 22:31:22,296 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:01,574 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:43,276 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:33:21,743 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:02,004 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:43,442 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:35:25,907 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:36:07,402 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:36:48,596 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:37:28,014 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:38:04,823 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:38:38,981 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:39:12,220 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2

The stack trace revealed: 

at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:144)
at 
org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
at org.apache.pig.builtin.BinStorage.getSchema(BinStorage.java:404)
at 
org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:167)
at 
org.apache.pig.impl.logicalLayer.LOLoad.getProjectionMap(LOLoad.java:263)
at 
org.apache.pig.impl.logicalLayer.ProjectionMapCalculator.visit(ProjectionMapCalculator.java:112)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
at 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at 
org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildProjectionMaps(LogicalTransformer.java:76)
at 
org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:216)
at org.apache.pig.PigServer.compileLp(PigServer.java:883)
at org.apache.pig.PigServer.store(PigServer.java:564)

The binstorage data was generated from 2 datasets using limit and union:
{code}
Large1 = load 'input1'  using PigStorage();
Large2 = load 'input2' using PigStorage();
V = limit Large1 1;
C = limit Large2 1;
U = union V, C;
store U into 'binstoragesample' using BinStorage();
{code}

  was:
A simple script fails to read files from BinStorage() and fails to submit jobs 
to the JobTracker. This occurs with trunk but not with the Pig 0.6 branch.

{code}
data = load 'binstorage' using BinStorage() as (s, m, l);
A = foreach data generate s#'key' as value;
X = limit A 20;
dump X;
{code}

When this script is submitted to the JobTracker, we see the following INFO messages repeat indefinitely:
2010-03-18 22:31:22,296 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:01,574 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:43,276 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:33:21,743 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:02,004 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:43,442 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:35:25,907 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total i

[jira] Created: (PIG-1308) Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]

2010-03-18 Thread Viraj Bhat (JIRA)
Infinite loop in JobClient when reading from BinStorage Message: 
[org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2]


 Key: PIG-1308
 URL: https://issues.apache.org/jira/browse/PIG-1308
 Project: Pig
  Issue Type: Bug
Reporter: Viraj Bhat
 Fix For: 0.7.0


A simple script fails to read files from BinStorage() and fails to submit jobs 
to the JobTracker. This occurs with trunk but not with the Pig 0.6 branch.

{code}
data = load 'binstorage' using BinStorage() as (s, m, l);
A = foreach data generate s#'key' as value;
X = limit A 20;
dump X;
{code}

When this script is submitted to the JobTracker, we see the following INFO messages repeat indefinitely:
2010-03-18 22:31:22,296 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:01,574 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:32:43,276 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:33:21,743 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:02,004 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:34:43,442 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:35:25,907 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:36:07,402 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:36:48,596 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:37:28,014 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:38:04,823 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:38:38,981 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2
2010-03-18 22:39:12,220 [main] INFO  
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to 
process : 2

The stack trace revealed: 

at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:144)
at 
org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
at org.apache.pig.builtin.BinStorage.getSchema(BinStorage.java:404)
at 
org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:167)
at 
org.apache.pig.impl.logicalLayer.LOLoad.getProjectionMap(LOLoad.java:263)
at 
org.apache.pig.impl.logicalLayer.ProjectionMapCalculator.visit(ProjectionMapCalculator.java:112)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
at 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at 
org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildProjectionMaps(LogicalTransformer.java:76)
at 
org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:216)
at org.apache.pig.PigServer.compileLp(PigServer.java:883)
at org.apache.pig.PigServer.store(PigServer.java:564)

The binstorage data was generated from 2 datasets using limit and union:
{code}
Large1 = load 'input1'  using PigStorage();
Large2 = load 'input2' using PigStorage();
V = limit Large1 1;
C = limit Large2 1;
U = union V, C;
store U into 'mobilesample' using BinStorage();
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1305) Document in Load statement syntax that Pig and the underlying M/R framework do not handle concatenated bz2 and gz files correctly

2010-03-17 Thread Viraj Bhat (JIRA)
Document in Load statement syntax that Pig and the underlying M/R framework do 
not handle concatenated bz2 and gz files correctly
--

 Key: PIG-1305
 URL: https://issues.apache.org/jira/browse/PIG-1305
 Project: Pig
  Issue Type: Bug
  Components: documentation
Reporter: Viraj Bhat
 Fix For: 0.7.0


The Pig Reference Manual needs to be updated:

Relational Operators

Syntax:

LOAD 'data' [USING function] [AS schema];

'data' 

Please note:
Pig reads both bz2 and gz formats correctly as long as they are not 
concatenated files generated in this manner: cat *.bz2 > text/concat.bz2. Your 
M/R jobs may succeed, but the results will not be accurate.

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1304) Fail underlying M/R jobs when concatenated gzip and bz2 files are provided as input

2010-03-17 Thread Viraj Bhat (JIRA)
Fail underlying M/R jobs when concatenated gzip and bz2 files are provided as 
input
---

 Key: PIG-1304
 URL: https://issues.apache.org/jira/browse/PIG-1304
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Viraj Bhat


I have the following txt files which are bzipped (\t denotes a tab):
{code}
$ bzcat A.txt.bz2 
1\ta
2\taa

$bzcat B.txt.bz2
1\tb
2\tbb

$cat *.bz2 > test/mymerge.bz2
$bzcat test/mymerge.bz2 
1\ta
2\taa
1\tb
2\tbb

$hadoop fs -put test/mymerge.bz2 /user/viraj

{code}

I now write a Pig script to print values of bz2.

{code}
A = load '/user/viraj/mymerge.bz2' using PigStorage();
dump A;
{code}

I only get the records from the first bz2 file that I concatenated:

(1,a)
(2,aa)

My M/R jobs do not fail or throw any warning about this; they silently drop 
records. Is there a way we can throw a warning or fail the underlying map job? 
Could this be done in the Bzip2TextInputFormat class in Pig?
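
Until such a check exists, a sketch of a safe alternative (assuming the 
individual compressed files are still available) is to load them separately 
and merge them in Pig, rather than concatenating the compressed bytes on disk:

{code}
-- hypothetical workaround: union the files inside Pig instead of cat-ing
-- the .bz2 files together
A1 = load '/user/viraj/A.txt.bz2' using PigStorage();
A2 = load '/user/viraj/B.txt.bz2' using PigStorage();
merged = union A1, A2;
dump merged;
{code}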

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1281) Detect org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.Tuple type of errors at Compile Time during creation of logical plan

2010-03-05 Thread Viraj Bhat (JIRA)
Detect org.apache.pig.data.DataByteArray cannot be cast to 
org.apache.pig.data.Tuple type of errors at Compile Time during creation of 
logical plan
---

 Key: PIG-1281
 URL: https://issues.apache.org/jira/browse/PIG-1281
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


This is more of an enhancement request: we could detect simple errors at 
compile time, during creation of the logical plan, rather than in the backend.

I created a script which contains an error that gets detected in the backend 
as a cast error, when in fact we can detect it in the front end (group is a 
single element, so the group.$0 projection will not work).

{code}
inputdata = LOAD '/user/viraj/mymapdata' AS (col1, col2, col3, col4);

projdata = FILTER inputdata BY (col1 is not null);

groupprojdata = GROUP projdata BY col1;

cleandata = FOREACH groupprojdata {
 bagproj = projdata.col1;
 dist_bags = DISTINCT bagproj;
 GENERATE group.$0 as newcol1, COUNT(dist_bags) as newcol2;
  };

cleandata1 = GROUP cleandata by newcol2;

cleandata2 = FOREACH cleandata1 { GENERATE group.$0 as finalcol1, 
COUNT(cleandata.newcol1) as finalcol2; };

ordereddata = ORDER cleandata2 by finalcol2;

store ordereddata into 'finalresult' using PigStorage();
{code}
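
For reference, since group is a single field in this script, the projection 
the front end could accept looks like this (a sketch of the corrected nested 
foreach, using the same aliases as above):

{code}
-- corrected sketch: project "group" directly instead of group.$0
cleandata = FOREACH groupprojdata {
    bagproj = projdata.col1;
    dist_bags = DISTINCT bagproj;
    GENERATE group as newcol1, COUNT(dist_bags) as newcol2;
};
{code}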

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1278) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableFloatWritable, recieved org.apache.pig.impl.io.NullableText

2010-03-05 Thread Viraj Bhat (JIRA)
Type mismatch in key from map: expected 
org.apache.pig.impl.io.NullableFloatWritable, recieved 
org.apache.pig.impl.io.NullableText 
---

 Key: PIG-1278
 URL: https://issues.apache.org/jira/browse/PIG-1278
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I have a script which uses map data and runs a UDF that creates random 
numbers, then orders the data by these random numbers.

{code}
REGISTER myloader.jar;
--jar produced from the source code listed below
REGISTER math.jar;

DEFINE generator math.Random();

inputdata = LOAD '/user/viraj/mymapdata'   USING MyMapLoader()AS (s:map[], 
m:map[], l:map[]);

queries = FILTER inputdata   BY m#'key'#'query' IS NOT null;

queries_rand = FOREACH queries  GENERATE generator('') AS rand_num, (CHARARRAY) 
m#'key'#'query' AS query_string;

queries_sorted = ORDER queries_rand  BY rand_num  PARALLEL 10;

queries_limit = LIMIT queries_sorted 1000;

rand_queries = FOREACH queries_limit  GENERATE query_string;

STORE rand_queries INTO 'finalresult';

{code}

UDF source for Random.java
{code}
package math;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

/*
 * Implements a random float [0,1) generator.
 */
public class Random extends EvalFunc<Float>
{
    // Qualify as java.util.Random: inside class math.Random a bare
    // "new Random()" would recursively construct this UDF class itself.
    private final java.util.Random m_rand = new java.util.Random();

    public Float exec(Tuple input) throws IOException
    {
        return new Float(m_rand.nextFloat());
    }

    public Schema outputSchema(Schema input)
    {
        final String name = getSchemaName(getClass().getName(), input);
        return new Schema(new Schema.FieldSchema(name, DataType.FLOAT));
    }
}
{code}

Running this script returns the following error in the Mapper
=
java.io.IOException: Type mismatch in key from map: expected 
org.apache.pig.impl.io.NullableFloatWritable, recieved 
org.apache.pig.impl.io.NullableText at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:845) at 
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466) 
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:109)
 at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:255)
 at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
 at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
org.apache.hadoop.mapred.Child.main(Child.java:159) 
=
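
A mitigation worth trying (a sketch, on the assumption that the mismatch comes 
from the sort key type disagreeing with what the UDF actually emits) is to pin 
the key type with an explicit cast:

{code}
-- hypothetical mitigation: cast the UDF output so the ORDER BY key type
-- is fixed to float regardless of the UDF's declared schema
queries_rand = FOREACH queries GENERATE (float)generator('') AS rand_num,
    (chararray)m#'key'#'query' AS query_string;
{code}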

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1272) Column pruner causes wrong results

2010-03-02 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840389#action_12840389
 ] 

Viraj Bhat commented on PIG-1272:
-

Now with Pig 0.7 or trunk we have the following error:

2010-03-02 23:35:09,349 FATAL org.apache.hadoop.mapred.Child: Error running 
child : java.lang.NoSuchFieldError: sJobConf
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POJoinPackage.getNext(POJoinPackage.java:110)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:380)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:363)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:240)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:409)
at org.apache.hadoop.mapred.Child.main(Child.java:159)

Viraj

> Column pruner causes wrong results
> --
>
> Key: PIG-1272
> URL: https://issues.apache.org/jira/browse/PIG-1272
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: Daniel Dai
> Fix For: 0.7.0
>
>
> For a simple script the column pruner optimization removes certain columns 
> from the original relation, which leads to wrong results.
> Input file "kv" contains the following columns (tab separated)
> {code}
> a   1
> a   2
> a   3
> b   4
> c   5
> c   6
> b   7
> d   8
> {code}
> Now running this script in Pig 0.6 produces
> {code}
> kv = load 'kv' as (k,v);
> keys= foreach kv generate k;
> keys = distinct keys; 
> keys = limit keys 2;
> rejoin = join keys by k, kv by k;
> dump rejoin;
> {code}
> (a,a)
> (a,a)
> (a,a)
> (b,b)
> (b,b)
> Running this in Pig 0.5 version without column pruner results in:
> (a,a,1)
> (a,a,2)
> (a,a,3)
> (b,b,4)
> (b,b,7)
> When we disable the "ColumnPruner" optimization, it gives the right results.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1272) Column pruner causes wrong results

2010-03-02 Thread Viraj Bhat (JIRA)
Column pruner causes wrong results
--

 Key: PIG-1272
 URL: https://issues.apache.org/jira/browse/PIG-1272
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


For a simple script the column pruner optimization removes certain columns from 
the original relation, which leads to wrong results.

Input file "kv" contains the following columns (tab separated)
{code}
a   1
a   2
a   3
b   4
c   5
c   6
b   7
d   8
{code}

Now running this script in Pig 0.6 produces

{code}
kv = load 'kv' as (k,v);
keys= foreach kv generate k;
keys = distinct keys; 
keys = limit keys 2;
rejoin = join keys by k, kv by k;
dump rejoin;
{code}

(a,a)
(a,a)
(a,a)
(b,b)
(b,b)


Running this in Pig 0.5 version without column pruner results in:
(a,a,1)
(a,a,2)
(a,a,3)
(b,b,4)
(b,b,7)

When we disable the "ColumnPruner" optimization, it gives the right results.
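
Until the pruner is fixed, a hypothetical (unverified) workaround sketch is to 
force both columns of kv through a foreach before the join, so the pruner has 
no column to drop from the joined relation:

{code}
-- hypothetical workaround sketch: materialize k and v explicitly
kv = load 'kv' as (k,v);
kv2 = foreach kv generate k, v;
keys = foreach kv generate k;
keys = distinct keys;
keys = limit keys 2;
rejoin = join keys by k, kv2 by k;
dump rejoin;
{code}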

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

2010-03-02 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840339#action_12840339
 ] 

Viraj Bhat commented on PIG-1252:
-

A modified version of the script works; does this have to do with the nested 
foreach? 

{code}
loadData = load '/user/viraj/zebradata' using 
org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
(chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : 
IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 
'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
 
dump finalData;
{code}

> Diamond splitter does not generate correct results when using Multi-query 
> optimization
> --
>
> Key: PIG-1252
> URL: https://issues.apache.org/jira/browse/PIG-1252
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: Richard Ding
> Fix For: 0.7.0
>
>
> I have a script which uses SPLIT but somehow does not use one of the split 
> branches. The skeleton of the script is as follows
> {code}
> loadData = load '/user/viraj/zebradata' using 
> org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
> col7');
> prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
> (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
> ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 
> : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
> SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
> falseDataTmp IF (validRec == '1' AND splitcond == '');
> grpData = GROUP trueDataTmp BY splitcond;
> finalData = FOREACH grpData {
>orderedData = ORDER trueDataTmp BY col1,col2;
>GENERATE FLATTEN ( MYUDF (orderedData, 60, 
> 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
>   }
> dump finalData;
> {code}
> You can see that "falseDataTmp" is untouched.
> When I run this script with the no-multiquery (-M) option I get the right 
>  result. This could be the result of complex BinConds in the POLoad. We can 
> get rid of this error by using FILTER instead of SPLIT.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1263) Script producing varying number of records when COGROUPing value of map data type with and without types

2010-02-25 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1263:


Description: 
I have a Pig script which I am experimenting with. [[Albeit this is not 
optimized and can be done in a variety of ways.]] I get different record counts 
by placing load/store pairs in the script.

Case 1: Returns 424329 records
Case 2: Returns 5859 records
Case 3: Returns 5859 records
Case 4: Returns 5578 records
I am wondering which result is correct.

Here are the scripts.
Case 1: 
{code}
register udf.jar

A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);

B = FOREACH A GENERATE
s#'key1' as key1,
s#'key2' as key2;

C = FOREACH B generate key2;

D = filter C by (key2 IS NOT null);

E = distinct D;

store E into 'unique_key_list' using PigStorage('\u0001');

F = Foreach E generate key2, MapGenerate(key2) as m;

G = FILTER F by (m IS NOT null);

H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, 
m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, 
m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;

I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 
group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, 
group.id12 as id12;

--load previous days data
K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, 
id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

L = COGROUP  K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER,
 J by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER;

M = filter L by IsEmpty(K);

store M into 'cogroupNoTypes' using PigStorage();
{code}

Case 2:  Storing and loading intermediate results in J 
{code}
register udf.jar

A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);

B = FOREACH A GENERATE
s#'key1' as key1,
s#'key2' as key2;

C = FOREACH B generate key2;

D = filter C by (key2 IS NOT null);

E = distinct D;

store E into 'unique_key_list' using PigStorage('\u0001');

F = Foreach E generate key2, MapGenerate(key2) as m;

G = FILTER F by (m IS NOT null);

H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, 
m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, 
m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;

I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 
group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, 
group.id12 as id12;

--store intermediate data to HDFS and re-read
store J into 'output/20100203/J' using PigStorage('\u0001');

--load previous days data
K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, 
id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

--read J into K1
K1 = LOAD 'output/20100203/J' using PigStorage('\u0001') as (id1, id2, id3, 
id4, id5, id6, id7, id8, id9, id10, id11, id12);

L = COGROUP  K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER,
 K1 by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER;

M = filter L by IsEmpty(K);

store M into 'cogroupNoTypesIntStore' using PigStorage();
{code}


Case 3: Types information specified but no intermediate store of J

{code}
register udf.jar

A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);

B = FOREACH A GENERATE
s#'key1' as key1,
s#'key2' as key2;

C = FOREACH B generate key2;

D = filter C by (key2 IS NOT null);

E = distinct D;

store E into 'unique_key_list' using PigStorage('\u0001');

F = Foreach E generate key2, MapGenerate(key2) as m;

G = FILTER F by (m IS NOT null);

H = foreach G generate key2, (long)m#'id1' as id1, (long)m#'id2' as id2, 
(long)m#'id3' as id3, (long)m#'id4' as id4, (long)m#'id5' as id5, (long)m#'id6' 
as id6, (long)m#'id7' as id7, (chararray)m#'id8' as id8, (chararray)m#'id9' as 
id9, (chararray)m#'id10' as id10, (chararray)m#'id11' as id11, 
(chararray)m#'id12' as id12;


I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 
group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, 
group.id12 as id12;

store J into 'output/20100203/J' using PigStorage('\u0001');

--load previous days data with type information
K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as  
(

[jira] Created: (PIG-1263) Script producing varying number of records when COGROUPing value of map data type with and without types

2010-02-25 Thread Viraj Bhat (JIRA)
Script producing varying number of records when COGROUPing value of map data 
type with and without types


 Key: PIG-1263
 URL: https://issues.apache.org/jira/browse/PIG-1263
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


I have a Pig script which I am experimenting with. [[Albeit this is not 
optimized and can be done in a variety of ways.]] I get different record counts 
by placing load/store pairs in the script.

Case 1: Returns 424329 records
Case 2: Returns 5859 records
Case 3: Returns 5859 records
Case 4: Returns 5578 records
I am wondering which result is correct.

Here are the scripts.
Case 1: 
{code}
register udf.jar

A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);

B = FOREACH A GENERATE
s#'key1' as key1,
s#'key2' as key2;

C = FOREACH B generate key2;

D = filter C by (key2 IS NOT null);

E = distinct D;

store E into 'unique_key_list' using PigStorage('\u0001');

F = Foreach E generate key2, MapGenerate(key2) as m;

G = FILTER F by (m IS NOT null);

H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, 
m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, 
m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;

I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 
group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, 
group.id12 as id12;

--load previous days data
K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, 
id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

L = COGROUP  K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER,
 J by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER;

M = filter L by IsEmpty(K);

store M into 'cogroupNoTypes' using PigStorage();
{code}

Case 2:  Storing and loading intermediate results in J 
{code}
register udf.jar

A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);

B = FOREACH A GENERATE
s#'key1' as key1,
s#'key2' as key2;

C = FOREACH B generate key2;

D = filter C by (key2 IS NOT null);

E = distinct D;

store E into 'unique_key_list' using PigStorage('\u0001');

F = Foreach E generate key2, MapGenerate(key2) as m;

G = FILTER F by (m IS NOT null);

H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, 
m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, 
m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12;

I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 
group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, 
group.id12 as id12;

--store intermediate data to HDFS and re-read
store J into 'output/20100203/J' using PigStorage('\u0001');

--load previous days data
K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, 
id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

--read J into K1
K1 = LOAD 'output/20100203/J' using PigStorage('\u0001') as (id1, id2, id3, 
id4, id5, id6, id7, id8, id9, id10, id11, id12);

L = COGROUP  K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER,
 K1 by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, 
id12) OUTER;

M = filter L by IsEmpty(K);

store M into 'cogroupNoTypesIntStore' using PigStorage();
{code}


Case 3: Types information specified but no intermediate store of J

{code}
register udf.jar

A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l);

B = FOREACH A GENERATE
s#'key1' as key1,
s#'key2' as key2;

C = FOREACH B generate key2;

D = filter C by (key2 IS NOT null);

E = distinct D;

store E into 'unique_key_list' using PigStorage('\u0001');

F = Foreach E generate key2, MapGenerate(key2) as m;

G = FILTER F by (m IS NOT null);

H = foreach G generate key2, (long)m#'id1' as id1, (long)m#'id2' as id2, 
(long)m#'id3' as id3, (long)m#'id4' as id4, (long)m#'id5' as id5, (long)m#'id6' 
as id6, (long)m#'id7' as id7, (chararray)m#'id8' as id8, (chararray)m#'id9' as 
id9, (chararray)m#'id10' as id10, (chararray)m#'id11' as id11, 
(chararray)m#'id12' as id12;


I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12);

J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, 
group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, 

[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

2010-02-22 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1252:


Description: 
I have a script which uses SPLIT but somehow does not use one of the split 
branches. The skeleton of the script is as follows

{code}

loadData = load '/user/viraj/zebradata' using 
org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
(chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : 
IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
   orderedData = ORDER trueDataTmp BY col1,col2;
   GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 
'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
  }

dump finalData;

{code}


You can see that "falseDataTmp" is untouched.

When I run this script with the no-multiquery (-M) option I get the right 
result. This could be the result of complex BinConds in the POLoad. We can get 
rid of this error by using FILTER instead of SPLIT.

Viraj

  was:
I have a script which uses SPLIT but somehow does not use one of the split 
branches. The skeleton of the script is as follows

{code}

loadData = load '/user/viraj/zebradata' using 
org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
col7, col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
(chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : 
IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
   orderedData = ORDER trueDataTmp BY col1,col2;
   GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 
'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
  }

dump finalData;

{code}


You can see that "falseDataTmp" is untouched.

When I run this script with the no-multiquery (-M) option I get the right 
result. This could be the result of complex BinConds in the POLoad. We can get 
rid of this error by using FILTER instead of SPLIT.

Viraj


> Diamond splitter does not generate correct results when using Multi-query 
> optimization
> --
>
> Key: PIG-1252
> URL: https://issues.apache.org/jira/browse/PIG-1252
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.7.0
>
>
> I have a script which uses SPLIT but somehow does not use one of the split 
> branches. The skeleton of the script is as follows
> {code}
> loadData = load '/user/viraj/zebradata' using 
> org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
> col7');
> prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
> (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
> ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 
> : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
> SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
> falseDataTmp IF (validRec == '1' AND splitcond == '');
> grpData = GROUP trueDataTmp BY splitcond;
> finalData = FOREACH grpData {
>orderedData = ORDER trueDataTmp BY col1,col2;
>GENERATE FLATTEN ( MYUDF (orderedData, 60, 
> 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
>   }
> dump finalData;
> {code}
> You can see that "falseDataTmp" is untouched.
> When I run this script with the no-multiquery (-M) option I get the right 
>  result. This could be the result of complex BinConds in the POLoad. We can 
> get rid of this error by using FILTER instead of SPLIT.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

2010-02-22 Thread Viraj Bhat (JIRA)
Diamond splitter does not generate correct results when using Multi-query 
optimization
--

 Key: PIG-1252
 URL: https://issues.apache.org/jira/browse/PIG-1252
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I have a script which uses SPLIT but somehow does not use one of the split 
branches. The skeleton of the script is as follows

{code}

loadData = load '/user/viraj/zebradata' using 
org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, 
col7, col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, 
(chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : 
((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : 
IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), 
falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
   orderedData = ORDER trueDataTmp BY col1,col2;
   GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 
'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
  }

dump finalData;

{code}


You can see that "falseDataTmp" is untouched.

When I run this script with the no-multiquery (-M) option I get the right 
result. This could be the result of complex BinConds in the POLoad. We can get 
rid of this error by using FILTER instead of SPLIT.
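
The FILTER rewrite mentioned above would look roughly like this (a sketch, 
reusing the aliases from the script in this issue):

{code}
-- sketch of the FILTER-based workaround for the SPLIT
trueDataTmp  = FILTER prjData BY (validRec == '1' AND splitcond != '');
falseDataTmp = FILTER prjData BY (validRec == '1' AND splitcond == '');
grpData = GROUP trueDataTmp BY splitcond;
{code}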

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1247) Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error

2010-02-19 Thread Viraj Bhat (JIRA)
Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error
-

 Key: PIG-1247
 URL: https://issues.apache.org/jira/browse/PIG-1247
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I have a large script in which there are intermediate store statements; one of 
them writes to a directory I do not have permission to write to. 

The stack trace I get from Pig is this:

2010-02-20 02:16:32,055 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error

Details at logfile: /home/viraj/pig_1266632145355.log

Pig Stack Trace
---

ERROR 2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error
java.lang.ClassCastException: 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3583)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1407)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:949)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:762)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1036)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:986)
at org.apache.pig.PigServer.registerQuery(PigServer.java:386)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:386)


The only way to find the error was to look at the javacc-generated 
QueryParser.java code and add a System.out.println().


Here is a script to reproduce the problem:

{code}
A = load '/user/viraj/three.txt' using PigStorage();
B = foreach A generate ['a'#'12'] as b:map[] ;
store B into '/user/secure/pigtest' using PigStorage();
{code}

"three.txt" has 3 lines which contain nothing but the number "1".

{code}
$ hadoop fs -ls /user/secure/

ls: could not get get listing for 'hdfs://mynamenode/user/secure' : 
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=viraj, access=READ_EXECUTE, inode="secure":secure:users:rwx------

{code}
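
As a stop-gap, a pre-flight check from the grunt shell can surface the 
permission problem before the script is parsed and run (a sketch; grunt's fs 
command delegates to the Hadoop shell, and the paths are the ones from this 
issue):

{code}
-- hypothetical pre-flight sketch: confirm the store target is accessible
fs -ls /user/secure
fs -mkdir /user/secure/pigtest
{code}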


Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1243) Passing Complex map types to and from streaming causes a problem

2010-02-18 Thread Viraj Bhat (JIRA)
Passing Complex map types to and from streaming causes a problem


 Key: PIG-1243
 URL: https://issues.apache.org/jira/browse/PIG-1243
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I have a program which generates different types of map fields and stores them 
using PigStorage.
{code}
A = load '/user/viraj/three.txt' using PigStorage();

B = foreach A generate ['a'#'12'] as b:map[], ['b'#['c'#'12']] as c, 
['c'#{(['d'#'15']),(['e'#'16'])}] as d;

store B into '/user/viraj/pigtest' using PigStorage();
{code}

Now I test the previous output in the script below to make sure I have the 
right results. I also pass this data to a Perl script, and I observe that the 
complex map types I have generated are lost when I get the result back.

{code}
DEFINE CMD `simple.pl` SHIP('simple.pl');

A = load '/user/viraj/pigtest' using PigStorage() as (simpleFields, mapFields, 
mapListFields);

B = foreach A generate $0, $1, $2;

dump B;

C = foreach A generate  (chararray)simpleFields#'a' as value, $0,$1,$2;

D = stream C through CMD as (a0:map[], a1:map[], a2:map[]);

dump D;
{code}


dumping B results in:

([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}])
([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}])
([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}])

dumping D results in:

([a#12],,)
([a#12],,)
([a#12],,)

The Perl script used here is:
{code}
#!/usr/local/bin/perl

use strict;
use warnings;

# Drop the first column and echo the remaining three, tab-separated.
# split leaves the trailing newline on $l, preserving record boundaries.
while (<>) {
    my ($bc, $s, $m, $l) = split /\t/;
    print "$s\t$m\t$l";
}
{code}

Is there an issue with the handling of complex map fields within streaming? How 
can I fix this to obtain the right result?
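
One way to narrow this down (a debugging sketch, assuming the goal is to see 
exactly what comes back from the Perl script) is to declare the stream output 
as plain chararrays first, so Pig applies no map parsing to it:

{code}
-- hypothetical debugging sketch: inspect the raw streaming output before
-- asking Pig to interpret the columns as maps
D_raw = stream C through CMD as (a0:chararray, a1:chararray, a2:chararray);
dump D_raw;
{code}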

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (PIG-1194) ERROR 2055: Received Error while processing the map plan

2010-02-10 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat reopened PIG-1194:
-


Hi Richard,
 I ran the script attached to the ticket and found that the map tasks fail 
with the following error:

org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received 
Error while processing the map plan. at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:281)
 at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
 at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at 
org.apache.hadoop.mapred.Child.main(Child.java:159) 

I am using the latest pig.jar without hadoop.
Viraj

> ERROR 2055: Received Error while processing the map plan
> 
>
> Key: PIG-1194
> URL: https://issues.apache.org/jira/browse/PIG-1194
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.5.0, 0.6.0
>Reporter: Viraj Bhat
>Assignee: Richard Ding
> Fix For: 0.7.0
>
> Attachments: inputdata.txt, PIG-1194.patch, PIG-1194.patch
>
>
> I have a simple Pig script which takes 3 columns out of which one is null. 
> {code}
> input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3);
> a = GROUP input BY (((double) col3)/((double) col2) > .001 OR col1 < 11 ? 
> col1 : -1);
> b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, 
> SUM(input.col3) as  col3;
> store b into 'finalresult';
> {code}
> When I run this script I get the following error:
> ERROR 2055: Received Error while processing the map plan.
> org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received 
> Error while processing the map plan.
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> 
> A more useful error message for the purpose of debugging would be helpful.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-08 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831251#action_12831251
 ] 

Viraj Bhat commented on PIG-1131:
-

Ashutosh, I was able to recreate a similar problem using the trunk: 

java -cp pig-withouthadoop.jar org.apache.pig.Main -version


Apache Pig version 0.7.0-dev (r907874) 

compiled Feb 08 2010, 17:35:04

Viraj

> Pig simple join does not work when it contains empty lines
> --
>
> Key: PIG-1131
> URL: https://issues.apache.org/jira/browse/PIG-1131
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: junk1.txt, junk2.txt, simplejoinscript.pig
>
>
> I have a simple script, which does a JOIN.
> {code}
> input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
> describe input1;
> input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
> describe input2;
> joineddata = JOIN input1 by $0, input2 by $0;
> describe joineddata;
> store joineddata into 'result';
> {code}
> The input data contains empty lines.  
> The join fails in the Map phase with the following error in the 
> PRLocalRearrange.java
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>   at java.util.ArrayList.get(ArrayList.java:322)
>   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>   at org.apache.hadoop.mapred.Child.main(Child.java:159)
> I am surprised that the test cases did not detect this error. Could we add 
> this data which contains empty lines to the testcases?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines

2010-02-08 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831248#action_12831248
 ] 

Viraj Bhat commented on PIG-1131:
-

Olga, I marked it as critical since we claim that Pig can eat any type of 
data, yet the example script shows that we need data with a fixed schema just 
to perform a simple join.

Viraj

> Pig simple join does not work when it contains empty lines
> --
>
> Key: PIG-1131
> URL: https://issues.apache.org/jira/browse/PIG-1131
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: junk1.txt, junk2.txt, simplejoinscript.pig
>
>
> I have a simple script, which does a JOIN.
> {code}
> input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
> describe input1;
> input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
> describe input2;
> joineddata = JOIN input1 by $0, input2 by $0;
> describe joineddata;
> store joineddata into 'result';
> {code}
> The input data contains empty lines.  
> The join fails in the Map phase with the following error in the 
> PRLocalRearrange.java
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>   at java.util.ArrayList.get(ArrayList.java:322)
>   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>   at org.apache.hadoop.mapred.Child.main(Child.java:159)
> I am surprised that the test cases did not detect this error. Could we add 
> this data which contains empty lines to the testcases?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1220) Document unknown keywords as missing or to do in future

2010-02-03 Thread Viraj Bhat (JIRA)
Document unknown keywords as missing or to do in future
---

 Key: PIG-1220
 URL: https://issues.apache.org/jira/browse/PIG-1220
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


To get help at the grunt shell I do the following:

grunt>touchz

2010-02-04 00:59:28,714 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Encountered " <IDENTIFIER> "touchz "" at line 1, 
column 1.
Was expecting one of:
 
"cat" ...
"fs" ...
"cd" ...
"cp" ...
"copyFromLocal" ...
"copyToLocal" ...
"dump" ...
"describe" ...
"aliases" ...
"explain" ...
"help" ...
"kill" ...
"ls" ...
"mv" ...
"mkdir" ...
"pwd" ...
"quit" ...
"register" ...
"rm" ...
"rmf" ...
"set" ...
"illustrate" ...
"run" ...
"exec" ...
"scriptDone" ...
"" ...
 ...
";" ...

I looked at the code and found that we do nothing for "scriptDone". Is there 
some future value to that command?

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1211) Pig script runs halfway, after which it reports a syntax error

2010-01-28 Thread Viraj Bhat (JIRA)
Pig script runs halfway, after which it reports a syntax error


 Key: PIG-1211
 URL: https://issues.apache.org/jira/browse/PIG-1211
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


I have a Pig script which is structured in the following way:

{code}
register cp.jar

dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, 
col3, col4, col5);

filtered_dataset = filter dataset by (col1 == 1);

proj_filtered_dataset = foreach filtered_dataset generate col2, col3;

rmf $output1;

store proj_filtered_dataset into '$output1' using PigStorage();

second_stream = foreach filtered_dataset  generate col2, col4, col5;

group_second_stream = group second_stream by col4;

output2 = foreach group_second_stream {
 a = second_stream.col2;
 b =   distinct second_stream.col5;
 c = order b by $0;
 generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
}

rmf  $output2;

--syntax error here
store output2 to '$output2' using PigStorage();

{code}

When I run this script using the multi-query option, it runs successfully till 
the first store but later fails with a syntax error.

The use of the HDFS command "rmf" causes the first store to execute. 

The only options I have are to run an explain before running this script:

grunt> explain -script myscript.pig -out explain.out

or to move the rmf statements to the top of the script.

Here are some questions:

a) Can we have an option to do something like "checkscript" instead of explain 
to get the same syntax error? That way I can ensure that I do not run for 3-4 
hours before encountering a syntax error.
b) Can Pig not figure out a way to re-order the rmf statements, since all the 
store directories are variables?
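
The workaround of moving the rmf statements to the top would look like this (a 
sketch; only the statement order changes):

{code}
-- sketch: hoist the rmf statements so they run before any store
rmf $output1;
rmf $output2;
-- ... rest of the script unchanged, ending with the two store statements
{code}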

Thanks
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-531) Way for explain to show 1 plan at a time

2010-01-27 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-531:
---

Fix Version/s: 0.5.0

Hi Olga,
 I think we have a way to handle it in multi-query optimization. Is it 
reasonable to close this as fixed?

I see the following in the Multi-query document about explain:

http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

explain [-out <path>] [-brief] [-dot] [-param <param_name>=<param_value>]* 
[-param_file <file_name>]* [-script <pig_script>] [<alias>]

Viraj

> Way for explain to show 1 plan at a time
> 
>
> Key: PIG-531
> URL: https://issues.apache.org/jira/browse/PIG-531
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Fix For: 0.5.0
>
>
> Several users complained that EXPLAIN output is too verbose and is hard to 
> make sense of.
> One way to improve the situation is to realize that EXPLAIN actually 
> contains several plans: logical, physical, backend specific. So we can update 
> EXPLAIN to allow showing a particular plan. For instance
> EXPLAIN LOGICAL A;
> would show only logical plan.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-940) Cross site HDFS access using the fs.default.name not possible in Pig

2010-01-27 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-940:
---

Affects Version/s: (was: 0.3.0)
   0.5.0
Fix Version/s: 0.7.0

> Cross site HDFS access using the fs.default.name not possible in Pig
> 
>
> Key: PIG-940
> URL: https://issues.apache.org/jira/browse/PIG-940
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.5.0
> Environment: Hadoop 20
>Reporter: Viraj Bhat
> Fix For: 0.7.0
>
>
> I have a script which does the following: access data from a remote HDFS 
> location (an HDFS instance at hdfs://remotemachine1.company.com/), as I 
> do not want to copy this huge amount of data between HDFS locations.
> However, I want my Pig script to write data to the HDFS running on 
> localmachine.company.com.
> Currently Pig does not support that behavior and complains that: 
> "hdfs://localmachine.company.com/user/viraj/A1.txt does not exist"
> {code}
> A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b); 
> B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d); 
> C = JOIN A by a, B by c; 
> store C into 'output' using PigStorage();  
> {code}
> ===
> 2009-09-01 00:37:24,032 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to hadoop file system at: hdfs://localmachine.company.com:8020
> 2009-09-01 00:37:24,277 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to map-reduce job tracker at: localmachine.company.com:50300
> 2009-09-01 00:37:24,567 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer
>  - Rewrite: POPackage->POForEach to POJoinPackage
> 2009-09-01 00:37:24,573 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size before optimization: 1
> 2009-09-01 00:37:24,573 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size after optimization: 1
> 2009-09-01 00:37:26,197 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>  - Setting up single store job
> 2009-09-01 00:37:26,249 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
> Use GenericOptionsParser for parsing the arguments. Applications should 
> implement Tool for the same.
> 2009-09-01 00:37:26,746 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 0% complete
> 2009-09-01 00:37:26,746 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 100% complete
> 2009-09-01 00:37:26,747 [main] ERROR 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 1 map reduce job(s) failed!
> 2009-09-01 00:37:26,756 [main] ERROR 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Failed to produce result in: 
> "hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480"
> 2009-09-01 00:37:26,756 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Failed!
> 2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
> Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log
> ===
> The error file in Pig contains:
> ===
> ERROR 2998: Unhandled internal error. 
> org.apache.pig.backend.executionengine.ExecException: ERROR 2100: 
> hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
> at 
> org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126)
> at 
> org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
> at 
> org.apache.pig.impl.io.ValidatingInputFileSpec.<init>(ValidatingInputFileSpec.java:44)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>   

[jira] Updated: (PIG-1174) Creation of output path should be done by storage function

2010-01-27 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1174:


Fix Version/s: 0.7.0

> Creation of output path should be done by storage function
> --
>
> Key: PIG-1174
> URL: https://issues.apache.org/jira/browse/PIG-1174
> Project: Pig
>  Issue Type: Bug
>Reporter: Bill Graham
> Fix For: 0.7.0
>
>
> When executing a STORE command, Pig creates the output location before the 
> storage function gets called. This causes problems with storage functions 
> that have logic to determine the output location. See this thread:
> http://www.mail-archive.com/pig-user%40hadoop.apache.org/msg01538.html
> For example, when making a request like this:
> STORE A INTO '/my/home/output' USING MultiStorage('/my/home/output','0', 
> 'none', '\t');
> Pig creates a file '/my/home/output' and then an exception is thrown when 
> MultiStorage tries to make a directory under '/my/home/output'. The 
> workaround is to instead specify a dummy location as the first path like so:
> STORE A INTO '/my/home/output/temp' USING MultiStorage('/my/home/output','0', 
> 'none', '\t');
> Two changes should be made:
> 1. The path specified in the INTO clause should be available to the storage 
> function so it doesn't need to be duplicated.
> 2. The creation of the output paths should be delegated to the storage 
> function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1194) ERROR 2055: Received Error while processing the map plan

2010-01-15 Thread Viraj Bhat (JIRA)
ERROR 2055: Received Error while processing the map plan


 Key: PIG-1194
 URL: https://issues.apache.org/jira/browse/PIG-1194
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.5.0, 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0
 Attachments: inputdata.txt

I have a simple Pig script which takes 3 columns, of which one can be null. 
{code}

input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3);
a = GROUP input BY (((double) col3)/((double) col2) > .001 OR col1 < 11 ? col1 
: -1);
b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, SUM(input.col3) 
as  col3;
store b into 'finalresult';

{code}


When I run this script I get the following error:

ERROR 2055: Received Error while processing the map plan.

org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received 
Error while processing the map plan.

at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277)

at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)

at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)

at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)

at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)



A more useful error message for the purpose of debugging would be helpful.

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1194) ERROR 2055: Received Error while processing the map plan

2010-01-15 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1194:


Attachment: inputdata.txt

Test data to run with this script.

> ERROR 2055: Received Error while processing the map plan
> 
>
> Key: PIG-1194
> URL: https://issues.apache.org/jira/browse/PIG-1194
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.5.0, 0.6.0
>Reporter: Viraj Bhat
>Assignee: Richard Ding
> Fix For: 0.6.0
>
> Attachments: inputdata.txt
>
>
> I have a simple Pig script which takes 3 columns, of which one can be null. 
> {code}
> input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3);
> a = GROUP input BY (((double) col3)/((double) col2) > .001 OR col1 < 11 ? 
> col1 : -1);
> b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, 
> SUM(input.col3) as  col3;
> store b into 'finalresult';
> {code}
> When I run this script I get the following error:
> ERROR 2055: Received Error while processing the map plan.
> org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received 
> Error while processing the map plan.
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> 
> A more useful error message for the purpose of debugging would be helpful.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1187) UTF-8 (international code) breaks with loader when load with schema is specified

2010-01-14 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800315#action_12800315
 ] 

Viraj Bhat commented on PIG-1187:
-

Hi Jeff,
 This is specific to the data we are using; it looks like the parser fails when 
it tries to interpret some characters. We have tested this with 
Chinese characters and it works.
Viraj

> UTF-8 (international code) breaks with loader when load with schema is 
> specified
> 
>
> Key: PIG-1187
> URL: https://issues.apache.org/jira/browse/PIG-1187
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.6.0
>
>
> I have a set of Pig statements which dump an international dataset.
> {code}
> INPUT_OBJECT = load 'internationalcode';
> describe INPUT_OBJECT;
> dump INPUT_OBJECT;
> {code}
> Sample output
> (756a6196-ebcd-4789-ad2f-175e5df65d55,{(labelAaÂâÀ),(labelあいうえお1),(labelஜார்க2),(labeladfadf)})
> It works and dumps results but when I use a schema for loading it fails.
> {code}
> INPUT_OBJECT = load 'internationalcode' AS (object_id:chararray, labels: bag 
> {T: tuple(label:chararray)});
> describe INPUT_OBJECT;
> {code}
> The error message is as follows: 2010-01-14 02:23:27,320 FATAL 
> org.apache.hadoop.mapred.Child: Error running child : 
> org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop 
> caused by repeated empty string matches at line 1, column 21.
>   at 
> org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:620)
>   at 
> org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:569)
>   at 
> org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:651)
>   at 
> org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:152)
>   at 
> org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:100)
>   at 
> org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:382)
>   at 
> org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42)
>   at 
> org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:68)
>   at 
> org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:76)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:845)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>   at org.apache.hadoop.mapred.Child.main(Child.java:159)
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1187) UTF-8 (international code) breaks with loader when load with schema is specified

2010-01-13 Thread Viraj Bhat (JIRA)
UTF-8 (international code) breaks with loader when load with schema is specified


 Key: PIG-1187
 URL: https://issues.apache.org/jira/browse/PIG-1187
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


I have a set of Pig statements which dump an international dataset.
{code}
INPUT_OBJECT = load 'internationalcode';
describe INPUT_OBJECT;
dump INPUT_OBJECT;
{code}

Sample output

(756a6196-ebcd-4789-ad2f-175e5df65d55,{(labelAaÂâÀ),(labelあいうえお1),(labelஜார்க2),(labeladfadf)})

It works and dumps results but when I use a schema for loading it fails.

{code}
INPUT_OBJECT = load 'internationalcode' AS (object_id:chararray, labels: bag 
{T: tuple(label:chararray)});
describe INPUT_OBJECT;
{code}

The error message is as follows:

2010-01-14 02:23:27,320 FATAL 
org.apache.hadoop.mapred.Child: Error running child : 
org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop 
caused by repeated empty string matches at line 1, column 21.
at 
org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:620)
at 
org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:569)
at 
org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:651)
at 
org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:152)
at 
org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:100)
at 
org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:382)
at 
org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42)
at 
org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:68)
at 
org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:76)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:845)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)
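
A minimal workaround sketch, assuming it is the bag cast that trips the 
parser: load the data untyped (which works, per the dump above) and cast only 
the scalar field, leaving the bag column alone.

{code}
INPUT_OBJECT = load 'internationalcode';
PROJECTED = foreach INPUT_OBJECT generate (chararray) $0 as object_id, $1 as labels;
describe PROJECTED;
{code}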

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1157) Successive replicated joins do not generate Map Reduce plan and fail due to OOM

2009-12-17 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792061#action_12792061
 ] 

Viraj Bhat commented on PIG-1157:
-

Hi Richard,

 Thanks for your suggestion, it works. Additionally, we could also use the 
"exec" statement before the alias E to prevent the implicit dependency, as 
sketched below.

How hard or easy is it for Pig to find out whether there is an implicit 
dependency? Pig anyway has a copy of the logical plan in memory, where it 
knows that alias E requires output from D, which is generated in the previous 
step.

Can we not warn the user about this implicit dependency? 

Viraj





> Successive replicated joins do not generate Map Reduce plan and fail due to 
> OOM
> ---
>
> Key: PIG-1157
> URL: https://issues.apache.org/jira/browse/PIG-1157
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: Richard Ding
> Fix For: 0.6.0
>
> Attachments: oomreplicatedjoin.pig, replicatedjoinexplain.log
>
>
> Hi all,
>  I have a script which does 2 replicated joins in succession. Please note 
> that the inputs do not exist on the HDFS.
> {code}
> A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
> A1 = FOREACH A GENERATE a;
> B = GROUP A1 BY a;
> C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
> D = JOIN C BY x, B BY group USING "replicated";
> E = JOIN A BY a, D by x USING "replicated";
> dump E;
> {code}
> 2009-12-16 19:12:00,253 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size before optimization: 4
> 2009-12-16 19:12:00,254 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - Merged 1 map-only splittees.
> 2009-12-16 19:12:00,254 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - Merged 1 map-reduce splittees.
> 2009-12-16 19:12:00,254 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - Merged 2 out of total 2 splittees.
> 2009-12-16 19:12:00,254 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size after optimization: 2
> 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2998: Unhandled internal error. unable to create new native thread
> Details at logfile: pig_1260990666148.log
> Looking at the log file:
> Pig Stack Trace
> ---
> ERROR 2998: Unhandled internal error. unable to create new native thread
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:597)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
> at org.apache.pig.PigServer.store(PigServer.java:522)
> at org.apache.pig.PigServer.openIterator(PigServer.java:458)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> 
> If we look at the explain output, we find that no Map Reduce 
> plan is generated. 
>  Why is the M/R plan not generated?
> Attaching the script and explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1157) Successive replicated joins do not generate Map Reduce plan and fail due to OOM

2009-12-16 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1157:


Attachment: oomreplicatedjoin.pig
replicatedjoinexplain.log

Explain output and Pig script.

> Successive replicated joins do not generate Map Reduce plan and fail due to 
> OOM
> ---
>
> Key: PIG-1157
> URL: https://issues.apache.org/jira/browse/PIG-1157
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
> Fix For: 0.6.0
>
> Attachments: oomreplicatedjoin.pig, replicatedjoinexplain.log
>
>
> Hi all,
>  I have a script which does 2 replicated joins in succession. Please note 
> that the inputs do not exist on the HDFS.
> {code}
> A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
> A1 = FOREACH A GENERATE a;
> B = GROUP A1 BY a;
> C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
> D = JOIN C BY x, B BY group USING "replicated";
> E = JOIN A BY a, D by x USING "replicated";
> dump E;
> {code}
> 2009-12-16 19:12:00,253 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size before optimization: 4
> 2009-12-16 19:12:00,254 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - Merged 1 map-only splittees.
> 2009-12-16 19:12:00,254 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - Merged 1 map-reduce splittees.
> 2009-12-16 19:12:00,254 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - Merged 2 out of total 2 splittees.
> 2009-12-16 19:12:00,254 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size after optimization: 2
> 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2998: Unhandled internal error. unable to create new native thread
> Details at logfile: pig_1260990666148.log
> Looking at the log file:
> Pig Stack Trace
> ---
> ERROR 2998: Unhandled internal error. unable to create new native thread
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:597)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
> at org.apache.pig.PigServer.store(PigServer.java:522)
> at org.apache.pig.PigServer.openIterator(PigServer.java:458)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> 
> If we look at the explain output, we find that no Map Reduce 
> plan is generated. 
>  Why is the M/R plan not generated?
> Attaching the script and explain output.
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1157) Successive replicated joins do not generate Map Reduce plan and fail due to OOM

2009-12-16 Thread Viraj Bhat (JIRA)
Successive replicated joins do not generate Map Reduce plan and fail due to OOM
---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


Hi all,
 I have a script which does 2 replicated joins in succession. Please note that 
the inputs do not exist on the HDFS.

{code}
A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
A1 = FOREACH A GENERATE a;
B = GROUP A1 BY a;
C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
D = JOIN C BY x, B BY group USING "replicated";
E = JOIN A BY a, D by x USING "replicated";
dump E;
{code}

2009-12-16 19:12:00,253 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size before optimization: 4
2009-12-16 19:12:00,254 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - Merged 1 map-only splittees.
2009-12-16 19:12:00,254 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - Merged 1 map-reduce splittees.
2009-12-16 19:12:00,254 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - Merged 2 out of total 2 splittees.
2009-12-16 19:12:00,254 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size after optimization: 2
2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. unable to create new native thread
Details at logfile: pig_1260990666148.log

Looking at the log file:

Pig Stack Trace
---
ERROR 2998: Unhandled internal error. unable to create new native thread

java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
at org.apache.pig.PigServer.store(PigServer.java:522)
at org.apache.pig.PigServer.openIterator(PigServer.java:458)
at 
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:397)


If we look at the explain output, we find that no Map Reduce plan is 
generated. 

 Why is the M/R plan not generated?


Attaching the script and explain output.
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-09 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788481#action_12788481
 ] 

Viraj Bhat commented on PIG-1144:
-

Hi Daniel,
 Thanks again for your input. This is more of a performance issue that users 
do not detect until they see that the single-reducer job has failed in the 
sort phase. They assume that the default_parallel keyword will do the trick.
Viraj

> set default_parallelism construct does not set the number of reducers 
> correctly
> ---
>
> Key: PIG-1144
> URL: https://issues.apache.org/jira/browse/PIG-1144
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
> Environment: Hadoop 20 cluster with multi-node installation
>Reporter: Viraj Bhat
>Assignee: Daniel Dai
> Fix For: 0.7.0
>
> Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
> PIG-1144-1.patch
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set 
> construct: "set default_parallel 100" . I modified the "MRPrinter.java" to 
> printout the parallelism
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
>     ...
> }
> {code}
> When I run an explain on the script, I see that the last job, which does the 
> actual sort, runs as a single-reducer job. This can be corrected by adding 
> the PARALLEL clause to the ORDER BY.
> Attaching the script and the explain output
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-09 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788439#action_12788439
 ] 

Viraj Bhat commented on PIG-1144:
-

Hi Daniel,
One more thing to note is that the Last Sort M/R job has a parallelism of 1. 
Should it not be -1?
Viraj

> set default_parallelism construct does not set the number of reducers 
> correctly
> ---
>
> Key: PIG-1144
> URL: https://issues.apache.org/jira/browse/PIG-1144
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
> Environment: Hadoop 20 cluster with multi-node installation
>Reporter: Viraj Bhat
> Fix For: 0.7.0
>
> Attachments: brokenparallel.out, genericscript_broken_parallel.pig
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set 
> construct: "set default_parallel 100" . I modified the "MRPrinter.java" to 
> printout the parallelism
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
>     ...
> }
> {code}
> When I run an explain on the script, I see that the last job, which does the 
> actual sort, runs as a single-reducer job. This can be corrected by adding 
> the PARALLEL clause to the ORDER BY.
> Attaching the script and the explain output
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-09 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788436#action_12788436
 ] 

Viraj Bhat commented on PIG-1144:
-

This happens on the real cluster, where the sorting job did not complete 
because of a single reducer. 

> set default_parallelism construct does not set the number of reducers 
> correctly
> ---
>
> Key: PIG-1144
> URL: https://issues.apache.org/jira/browse/PIG-1144
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
> Environment: Hadoop 20 cluster with multi-node installation
>Reporter: Viraj Bhat
> Fix For: 0.7.0
>
> Attachments: brokenparallel.out, genericscript_broken_parallel.pig
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set 
> construct: "set default_parallel 100" . I modified the "MRPrinter.java" to 
> printout the parallelism
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
>     ...
> }
> {code}
> When I run an explain on the script, I see that the last job, which does the 
> actual sort, runs as a single-reducer job. This can be corrected by adding 
> the PARALLEL clause to the ORDER BY.
> Attaching the script and the explain output
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-09 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1144:


Attachment: brokenparallel.out
genericscript_broken_parallel.pig

Script and explain output

> set default_parallelism construct does not set the number of reducers 
> correctly
> ---
>
> Key: PIG-1144
> URL: https://issues.apache.org/jira/browse/PIG-1144
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
> Environment: Hadoop 20 cluster with multi-node installation
>Reporter: Viraj Bhat
> Fix For: 0.7.0
>
> Attachments: brokenparallel.out, genericscript_broken_parallel.pig
>
>
> Hi all,
>  I have a Pig script where I set the parallelism using the following set 
> construct: "set default_parallel 100" . I modified the "MRPrinter.java" to 
> printout the parallelism
> {code}
> ...
> public void visitMROp(MapReduceOper mr) {
>     mStream.println("MapReduce node " + mr.getOperatorKey().toString()
>             + " Parallelism " + mr.getRequestedParallelism());
>     ...
> }
> {code}
> When I run an explain on the script, I see that the last job, which does the 
> actual sort, runs as a single-reducer job. This can be corrected by adding 
> the PARALLEL clause to the ORDER BY.
> Attaching the script and the explain output
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-09 Thread Viraj Bhat (JIRA)
set default_parallelism construct does not set the number of reducers correctly
---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
 Fix For: 0.7.0


Hi all,
 I have a Pig script where I set the parallelism using the following set 
construct: "set default_parallel 100" . I modified the "MRPrinter.java" to 
printout the parallelism
{code}
...
public void visitMROp(MapReduceOper mr) {
    mStream.println("MapReduce node " + mr.getOperatorKey().toString()
            + " Parallelism " + mr.getRequestedParallelism());
    ...
}
{code}

When I run an explain on the script, I see that the last job, which does the 
actual sort, runs as a single-reducer job. This can be corrected by adding the 
PARALLEL clause to the ORDER BY, as sketched below.
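
A minimal sketch of that workaround (alias and file names are illustrative):

{code}
A = load 'data' as (x, y);
B = order A by x PARALLEL 100;  -- explicit reduce parallelism for the sort job
store B into 'sorted';
{code}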

Attaching the script and the explain output

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines

2009-12-09 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788387#action_12788387
 ] 

Viraj Bhat commented on PIG-1131:
-

Hi Pradeep,
 So the workaround for this is for the user to specify the schema of the 
largest tuple, i.e. the one which contains the maximum number of 
fields/columns? (A sketch follows below.)
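
A minimal sketch of that workaround, assuming junk1.txt carries the wider 
(two-column) records:

{code}
input1 = load '/user/viraj/junk1.txt' using PigStorage(' ') as (k1:chararray, v1:chararray);
input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001') as (k2:chararray);
joineddata = JOIN input1 by k1, input2 by k2;
store joineddata into 'result';
{code}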
Viraj

> Pig simple join does not work when it contains empty lines
> --
>
> Key: PIG-1131
> URL: https://issues.apache.org/jira/browse/PIG-1131
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>Priority: Critical
> Fix For: 0.7.0
>
> Attachments: junk1.txt, junk2.txt, simplejoinscript.pig
>
>
> I have a simple script, which does a JOIN.
> {code}
> input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
> describe input1;
> input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
> describe input2;
> joineddata = JOIN input1 by $0, input2 by $0;
> describe joineddata;
> store joineddata into 'result';
> {code}
> The input data contains empty lines.  
> The join fails in the Map phase with the following error in 
> POLocalRearrange.java:
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>   at java.util.ArrayList.get(ArrayList.java:322)
>   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>   at org.apache.hadoop.mapred.Child.main(Child.java:159)
> I am surprised that the test cases did not detect this error. Could we add 
> this data, which contains empty lines, to the test cases?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines

2009-12-07 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1131:


Attachment: simplejoinscript.pig
junk2.txt
junk1.txt

Dummy datasets and pig script

> Pig simple join does not work when it contains empty lines
> --
>
> Key: PIG-1131
> URL: https://issues.apache.org/jira/browse/PIG-1131
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>Priority: Critical
> Fix For: 0.7.0
>
> Attachments: junk1.txt, junk2.txt, simplejoinscript.pig
>
>
> I have a simple script, which does a JOIN.
> {code}
> input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
> describe input1;
> input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
> describe input2;
> joineddata = JOIN input1 by $0, input2 by $0;
> describe joineddata;
> store joineddata into 'result';
> {code}
> The input data contains empty lines.  
> The join fails in the Map phase with the following error in 
> POLocalRearrange.java:
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>   at java.util.ArrayList.get(ArrayList.java:322)
>   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>   at org.apache.hadoop.mapred.Child.main(Child.java:159)
> I am surprised that the test cases did not detect this error. Could we add 
> this data, which contains empty lines, to the test cases?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1131) Pig simple join does not work when it contains empty lines

2009-12-07 Thread Viraj Bhat (JIRA)
Pig simple join does not work when it contains empty lines
--

 Key: PIG-1131
 URL: https://issues.apache.org/jira/browse/PIG-1131
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.7.0


I have a simple script, which does a JOIN.

{code}
input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
describe input1;

input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
describe input2;

joineddata = JOIN input1 by $0, input2 by $0;

describe joineddata;

store joineddata into 'result';
{code}

The input data contains empty lines.  

The join fails in the Map phase with the following error in 
POLocalRearrange.java:

java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)

I am surprised that the test cases did not detect this error. Could we add this 
data, which contains empty lines, to the test cases?

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1124) Unable to set Custom Job Name using the -Dmapred.job.name parameter

2009-12-03 Thread Viraj Bhat (JIRA)
Unable to set Custom Job Name using the -Dmapred.job.name parameter
---

 Key: PIG-1124
 URL: https://issues.apache.org/jira/browse/PIG-1124
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Priority: Minor
 Fix For: 0.6.0


As a Hadoop user I want to control the job name for my analysis via the 
command line, using the following construct:

java -cp pig.jar:$HADOOP_HOME/conf -Dmapred.job.name=hadoop_junkie 
org.apache.pig.Main broken.pig

-Dmapred.job.name should normally set my Hadoop Job name, but somehow during 
the formation of the job.xml in Pig this information is lost and the job name 
turns out to be:

"PigLatin:broken.pig"

The current workaround seems to be wiring it into the script itself, using the 
following (or using parameter substitution):

set job.name 'my job'
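
A sketch of the parameter-substitution variant (the parameter name "jobname" 
is illustrative):

{code}
set job.name '$jobname'
{code}

invoked as:

java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main -param jobname=hadoop_junkie broken.pig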

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1123) Popularize usage of default_parallel keyword in Cookbook and Latin Manual

2009-12-03 Thread Viraj Bhat (JIRA)
Popularize usage of default_parallel keyword in Cookbook and Latin Manual


 Key: PIG-1123
 URL: https://issues.apache.org/jira/browse/PIG-1123
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


In the Pig 0.5 release we have the option of setting the default reduce 
parallelism for a script using the following construct:

set default_parallel 100

Unfortunately I do not see this documented in the Reference Manual, in the 
"SET" section:

http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html

or in the Cookbook: 

http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html

"Use PARALLEL Keyword" section.


Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1101) Pig parser does not recognize its own data type in LIMIT statement

2009-11-20 Thread Viraj Bhat (JIRA)
Pig parser does not recognize its own data type in LIMIT statement
--

 Key: PIG-1101
 URL: https://issues.apache.org/jira/browse/PIG-1101
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Priority: Minor
 Fix For: 0.6.0


I have a Pig script in which I specify the number of records to limit as a long 
type. 

{code}
A = LOAD '/user/viraj/echo.txt' AS (txt:chararray);

B = LIMIT A 10L;

DUMP B;
{code}

I get a parser error:

2009-11-21 02:25:51,100 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Encountered "<LONGINTEGER> "10L "" at line 3, 
column 13.
Was expecting:
<INTEGER> ...
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.generateParseException(QueryParser.java:8963)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_consume_token(QueryParser.java:8839)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.LimitClause(QueryParser.java:1656)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1280)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:893)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:682)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)

In fact, 10L seems to work in the foreach generate construct, as in the sketch 
below.
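
A minimal illustration (reusing the load from the failing script):

{code}
A = LOAD '/user/viraj/echo.txt' AS (txt:chararray);
B = FOREACH A GENERATE txt, 10L AS num;  -- the same long literal is accepted here
DUMP B;
{code}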

Viraj



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1084) Pig Cookbook documentation "Take Advantage of Join Optimization" additions: Merge and Skewed Join

2009-11-10 Thread Viraj Bhat (JIRA)
Pig Cookbook documentation "Take Advantage of Join Optimization" 
additions: Merge and Skewed Join


 Key: PIG-1084
 URL: https://issues.apache.org/jira/browse/PIG-1084
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


Hi all,
 We have a host of Join optimizations that have been implemented recently in 
Pig to improve performance. These include:

http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html#JOIN

1) Merge Join
2) Skewed Join

It would be nice to mention the Merge Join and the Skewed Join in the following 
section of the Pig Cookbook:

http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Take+Advantage+of+Join+Optimization

Can we update this for release 0.6?

Thanks
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1081) PigCookBook use of PARALLEL keyword

2009-11-10 Thread Viraj Bhat (JIRA)
PigCookBook use of PARALLEL keyword
---

 Key: PIG-1081
 URL: https://issues.apache.org/jira/browse/PIG-1081
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.5.0
Reporter: Viraj Bhat
 Fix For: 0.5.0


Hi all,
 I am looking at some tips for optimizing Pig programs (Pig Cookbook) using the 
PARALLEL keyword.

http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Use+PARALLEL+Keyword 
We know that currently Pig 0.5 uses Hadoop 20 (as its default) which launches 1 
reducer for all cases. 

In this documentation we state that the number of reducers should be roughly 
<number of machines> * <reduce slots per machine> * 0.9. This advice was valid 
for HoD (Hadoop on Demand), where you are creating your own Hadoop clusters. 
But if you are using either the Capacity Scheduler 
http://hadoop.apache.org/common/docs/current/capacity_scheduler.html or the 
Fair Share Scheduler 
http://hadoop.apache.org/common/docs/current/fair_scheduler.html , these 
numbers could mean that you are using around 90% of the reducer slots in your 
cluster.

We should change this to something like: 
The number of reducers you may need for a particular construct in Pig which 
forms a Map Reduce boundary depends entirely on your data and the number of 
intermediate keys you are generating in your mappers. In the best cases we have 
seen that a reducer processing about 500 MB of data behaves efficiently. 
Additionally, it is hard to define the optimum number of reducers, since it 
completely depends on the partitioner and the distribution of map (combiner) 
output keys.

Viraj


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1060) MultiQuery optimization throws error for multi-level splits

2009-11-04 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773744#action_12773744
 ] 

Viraj Bhat commented on PIG-1060:
-

Hi Ankur and Richard,
 I have a script which demonstrates a similar problem, which can be solved by 
using the -M option (see the invocation after the script below). This script 
can reproduce the problem even without the UNION operator, but it has 
properties 1 and 2 of the original problem description.

Try commenting out the store of alias F; the script then works fine.

{code}

ORGINALDATA = load '/user/viraj/somedata.txt' using PigStorage() as (col1, 
col2, col3, col4, col5, col6, col7, col8);

-- Check data
A = foreach ORGINALDATA generate col1, col2, col3, col4, col5, col6;
B = group A all;
C = foreach B generate COUNT(A);
store C into '/user/viraj/result1';

D = filter A by (col1 == col2) or (col1 == col3);
E = group D all;
F = foreach E generate COUNT(D);
-- try commenting out this store
store F into '/user/viraj/result2';

G = filter D by (col4 == col5);
H = group G all;
I = foreach H generate COUNT(G);
store I into '/user/viraj/result3';

J = filter G by (((col6 == 'm') or (col6 == 'M')) and (col6 == 1)) or (((col6 
== 'f') or (col6 == 'F')) and (col6 == 0)) or ((col6 == '') and (col6 == -1));
K = group J all;
L = foreach K generate COUNT(J);
store L into '/user/viraj/result4';

{code}
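
For reference, the -M switch mentioned above simply turns off the multi-query 
optimization; the invocation is just (assuming the script is saved as 
multilevel.pig):

$java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main -M multilevel.pig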



> MultiQuery optimization throws error for multi-level splits
> ---
>
> Key: PIG-1060
> URL: https://issues.apache.org/jira/browse/PIG-1060
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Ankur
>Assignee: Richard Ding
>
> Consider the following scenario :-
> 1. Multi-level splits in the map plan.
> 2. Each split branch further progressing across a local-global rearrange.
> 3. Output of each of these finally merged via a UNION.
> MultiQuery optimizer throws the following error in such a case:
> "ERROR 2146: Internal Error. Inconsistency in key index found during 
> optimization."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1065) Indeterminate behaviour of Union when there are 2 non-matching schemas

2009-10-29 Thread Viraj Bhat (JIRA)
Indeterminate behaviour of Union when there are 2 non-matching schemas


 Key: PIG-1065
 URL: https://issues.apache.org/jira/browse/PIG-1065
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


I have a script which first does a union of two relations with non-matching 
schemas and then does an ORDER BY on the result.

{code}
f1 = LOAD '1.txt' as (key:chararray, v:chararray);
f2 = LOAD '2.txt' as (key:chararray);
u0 = UNION f1, f2;
describe u0;
dump u0;

u1 = ORDER u0 BY $0;
dump u1;
{code}

When I run in Map Reduce mode I get the following result:
$java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig

Schema for u0 unknown.

(1,2)
(2,3)
(1)
(2)

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open 
iterator for alias u1
at org.apache.pig.PigServer.openIterator(PigServer.java:475)
at 
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:397)

Caused by: java.io.IOException: Type mismatch in key from map: expected 
org.apache.pig.impl.io.NullableBytesWritable, recieved 
org.apache.pig.impl.io.NullableText
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)


When I run the same script in local mode I get a different result, as we know 
that local mode does not use any Hadoop classes.
$java -cp pig.jar org.apache.pig.Main -x local broken.pig

Schema for u0 unknown

(1,2)
(1)
(2,3)
(2)

(1,2)
(1)
(2,3)
(2)


Here are some questions:
1) Why do we allow a union if the schemas do not match?
2) Should we not print an error message/warning so that the user knows that 
this is not allowed, or that they can get unexpected results?
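
In the meantime, a sketch of one possible workaround: pad the narrower 
relation so the schemas match before the union (padding here with an empty 
chararray; whether a typed null constant is accepted by this Pig version is 
not verified):

{code}
f1 = LOAD '1.txt' as (key:chararray, v:chararray);
f2 = LOAD '2.txt' as (key:chararray);
f2p = FOREACH f2 GENERATE key, '' as v;  -- pad the missing column
u0 = UNION f1, f2p;
u1 = ORDER u0 BY key;
{code}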

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1064) Behaviour of COGROUP with and without schema when using "*" operator

2009-10-29 Thread Viraj Bhat (JIRA)
Behaviour of COGROUP with and without schema when using "*" operator


 Key: PIG-1064
 URL: https://issues.apache.org/jira/browse/PIG-1064
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


I have 2 tab separated files, "1.txt" and "2.txt"

$ cat 1.txt 
1   2
2   3

$ cat 2.txt 
1   2
2   3

I use COGROUP feature of Pig in the following way:

$java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main

{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt' as (b0, b1);
grunt> C = cogroup A by *, B by *;  
{code}

2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1012: Each COGroup input has to have the same number of inner plans
Details at logfile: pig_1256845224752.log
==

If I reverse, the order of the schema's
{code}
grunt> A = load '1.txt' as (a0, a1);
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;  
{code}
2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1013: Grouping attributes can either be star (*) or a list of expressions, but 
not both.
Details at logfile: pig_1256845224752.log

==
Now, running without a schema:
{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
grunt> dump C; 
{code}

2009-10-29 12:55:37,202 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
stored result in: "file:/tmp/temp-319926700/tmp-1990275961"
2009-10-29 12:55:37,202 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written 
: 2
2009-10-29 12:55:37,202 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 
154
2009-10-29 12:55:37,202 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-10-29 12:55:37,202 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!

((1,2),{(1,2)},{(1,2)})
((2,3),{(2,3)},{(2,3)})
==

Is this a bug or a feature?
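
For comparison, declaring matching schemas on both relations seems to be the 
consistent way to write this (a sketch along the lines of the cases above; I 
have not verified it on 0.6.0):

{code}
grunt> A = load '1.txt' as (a0, a1);
grunt> B = load '2.txt' as (b0, b1);
grunt> C = cogroup A by *, B by *;
grunt> dump C;
{code}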

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1031) PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double

2009-10-20 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1031:


Description: 
I have data stored in a text file as:

{(4153E765)}
{(AF533765)}


I try reading it using PigStorage as:

{code}
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term:bytearray)});
dump A;
{code}

I get the following results:

{code}
({(Infinity)})
({(AF533765)})
{code}

The problem seems to be with the method parseFromBytes(byte[] b) in the class 
Utf8StorageConverter. This method uses the TextDataParser (a class generated 
via jjt) to guess the type of the data from its content, even though the 
schema says it is a bytearray. That is also why "4153E765" comes back as 
Infinity: it matches the DOUBLENUMBER token below and is parsed as the double 
4153e765, which overflows to Infinity, while "AF533765" matches no numeric 
token and survives as text.

TextDataParser.jjt sample code:
{code}
TOKEN :
{
...
 < DOUBLENUMBER: (["-","+"])? <FLOATINGPOINT> ( ["e","E"] ([ "-","+"])? <FLOATINGPOINT> )?>
 < FLOATNUMBER: <DOUBLENUMBER> (["f","F"])? >
...
}
{code}

I tried the following options, but they do not work either, since we still end 
up calling bytesToBag(byte[] b) in the Utf8StorageConverter class.
{code}
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term)});
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term:chararray)});
{code}
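
A workaround sketch that sidesteps the content-based guessing, assuming a Pig 
release that ships the builtin REGEX_EXTRACT (later releases do; I have not 
verified this against 0.5.0): load each line as a plain chararray and strip 
the bag/tuple decoration by hand.

{code}
A = load 'pigstoragebroken.dat' using PigStorage() as (line:chararray);
-- pull the raw term out of '{(...)}' without any numeric interpretation
B = foreach A generate REGEX_EXTRACT(line, '\\{\\((.*)\\)\\}', 1) as term;
dump B;
{code}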


Viraj


> PigStorage interpreting chararray/bytearray for a tuple element inside a bag 
> as float or double
> ---
>
> Key: PIG-1031
> URL: https://issues.apache.org/jira/browse/PIG-1031
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.5.0
>Reporter: Viraj Bhat
> Fix For: 0.5.0, 0.6.0
>
>
> I have data stored in a text file as:
> {(4153E765)}
> {(AF533765)}
> I try reading it using PigStorage as:
> {code}
> A = load 'pigstoragebroken.dat' using PigStorage() as 
> (intersectionBag:bag{T:tuple(term:bytearray)});
> dump A;
> {code}
> I get the following results:
> ({(Infinity)})
> ({(AF533765)})
> The problem seems to be with the method parseFromBytes(byte[] b) in the class 
> Utf8StorageConverter. This method uses the TextDataParser (a class generated 
> via jjt) to guess the type of the data from its content, even though the 
> schema says it is a bytearray. 
> TextDataParser.jjt sample code:
> {code}
> TOKEN :
> {
> ...
>  < DOUBLENUMBER: (["-","+"])? <FLOATINGPOINT> ( ["e","E"] ([ "-","+"])? <FLOATINGPOINT> )?>
>  < FLOATNUMBER: <DOUBLENUMBER> (["f","F"])? >
> ...
> }
> {code}
> I tried the following options, but they do not work either, since we still 
> end up calling bytesToBag(byte[] b) in the Utf8StorageConverter class.
> {code}
> A = load 'pigstoragebroken.dat' using PigStorage() as 
> (intersectionBag:bag{T:tuple(term)});
> A = load 'pigstoragebroken.dat' using PigStorage() as 
> (intersectionBag:bag{T:tuple(term:chararray)});
> {code}
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1031) PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double

2009-10-20 Thread Viraj Bhat (JIRA)
PigStorage interpreting chararray/bytearray for a tuple element inside a bag as 
float or double
---

 Key: PIG-1031
 URL: https://issues.apache.org/jira/browse/PIG-1031
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.5.0
Reporter: Viraj Bhat
 Fix For: 0.5.0, 0.6.0


I have data stored in a text file as:

{(4153E765)}
{(AF533765)}

I try reading it using PigStorage as:
{code}
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term:bytearray)});
dump A;
{code}

I get the following results:

{code}
({(Infinity)})
({(AF533765)})
{code}

The problem seems to be with the method parseFromBytes(byte[] b) in the class 
Utf8StorageConverter. This method uses the TextDataParser (a class generated via 
jjt) to guess the type of the data from its content, even though the schema 
says it is a bytearray. 

TextDataParser.jjt sample code:
{code}
TOKEN :
{
...
 < DOUBLENUMBER: (["-","+"])? <FLOATINGPOINT> ( ["e","E"] ([ "-","+"])? <FLOATINGPOINT> )?>
 < FLOATNUMBER: <DOUBLENUMBER> (["f","F"])? >
...
}
{code}

I tried the following options, but they do not work either, since we still end 
up calling bytesToBag(byte[] b) in the Utf8StorageConverter class.
{code}
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term)});
A = load 'pigstoragebroken.dat' using PigStorage() as 
(intersectionBag:bag{T:tuple(term:chararray)});
{code}


Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-978) ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) and ERROR 2999: (Unexpected internal error. null) when using Multi-Query optimization

2009-09-25 Thread Viraj Bhat (JIRA)
ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) 
and ERROR 2999: (Unexpected internal error. null) when using Multi-Query 
optimization
---

 Key: PIG-978
 URL: https://issues.apache.org/jira/browse/PIG-978
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


I have a Pig script of this form, which I execute using Multi-query 
optimization.

{code}
A = load '/user/viraj/firstinput' using PigStorage();
B = group 
C = ..aggregation function
store C into '/user/viraj/firstinputtempresult/days1';
..
Atab = load '/user/viraj/secondinput' using PigStorage();
Btab = group 
Ctab = ..aggregation function
store Ctab into '/user/viraj/secondinputtempresult/days1';
..
E = load '/user/viraj/firstinputtempresult/' using PigStorage();
F = group 
G = aggregation function
store G into '/user/viraj/finalresult1';

Etab = load '/user/viraj/secondinputtempresult/' using PigStorage();
Ftab = group 
Gtab = aggregation function
store Gtab into '/user/viraj/finalresult2';
{code}


The error

2009-07-20 22:05:44,507 [main] ERROR org.apache.pig.tools.grunt.GruntParser - 
ERROR 2100: hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist. 
Details at logfile: /homes/viraj/pigscripts/pig_1248127173601.log

is due to a mismatch between the store and load commands. The script first 
stores files into the 'days1' directory (store C into 
'/user/viraj/firstinputtempresult/days1' using PigStorage();), but it later 
loads from the top-level directory (E = load 
'/user/viraj/firstinputtempresult/' using PigStorage()) instead of the original 
directory (/user/viraj/firstinputtempresult/days1).

The current multi-query optimizer cannot resolve the dependency between these 
two commands, because they refer to different file paths. So the jobs run 
concurrently and produce the error above.

The solution is to add an 'exec' or 'run' command after the first two stores. 
This forces the first two store commands to run before the remaining commands.
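
A sketch of the restructured script (same paths as above; my understanding is 
that a bare exec runs everything queued so far):

{code}
store C into '/user/viraj/firstinputtempresult/days1';
store Ctab into '/user/viraj/secondinputtempresult/days1';
-- force the two stores above to complete before the loads below start
exec;
E = load '/user/viraj/firstinputtempresult/' using PigStorage();
{code}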

It would be nice to see this fixed as part of an enhancement to the Multi-query 
optimization: either disable Multi-query in this situation, or throw a 
warning/error message so that the user can correct the load/store statements.

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-974) Issues with mv command when used after store when using -param_file/-param options

2009-09-23 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758962#action_12758962
 ] 

Viraj Bhat commented on PIG-974:


It turns out that the problem was due to single quotes.
{code}
mv '$finalop' '$finalmove';
{code}

This modified version of the script should work:
{code}
mv $finalop $finalmove;
{code}

The hard part is knowing when to use single quotes around parameters and when 
not to; this is not documented in the manual.
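
My working rule of thumb after this, which is an assumption on my part rather 
than something from the manual:

{code}
-- Pig Latin statements treat paths as string literals, so quote them:
store Y into '$finalop' using PigStorage();
-- Grunt file system commands take bare paths, so leave the quotes off:
mv $finalop $finalmove;
{code}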

The error message is also confusing:
===
java.io.IOException: File or directory '/user/viraj/finaloutput' does not exist.
===

I thought that the single quotes around the filename printed in the error 
message referred to the correct file name.

{code}
$shell>hadoop fs -ls '/user/viraj/finaloutput' 
Found 1 items
-rw---   3 viraj users420 2009-09-24 01:16 
/user/viraj/finaloutput/part-0
{code}

Thanks Viraj

> Issues with mv command when used after store when using -param_file/-param 
> options
> --
>
> Key: PIG-974
> URL: https://issues.apache.org/jira/browse/PIG-974
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
> Environment: Hadoop 18 and 20
>Reporter: Viraj Bhat
> Fix For: 0.6.0
>
> Attachments: studenttab10k
>
>
> I have a Pig script which moves the final output to another HDFS directory to 
> signal completion, so that another Pig script can start working on these 
> results.
> {code}
> studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, 
> age:int,gpa:float);
> X = GROUP studenttab by age;
> Y = FOREACH X GENERATE group, COUNT(studenttab);
> store Y into '$finalop' using PigStorage();
> mv '$finalop' '$finalmove';
> {code}
> where "finalop" and "finalmove" are parameters used storing intermediate and 
> final results.
> I run this script as this:
> {code}
> $shell> java -cp pig20.jar:/path/tohadoop/site.xml 
> -Dmapred.job.queue.name=default org.apache.pig.Main -M -param 
> finalop=/user/viraj/finaloutput -param finalmove=/user/viraj/finalmove 
> testmove.pig 
> {code}
> or using the param_file option
> {code}
> $shell>java -cp pig20.jar:/path/tohadoop/site.xml 
> -Dmapred.job.queue.name=default org.apache.pig.Main -M -param_file 
> moveparamfile  testmove.pig
> {code}
> 
> The underlying Map Reduce jobs run well but the move command seems to be 
> failing:
> 
> 2009-09-23 23:26:21,781 [main] INFO  org.apache.pig.Main - Logging error 
> messages to: /homes/viraj/pigscripts/pig_1253748381778.log
> 2009-09-23 23:26:21,963 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to hadoop file system at: hdfs://localhost:8020
> 2009-09-23 23:26:22,227 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to map-reduce job tracker at: localhost:50300
> 2009-09-23 23:26:27,187 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer
>  - Choosing to move algebraic foreach to combiner
> 2009-09-23 23:26:27,203 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size before optimization: 1
> 2009-09-23 23:26:27,203 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size after optimization: 1
> 2009-09-23 23:26:28,828 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>  - Setting up single store job
> 2009-09-23 23:26:29,423 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
> Use GenericOptionsParser for parsing the arguments. Applications should 
> implement Tool for the same.
> 2009-09-23 23:26:29,478 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 0% complete
> 2009-09-23 23:27:29,828 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 50% complete
> 2009-09-23 23:27:59,764 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 50% complete
> 2009-09-23 23:28:57,249 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 100% complete
> 2009-09-23 23:28:57,249 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Successfully stored result in: "/user/viraj/finaloutput"
> 2009-09-23 23:28:57,267 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapRed

[jira] Updated: (PIG-974) Issues with mv command when used after store when using -param_file/-param options

2009-09-23 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-974:
---

Attachment: studenttab10k

Test data

> Issues with mv command when used after store when using -param_file/-param 
> options
> --
>
> Key: PIG-974
> URL: https://issues.apache.org/jira/browse/PIG-974
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
> Environment: Hadoop 18 and 20
>Reporter: Viraj Bhat
> Fix For: 0.6.0
>
> Attachments: studenttab10k
>
>
> I have a Pig script which moves the final output to another HDFS directory to 
> signal completion, so that another Pig script can start working on these 
> results.
> {code}
> studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, 
> age:int,gpa:float);
> X = GROUP studenttab by age;
> Y = FOREACH X GENERATE group, COUNT(studenttab);
> store Y into '$finalop' using PigStorage();
> mv '$finalop' '$finalmove';
> {code}
> where "finalop" and "finalmove" are parameters used storing intermediate and 
> final results.
> I run this script as this:
> {code}
> $shell> java -cp pig20.jar:/path/tohadoop/site.xml 
> -Dmapred.job.queue.name=default org.apache.pig.Main -M -param 
> finalop=/user/viraj/finaloutput -param finalmove=/user/viraj/finalmove 
> testmove.pig 
> {code}
> or using the param_file option
> {code}
> $shell>java -cp pig20.jar:/path/tohadoop/site.xml 
> -Dmapred.job.queue.name=default org.apache.pig.Main -M -param_file 
> moveparamfile  testmove.pig
> {code}
> 
> The underlying Map Reduce jobs run well but the move command seems to be 
> failing:
> 
> 2009-09-23 23:26:21,781 [main] INFO  org.apache.pig.Main - Logging error 
> messages to: /homes/viraj/pigscripts/pig_1253748381778.log
> 2009-09-23 23:26:21,963 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to hadoop file system at: hdfs://localhost:8020
> 2009-09-23 23:26:22,227 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to map-reduce job tracker at: localhost:50300
> 2009-09-23 23:26:27,187 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer
>  - Choosing to move algebraic foreach to combiner
> 2009-09-23 23:26:27,203 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size before optimization: 1
> 2009-09-23 23:26:27,203 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size after optimization: 1
> 2009-09-23 23:26:28,828 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>  - Setting up single store job
> 2009-09-23 23:26:29,423 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
> Use GenericOptionsParser for parsing the arguments. Applications should 
> implement Tool for the same.
> 2009-09-23 23:26:29,478 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 0% complete
> 2009-09-23 23:27:29,828 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 50% complete
> 2009-09-23 23:27:59,764 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 50% complete
> 2009-09-23 23:28:57,249 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 100% complete
> 2009-09-23 23:28:57,249 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Successfully stored result in: "/user/viraj/finaloutput"
> 2009-09-23 23:28:57,267 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Records written : 60
> 2009-09-23 23:28:57,267 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Bytes written : 420
> 2009-09-23 23:28:57,267 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Success!
> 2009-09-23 23:28:57,367 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2998: Unhandled internal error. File or directory '/user/viraj/finaloutput' 
> does not exist.
> Details at logfile: /homes/viraj/pigscripts/pig_1253748381778.log
> 
> {code}
> $shell> hadoop fs -ls /user/viraj/finaloutput 
> Found 1 items
> -rw---   3 viraj users420 2009-09-23 23:42 
> /user/viraj/finaloutput/part-0
> {code}
> 

[jira] Created: (PIG-974) Issues with mv command when used after store when using -param_file/-param options

2009-09-23 Thread Viraj Bhat (JIRA)
Issues with mv command when used after store when using -param_file/-param 
options
--

 Key: PIG-974
 URL: https://issues.apache.org/jira/browse/PIG-974
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
 Environment: Hadoop 18 and 20
Reporter: Viraj Bhat
 Fix For: 0.6.0
 Attachments: studenttab10k

I have a Pig script which moves the final output to another HDFS directory to 
signal completion, so that another Pig script can start working on these 
results.
{code}
studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, 
age:int,gpa:float);
X = GROUP studenttab by age;
Y = FOREACH X GENERATE group, COUNT(studenttab);
store Y into '$finalop' using PigStorage();
mv '$finalop' '$finalmove';
{code}

where "finalop" and "finalmove" are parameters used storing intermediate and 
final results.

I run this script as this:
{code}
$shell> java -cp pig20.jar:/path/tohadoop/site.xml 
-Dmapred.job.queue.name=default org.apache.pig.Main -M -param 
finalop=/user/viraj/finaloutput -param finalmove=/user/viraj/finalmove 
testmove.pig 
{code}
or using the param_file option
{code}
$shell>java -cp pig20.jar:/path/tohadoop/site.xml 
-Dmapred.job.queue.name=default org.apache.pig.Main -M -param_file 
moveparamfile  testmove.pig
{code}

The underlying Map Reduce jobs run well but the move command seems to be 
failing:

2009-09-23 23:26:21,781 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /homes/viraj/pigscripts/pig_1253748381778.log
2009-09-23 23:26:21,963 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: hdfs://localhost:8020
2009-09-23 23:26:22,227 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
map-reduce job tracker at: localhost:50300
2009-09-23 23:26:27,187 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer 
- Choosing to move algebraic foreach to combiner
2009-09-23 23:26:27,203 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size before optimization: 1
2009-09-23 23:26:27,203 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size after optimization: 1
2009-09-23 23:26:28,828 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler 
- Setting up single store job
2009-09-23 23:26:29,423 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
Use GenericOptionsParser for parsing the arguments. Applications should 
implement Tool for the same.
2009-09-23 23:26:29,478 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 0% complete
2009-09-23 23:27:29,828 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 50% complete
2009-09-23 23:27:59,764 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 50% complete
2009-09-23 23:28:57,249 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 100% complete
2009-09-23 23:28:57,249 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Successfully stored result in: "/user/viraj/finaloutput"
2009-09-23 23:28:57,267 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Records written : 60
2009-09-23 23:28:57,267 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Bytes written : 420
2009-09-23 23:28:57,267 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Success!
2009-09-23 23:28:57,367 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. File or directory '/user/viraj/finaloutput' 
does not exist.
Details at logfile: /homes/viraj/pigscripts/pig_1253748381778.log

{code}
$shell> hadoop fs -ls /user/viraj/finaloutput 
Found 1 items
-rw---   3 viraj users420 2009-09-23 23:42 
/user/viraj/finaloutput/part-0
{code}

Opening the log file:

Pig Stack Trace
---
ERROR 2998: Unhandled internal error. File or directory 
'/user/viraj/finaloutput' does not exist.

java.io.IOException: File or directory '/user/viraj/finaloutput' does not exist.
at 
org.apache.pig.tools.grunt.GruntParser.processM

[jira] Commented: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig

2009-08-31 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749722#action_12749722
 ] 

Viraj Bhat commented on PIG-940:


One important point to add:
{code}
localmachine.company.com prompt> hadoop fs -ls 
hdfs://remotemachine1.company.com/user/viraj//*.txt
-rw-r--r--   3 viraj users 13 2009-08-13 23:42 /user/viraj/A1.txt
-rw-r--r--   3 viraj users  8 2009-08-29 00:51 /user/viraj/B1.txt
{code}

> Cross site HDFS access using the default.fs.name not possible in Pig
> 
>
> Key: PIG-940
> URL: https://issues.apache.org/jira/browse/PIG-940
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.3.0
> Environment: Hadoop 20
>Reporter: Viraj Bhat
> Fix For: 0.3.0
>
>
> I have a script which does the following: it accesses data from a remote HDFS 
> location (an HDFS instance at hdfs://remotemachine1.company.com/), as I do not 
> want to copy this huge amount of data between HDFS locations.
> However, I want my Pig script to write data to the HDFS instance running on 
> localmachine.company.com.
> Currently Pig does not support that behavior and complains: 
> "hdfs://localmachine.company.com/user/viraj/A1.txt does not exist"
> {code}
> A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b); 
> B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d); 
> C = JOIN A by a, B by c; 
> store C into 'output' using PigStorage();  
> {code}
> ===
> 2009-09-01 00:37:24,032 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to hadoop file system at: hdfs://localmachine.company.com:8020
> 2009-09-01 00:37:24,277 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to map-reduce job tracker at: localmachine.company.com:50300
> 2009-09-01 00:37:24,567 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer
>  - Rewrite: POPackage->POForEach to POJoinPackage
> 2009-09-01 00:37:24,573 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size before optimization: 1
> 2009-09-01 00:37:24,573 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>  - MR plan size after optimization: 1
> 2009-09-01 00:37:26,197 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>  - Setting up single store job
> 2009-09-01 00:37:26,249 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
> Use GenericOptionsParser for parsing the arguments. Applications should 
> implement Tool for the same.
> 2009-09-01 00:37:26,746 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 0% complete
> 2009-09-01 00:37:26,746 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 100% complete
> 2009-09-01 00:37:26,747 [main] ERROR 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - 1 map reduce job(s) failed!
> 2009-09-01 00:37:26,756 [main] ERROR 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Failed to produce result in: 
> "hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480"
> 2009-09-01 00:37:26,756 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>  - Failed!
> 2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
> Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log
> ===
> The error file in Pig contains:
> ===
> ERROR 2998: Unhandled internal error. 
> org.apache.pig.backend.executionengine.ExecException: ERROR 2100: 
> hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
> at 
> org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126)
> at 
> org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
> at 
> org.apache.pig.impl.io.ValidatingInputFileSpec.(ValidatingInputFileSpec.java:44)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228)
> at 

[jira] Created: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig

2009-08-31 Thread Viraj Bhat (JIRA)
Cross site HDFS access using the default.fs.name not possible in Pig


 Key: PIG-940
 URL: https://issues.apache.org/jira/browse/PIG-940
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
 Environment: Hadoop 20
Reporter: Viraj Bhat
 Fix For: 0.3.0


I have a script which does the following: it accesses data from a remote HDFS 
location (an HDFS instance at hdfs://remotemachine1.company.com/), as I do not 
want to copy this huge amount of data between HDFS locations.

However, I want my Pig script to write data to the HDFS instance running on 
localmachine.company.com.

Currently Pig does not support that behavior and complains: 
"hdfs://localmachine.company.com/user/viraj/A1.txt does not exist"

{code}
A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b); 
B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d); 
C = JOIN A by a, B by c; 
store C into 'output' using PigStorage();  
{code}
===
2009-09-01 00:37:24,032 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: hdfs://localmachine.company.com:8020
2009-09-01 00:37:24,277 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
map-reduce job tracker at: localmachine.company.com:50300
2009-09-01 00:37:24,567 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer
 - Rewrite: POPackage->POForEach to POJoinPackage
2009-09-01 00:37:24,573 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size before optimization: 1
2009-09-01 00:37:24,573 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size after optimization: 1
2009-09-01 00:37:26,197 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler 
- Setting up single store job
2009-09-01 00:37:26,249 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - 
Use GenericOptionsParser for parsing the arguments. Applications should 
implement Tool for the same.
2009-09-01 00:37:26,746 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 0% complete
2009-09-01 00:37:26,746 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 100% complete
2009-09-01 00:37:26,747 [main] ERROR 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- 1 map reduce job(s) failed!
2009-09-01 00:37:26,756 [main] ERROR 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed to produce result in: 
"hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480"
2009-09-01 00:37:26,756 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed!
2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log
===

The error file in Pig contains:
===
ERROR 2998: Unhandled internal error. 
org.apache.pig.backend.executionengine.ExecException: ERROR 2100: 
hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
at 
org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126)
at 
org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
at 
org.apache.pig.impl.io.ValidatingInputFileSpec.(ValidatingInputFileSpec.java:44)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at 
org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)

java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: 
ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not 

[jira] Updated: (PIG-921) Strange use case for Join which produces different results in local and map reduce mode

2009-08-13 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-921:
---

Attachment: joinusecase.pig
B.txt
A.txt

Script with test data.

> Strange use case for Join which produces different results in local and map 
> reduce mode
> ---
>
> Key: PIG-921
> URL: https://issues.apache.org/jira/browse/PIG-921
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.3.0
> Environment: Hadoop 18 and Hadoop 20
>Reporter: Viraj Bhat
> Fix For: 0.3.0
>
> Attachments: A.txt, B.txt, joinusecase.pig
>
>
> I have a script in this manner, which loads from 2 files, A.txt and B.txt:
> {code}
> A = LOAD 'A.txt' as (a:tuple(a1:int, a2:chararray));
> B = LOAD 'B.txt' as (b:tuple(b1:int, b2:chararray));
> C = JOIN A by a.a1, B by b.b1;
> DESCRIBE C;
> DUMP C;
> {code}
> A.txt contains the following lines:
> {code}
> (1,a)
> (2,aa)
> {code}
> B.txt contains the following lines:
> {code}
> (1,b)
> (2,bb)
> {code}
> Now running the above script in local and map reduce mode on Hadoop 18 & 
> Hadoop 20, produces the following:
> Hadoop 18
> =
> (1,1)
> (2,2)
> =
> Hadoop 20
> =
> (1,1)
> (2,2)
> =
> Local Mode: Pig with Hadoop 18 jar release 
> =
> 2009-08-13 17:15:13,473 [main] INFO  org.apache.pig.Main - Logging error 
> messages to: /homes/viraj/pig-svn/trunk/pigscripts/pig_1250208913472.log
> 09/08/13 17:15:13 INFO pig.Main: Logging error messages to: 
> /homes/viraj/pig-svn/trunk/pigscripts/pig_1250208913472.log
> C: {a: (a1: int,a2: chararray),b: (b1: int,b2: chararray)}
> 2009-08-13 17:15:13,932 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1002: Unable to store alias C
> 09/08/13 17:15:13 ERROR grunt.Grunt: ERROR 1002: Unable to store alias C
> Details at logfile: 
> /homes/viraj/pig-svn/trunk/pigscripts/pig_1250208913472.log
> =
> Caused by: java.lang.NullPointerException
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNext(POPackage.java:206)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:191)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.local.executionengine.physicalLayer.counters.POCounter.getNext(POCounter.java:71)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
> at 
> org.apache.pig.backend.local.executionengine.LocalPigLauncher.runPipeline(LocalPigLauncher.java:146)
> at 
> org.apache.pig.backend.local.executionengine.LocalPigLauncher.launchPig(LocalPigLauncher.java:109)
> at 
> org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:165)
> ... 9 more
> =
> Local Mode: Pig with Hadoop 20 jar release
> =
> ((1,a),(1,b))
> ((2,aa),(2,bb))
> =

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-921) Strange use case for Join which produces different results in local and map reduce mode

2009-08-13 Thread Viraj Bhat (JIRA)
Strange use case for Join which produces different results in local and map 
reduce mode
---

 Key: PIG-921
 URL: https://issues.apache.org/jira/browse/PIG-921
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
 Environment: Hadoop 18 and Hadoop 20
Reporter: Viraj Bhat
 Fix For: 0.3.0


I have a script in this manner, which loads from 2 files, A.txt and B.txt:
{code}
A = LOAD 'A.txt' as (a:tuple(a1:int, a2:chararray));
B = LOAD 'B.txt' as (b:tuple(b1:int, b2:chararray));
C = JOIN A by a.a1, B by b.b1;
DESCRIBE C;
DUMP C;
{code}

A.txt contains the following lines:
{code}
(1,a)
(2,aa)
{code}


B.txt contains the following lines:
{code}
(1,b)
(2,bb)
{code}

Now running the above script in local and map reduce mode on Hadoop 18 & Hadoop 
20, produces the following:

Hadoop 18
=
(1,1)
(2,2)
=
Hadoop 20
=
(1,1)
(2,2)
=
Local Mode: Pig with Hadoop 18 jar release 
=
2009-08-13 17:15:13,473 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /homes/viraj/pig-svn/trunk/pigscripts/pig_1250208913472.log
09/08/13 17:15:13 INFO pig.Main: Logging error messages to: 
/homes/viraj/pig-svn/trunk/pigscripts/pig_1250208913472.log
C: {a: (a1: int,a2: chararray),b: (b1: int,b2: chararray)}
2009-08-13 17:15:13,932 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1002: Unable to store alias C
09/08/13 17:15:13 ERROR grunt.Grunt: ERROR 1002: Unable to store alias C
Details at logfile: /homes/viraj/pig-svn/trunk/pigscripts/pig_1250208913472.log
=
Caused by: java.lang.NullPointerException
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNext(POPackage.java:206)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:191)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
at 
org.apache.pig.backend.local.executionengine.physicalLayer.counters.POCounter.getNext(POCounter.java:71)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
at 
org.apache.pig.backend.local.executionengine.LocalPigLauncher.runPipeline(LocalPigLauncher.java:146)
at 
org.apache.pig.backend.local.executionengine.LocalPigLauncher.launchPig(LocalPigLauncher.java:109)
at 
org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:165)
... 9 more
=
Local Mode: Pig with Hadoop 20 jar release
=
((1,a),(1,b))
((2,aa),(2,bb))
=
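
As an aside, a workaround sketch that avoids joining on tuple-typed fields 
altogether (my assumption; I have not verified it on 0.3.0) is to flatten the 
tuples first:

{code}
A = LOAD 'A.txt' as (a:tuple(a1:int, a2:chararray));
B = LOAD 'B.txt' as (b:tuple(b1:int, b2:chararray));
-- flatten the single-tuple rows into plain scalar fields
A1 = FOREACH A GENERATE FLATTEN(a);
B1 = FOREACH B GENERATE FLATTEN(b);
-- a1 and b1 are unambiguous within A1 and B1, so they can be join keys
C = JOIN A1 BY a1, B1 BY b1;
DUMP C;
{code}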

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


