[jira] [Updated] (PIG-4059) Pig on Spark
[ https://issues.apache.org/jira/browse/PIG-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-4059: --- Labels: spork (was: ) Pig on Spark Key: PIG-4059 URL: https://issues.apache.org/jira/browse/PIG-4059 Project: Pig Issue Type: New Feature Reporter: Rohini Palaniswamy Assignee: Praveen Rachabattuni Labels: spork Attachments: Pig-on-Spark-Design-Doc.pdf There is a lot of interest in adding Spark as a backend execution engine for Pig. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067289#comment-14067289 ] Dmitriy V. Ryaboy commented on PIG-3558: Nice. How much does this increase the weight of the pig build, and what packages does it pull in? I assume this won't get pushed to trunk until hive 0.14.0-SNAPSHOT becomes available as a stable version? ORC support for Pig --- Key: PIG-3558 URL: https://issues.apache.org/jira/browse/PIG-3558 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Assignee: Daniel Dai Labels: porc Fix For: 0.14.0 Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch, PIG-3558-4.patch, PIG-3558-5.patch, PIG-3558-6.patch Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3558: --- Labels: porc (was: ) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904574#comment-13904574 ] Dmitriy V. Ryaboy commented on PIG-3558: I am pro adding ORC support in Pig, but against introducing massive dependencies. According to http://mvnrepository.com/artifact/org.apache.hive/hive-exec/0.12.0 the hive-exec jar for 0.12 is 9 megs, and hides within it specific versions of jackson, snappy, org.json, chunks of thrift, hadoop.io (?!), avro, commons, protobuf, and guava. If ORC authors are not interested in reducing their dependency hygiene, they have to live with the fact that their project is unlikely to get integrated into other projects. This is self-inflicted jar hell. Please don't do this. When ORC cleans up their dependencies, let's revisit. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904574#comment-13904574 ] Dmitriy V. Ryaboy edited comment on PIG-3558 at 2/18/14 8:55 PM: I am pro adding ORC support in Pig, but against introducing massive dependencies. According to http://mvnrepository.com/artifact/org.apache.hive/hive-exec/0.12.0 the hive-exec jar for 0.12 is 9 megs, and hides within it specific versions of jackson, snappy, org.json, chunks of thrift, hadoop.io (?!), avro, commons, protobuf, and guava. If ORC authors are not interested in improving their dependency hygiene, they have to live with the fact that their project is unlikely to get integrated into other projects. This is self-inflicted jar hell. Please don't do this. When ORC cleans up their dependencies, let's revisit. was (Author: dvryaboy): the same comment, with "reducing their dependency hygiene" in place of "improving". -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904590#comment-13904590 ] Dmitriy V. Ryaboy commented on PIG-3558: So that's a -1. I would +1 this if it were going into piggybank. Since this depends on unpublished changes, I'd rather we unlink it from the 0.13 release (as that would tie us to Hive's release schedule -- obviously we can't make a release that depends on a snapshot). -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904627#comment-13904627 ] Dmitriy V. Ryaboy commented on PIG-3558: [~daijy] not quite:
{code}
- conf="test->master"/>
+ conf="compile->master"/>
{code}
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904663#comment-13904663 ] Dmitriy V. Ryaboy commented on PIG-3558: Help me understand this. My understanding is as follows: Compile is minimum required to compile main code. Test is minimum required to compile main code + stuff needed to test (hence, extends). Pushing a dependency up to compile means everything, not just test, needs the dependency. Also, the bump from 0.8 to 0.12 is 6 megs worth of code. That's a pretty big version bump. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
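The compile/test distinction in the comment above maps onto standard Ivy configuration inheritance. A hypothetical ivy.xml fragment (names and revisions illustrative, not Pig's actual build file) sketching those semantics:

```xml
<!-- "test" extends "compile", so everything on the compile path is also on the
     test path, but a dependency declared only under test never reaches the
     main build artifacts. -->
<configurations>
  <conf name="compile" description="minimum needed to build main code"/>
  <conf name="test" extends="compile" description="compile deps plus test-only deps"/>
</configurations>
<dependencies>
  <!-- conf="test->master" keeps hive-exec off the compile path;
       changing it to compile->master makes every consumer pull it in -->
  <dependency org="org.apache.hive" name="hive-exec" rev="0.12.0" conf="test->master"/>
</dependencies>
```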
[jira] [Commented] (PIG-3456) Reduce threadlocal conf access in backend for each record
[ https://issues.apache.org/jira/browse/PIG-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13896199#comment-13896199 ] Dmitriy V. Ryaboy commented on PIG-3456: Added a couple minor comments. Good change overall. BTW not sure if you saw, but PIG-3325 addressed the bag insertion regression you saw as a side effect of PIG-2923 without sacrificing the memory and gc benefits 2923 provides, so if you still have that reverted in your build, consider un-reverting. Reduce threadlocal conf access in backend for each record - Key: PIG-3456 URL: https://issues.apache.org/jira/browse/PIG-3456 Project: Pig Issue Type: Improvement Affects Versions: 0.11.1 Reporter: Rohini Palaniswamy Assignee: Rohini Palaniswamy Fix For: 0.13.0 Attachments: PIG-3456-1-no-whitespace.patch, PIG-3456-1.patch Noticed a few things while browsing code: 1) DefaultTuple has a protected boolean isNull = false; which is never used. Removing this gives ~3-5% improvement for big jobs. 2) Config checking with ThreadLocal conf is repeatedly done for each record, e.g. createDataBag in POCombinerPackage, but is initialized only once in other places like POPackage, POJoinPackage, etc. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3456) Reduce threadlocal conf access in backend for each record
[ https://issues.apache.org/jira/browse/PIG-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887098#comment-13887098 ] Dmitriy V. Ryaboy commented on PIG-3456: Could you post a patch without the whitespace changes (for ease of review) and some microbenchmark results? I had some microbenchmark code in PIG-3325, that might help bootstrap you here. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
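The per-record cost this ticket targets is easy to see in isolation. Below is a minimal, self-contained microbenchmark sketch (not the actual PIG-3325 harness mentioned above) contrasting a ThreadLocal lookup on every record with reading the value once and caching it, which is the pattern the ticket proposes:

```java
// Contrast a ThreadLocal lookup per record against a cached value.
// The ThreadLocal stands in for Pig's thread-local conf; 42 is a dummy value.
public class ThreadLocalBench {
    static final ThreadLocal<Integer> CONF = ThreadLocal.withInitial(() -> 42);

    // The pattern the ticket wants to remove: a lookup for every record.
    static long sumWithLookupPerRecord(int records) {
        long sum = 0;
        for (int i = 0; i < records; i++) sum += CONF.get();
        return sum;
    }

    // The proposed pattern: read once at init, reuse for every record.
    static long sumWithCachedValue(int records) {
        long sum = 0;
        int cached = CONF.get();
        for (int i = 0; i < records; i++) sum += cached;
        return sum;
    }

    public static void main(String[] args) {
        int records = 10_000_000;
        long t0 = System.nanoTime();
        long a = sumWithLookupPerRecord(records);
        long t1 = System.nanoTime();
        long b = sumWithCachedValue(records);
        long t2 = System.nanoTime();
        System.out.printf("lookup-per-record: %d ms, cached: %d ms (sums %d/%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, b);
    }
}
```

Both loops compute the same sum, so the only difference being timed is the lookup; a real harness would also warm up the JIT before measuring.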
[jira] [Updated] (PIG-3672) pig should not hardcode hdfs:// path in code, should be configurable to other file system implementations
[ https://issues.apache.org/jira/browse/PIG-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3672: --- Status: Open (was: Patch Available) cancelling patch available status given Rohini's comments -- please make patch available again when a new patch is submitted pig should not hardcode hdfs:// path in code, should be configurable to other file system implementations --- Key: PIG-3672 URL: https://issues.apache.org/jira/browse/PIG-3672 Project: Pig Issue Type: Bug Components: data, parser Affects Versions: 0.11.1, 0.12.0, 0.10.0 Reporter: Suhas Satish Assignee: Suhas Satish Attachments: PIG-3672-1.patch, PIG-3672-2.patch, PIG-3672.patch QueryParserUtils.java has the code - result.add("hdfs://" + thisHost + ":" + uri.getPort()); I propose to make it generic like - result.add(uri.getScheme() + "://" + thisHost + ":" + uri.getPort()); Similarly JobControlCompiler.java has - if (!outputPathString.contains("://") || outputPathString.startsWith("hdfs://")) { I have a patch version which I ran passing unit tests on. Will be uploading it shortly. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3299) Provide support for LazyOutputFormat to avoid creating empty files
[ https://issues.apache.org/jira/browse/PIG-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887109#comment-13887109 ] Dmitriy V. Ryaboy commented on PIG-3299: [~daijy] shall we commit this? Provide support for LazyOutputFormat to avoid creating empty files -- Key: PIG-3299 URL: https://issues.apache.org/jira/browse/PIG-3299 Project: Pig Issue Type: Improvement Affects Versions: 0.11.1 Reporter: Rohini Palaniswamy Assignee: Lorand Bendig Attachments: PIG-3299.patch LazyOutputFormat (HADOOP-4927) in hadoop is a wrapper to avoid creating part files if there is no records output. It would be good to add support for that by having a configuration in pig which wraps storeFunc.getOutputFormat() with LazyOutputFormat. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3347: --- Priority: Critical (was: Major) Store invocation brings side effect --- Key: PIG-3347 URL: https://issues.apache.org/jira/browse/PIG-3347 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.11 Environment: local mode Reporter: Sergey Assignee: Daniel Dai Priority: Critical Fix For: 0.12.1 Attachments: PIG-3347-1.patch The problem is that an intermediate 'store' invocation changes the final store output. Looks like it brings some kind of side effect. We used 'local' mode to run the script. Here is the input data: 1 1 Here is the script:
{code}
a = load 'test';
a_group = group a by $0;
b = foreach a_group {
  a_distinct = distinct a.$0;
  generate group, a_distinct;
}
--store b into 'b';
c = filter b by SIZE(a_distinct) == 1;
store c into 'out';
{code}
We expect the output to be: 1 1 The output is an empty file. Uncomment the {code}--store b into 'b';{code} line and see the difference. You would get the expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887114#comment-13887114 ] Dmitriy V. Ryaboy commented on PIG-3347: Yikes. [~aniket486] [~julienledem] this seems like a critical bug to look at. Julien, you investigated this UID situation before, right? -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache
[ https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13883478#comment-13883478 ] Dmitriy V. Ryaboy commented on PIG-2672: [~knoguchi] in the spirit of keeping things moving -- can we commit this? You can feel free to turn the behavior off on your cluster if you are worried about the 1 week boundary. If that's the case, feel free to open another ticket to follow up, or to make sure that YARN-1492 fixes your issue. Optimize the use of DistributedCache Key: PIG-2672 URL: https://issues.apache.org/jira/browse/PIG-2672 Project: Pig Issue Type: Improvement Reporter: Rohini Palaniswamy Fix For: 0.13.0 Attachments: PIG-2672-5.patch, PIG-2672.patch Pig currently copies jar files to a temporary location in hdfs and then adds them to DistributedCache for each job launched. This is inefficient in terms of:
* Space - The jars are distributed to task trackers for every job, taking up a lot of local temporary space on tasktrackers.
* Performance - The jar distribution impacts the job launch time.
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache
[ https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13880084#comment-13880084 ] Dmitriy V. Ryaboy commented on PIG-2672: Seems like there is a lot of effort being spent here reinventing what is already designed for the general use case in the YARN ticket Aniket linked. Let's not let the best be the enemy of the good, and just get something in that will be decent for most cases; if people don't like it, they can turn it off. This is an intermediate solution until that YARN patch goes in, at which point all of this becomes moot. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852133#comment-13852133 ] Dmitriy V. Ryaboy commented on PIG-3630: Is this an AvroStorage or data issue?
grunt> import '/Users/dmitriy/tmp/tf_idf.macro';
grunt> register build/ivy/lib/Pig/avro-1.7.4.jar
grunt> register build/ivy/lib/Pig/json-simple-1.1.jar
grunt> register contrib/piggybank/java/piggybank.jar
grunt> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
grunt> emails = load '/Users/dmitriy/Downloads/enron.avro';
grunt> describe emails
Schema for emails unknown.
(this is the same in both pig 0.11 and pig 0.12). Can you provide a simple reproducible use case that doesn't involve Avro, etc? Can you share what debugging you've done so far? Macros that work in Pig 0.11 fail in Pig 0.12 :( Key: PIG-3630 URL: https://issues.apache.org/jira/browse/PIG-3630 Project: Pig Issue Type: Bug Components: parser Affects Versions: 0.12.0 Reporter: Russell Jurney http://my.safaribooksonline.com/book/databases/9781449326890/7dot-exploring-data-with-reports/i_sect13_id196600_html The ntf-idf macro listed there works under 0.11. Under 0.12, it results in this:
13/12/16 22:09:19 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2013-12-16 22:09:19,159 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0-SNAPSHOT (rUnversioned directory) compiled Dec 09 2013, 14:37:29
2013-12-16 22:09:19,159 [main] INFO org.apache.pig.Main - Logging error messages to: /private/tmp/pig_1387260559120.log
2013-12-16 22:09:19.268 java[38060:1903] Unable to load realm info from SCDynamicStore
2013-12-16 22:09:19,528 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: at expanding macro 'tf_idf' (per_business.pig:9) file per_business.pig, line 35, column 17 Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.plan.PlanValidationException: ERROR 1025: at expanding macro 'tf_idf' (per_business.pig:9) file per_business.pig, line 35, column 17 Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:191)
at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:174)
at org.apache.pig.newplan.logical.visitor.ColumnAliasConversionVisitor$1.visit(ColumnAliasConversionVisitor.java:53)
at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:215)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:142)
at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:128)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:124)
at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:76)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1694)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1686)
at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1387)
at org.apache.pig.PigServer.execute(PigServer.java:1302)
at org.apache.pig.PigServer.executeBatch(PigServer.java:391)
at org.apache.pig.PigServer.executeBatch(PigServer.java:369)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:133)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:195)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:600)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852221#comment-13852221 ] Dmitriy V. Ryaboy commented on PIG-3630: Sure enough. Once I add that, everything works in 0.12 and now I can't reproduce the bug you are reporting. My pig is:
[tw-mbp13-dryaboy-2 pig-0.12]$ ./bin/pig -version
Apache Pig version 0.12.0-SNAPSHOT (r1526044) compiled Dec 18 2013, 12:15:04
same with more recent:
[tw-mbp13-dryaboy-2 pig-0.12]$ ./bin/pig -version
Apache Pig version 0.12.1-SNAPSHOT (r1552124) compiled Dec 18 2013, 14:00:21
Back to you to get a reproducible test case. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852272#comment-13852272 ] Dmitriy V. Ryaboy commented on PIG-3630: That one fails in both 0.11 and 0.12. Do you have something that works in 11 but fails in 12? -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852281#comment-13852281 ] Dmitriy V. Ryaboy commented on PIG-3630:

Actually that failed in 11 due to missing register statements. It does work in 11 if you work around the Avro stuff. Ok, now we have something to look at...

Macros that work in Pig 0.11 fail in Pig 0.12 :(
Key: PIG-3630
URL: https://issues.apache.org/jira/browse/PIG-3630
Project: Pig
Issue Type: Bug
Components: parser
Affects Versions: 0.12.0
Reporter: Russell Jurney

http://my.safaribooksonline.com/book/databases/9781449326890/7dot-exploring-data-with-reports/i_sect13_id196600_html

The ntf-idf macro listed there works under 0.11. Under 0.12, it results in this:

{noformat}
13/12/16 22:09:19 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2013-12-16 22:09:19,159 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0-SNAPSHOT (rUnversioned directory) compiled Dec 09 2013, 14:37:29
2013-12-16 22:09:19,159 [main] INFO org.apache.pig.Main - Logging error messages to: /private/tmp/pig_1387260559120.log
2013-12-16 22:09:19.268 java[38060:1903] Unable to load realm info from SCDynamicStore
2013-12-16 22:09:19,528 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: at expanding macro 'tf_idf' (per_business.pig:9) file per_business.pig, line 35, column 17 Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.plan.PlanValidationException: ERROR 1025: at expanding macro 'tf_idf' (per_business.pig:9) file per_business.pig, line 35, column 17 Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
    at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:191)
    at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:174)
    at org.apache.pig.newplan.logical.visitor.ColumnAliasConversionVisitor$1.visit(ColumnAliasConversionVisitor.java:53)
    at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:215)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:142)
    at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:128)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:124)
    at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:76)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.PigServer$Graph.compile(PigServer.java:1694)
    at org.apache.pig.PigServer$Graph.compile(PigServer.java:1686)
    at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1387)
    at org.apache.pig.PigServer.execute(PigServer.java:1302)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:391)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:369)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:133)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:195)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:600)
    at org.apache.pig.Main.main(Main.java:156)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
{noformat}
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852299#comment-13852299 ] Dmitriy V. Ryaboy commented on PIG-3630:

Now that registers are in place, it works in 12 as well:
{code}
Input(s):
Successfully read records from: /Users/dmitriy/Downloads/trimmed_reviews.avro

Output(s):
Successfully stored records in: file:///Users/dmitriy/src/pig-0.12/tmp/pig_12_ntf_idf_scores

Job DAG:
job_local_0001 -> job_local_0003,job_local_0002,
job_local_0003 -> job_local_0005,
job_local_0002 -> job_local_0004,
job_local_0004 -> job_local_0005,
job_local_0005

2013-12-18 15:22:02,012 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
{code}
Back to you...

Macros that work in Pig 0.11 fail in Pig 0.12 :(
Key: PIG-3630
URL: https://issues.apache.org/jira/browse/PIG-3630
[jira] [Assigned] (PIG-3621) Python Avro library can't read Avros made with builtin AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3621:

Assignee: (was: Dmitriy V. Ryaboy)

Uh, no thanks :)

Python Avro library can't read Avros made with builtin AvroStorage
Key: PIG-3621
URL: https://issues.apache.org/jira/browse/PIG-3621
Project: Pig
Issue Type: Bug
Components: internal-udfs
Affects Versions: 0.12.0
Reporter: Russell Jurney
Fix For: 0.12.1, 0.13.0
Attachments: PIG-3631-2.patch, PIG-3631.patch

Using this script:

{code}
from avro import schema, datafile, io
import pprint
import sys
import json

field_id = None  # Optional key to print
if (len(sys.argv) > 2):
    field_id = sys.argv[2]

# Test reading avros
rec_reader = io.DatumReader()

# Create a 'data file' (avro file) reader
df_reader = datafile.DataFileReader(
    open(sys.argv[1]),
    rec_reader
)
{code}

the last line fails with:

{noformat}
Traceback (most recent call last):
  File "/Users/rjurney/bin/cat_avro", line 22, in <module>
    rec_reader
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/datafile.py", line 247, in __init__
    self.datum_reader.writers_schema = schema.parse(self.get_meta(SCHEMA_KEY))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", line 784, in parse
    return make_avsc_object(json_data, names)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", line 740, in make_avsc_object
    return RecordSchema(name, namespace, fields, names, type, doc, other_props)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", line 653, in __init__
    other_props)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", line 294, in __init__
    new_name = names.add_name(name, namespace, self)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", line 268, in add_name
    raise SchemaParseException(fail_msg)
avro.schema.SchemaParseException: record is a reserved type name.
{noformat}
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
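The traceback shows the failure happening while parsing the writer's schema: "record" is one of Avro's reserved type names, so a schema whose top-level record is literally named "record" is rejected by the Python avro library. The check can be sketched in a few lines of plain Python; the helper name, the reserved-name set (taken from the Avro specification), and the example schema are illustrative assumptions, not code from Pig or the avro library:

```python
# Hypothetical sketch of why a record named "record" is rejected.
# AVRO_RESERVED_NAMES follows the reserved type names in the Avro spec.
import json

AVRO_RESERVED_NAMES = {
    "null", "boolean", "int", "long", "float", "double", "bytes", "string",
    "record", "enum", "array", "map", "union", "fixed",
}

def validate_record_name(schema_json):
    """Raise ValueError when a record schema's name collides with a
    reserved type name, mirroring avro's SchemaParseException."""
    parsed = json.loads(schema_json)
    name = parsed.get("name")
    if name in AVRO_RESERVED_NAMES:
        raise ValueError("%s is a reserved type name." % name)
    return name

# A top-level schema shaped like the one AvroStorage reportedly writes:
bad_schema = '{"type": "record", "name": "record", "fields": []}'
```

Renaming the record (e.g. `"name": "pig_schema"`) makes the same schema parse cleanly, which is why the fix is on the writing side.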
[jira] [Commented] (PIG-3621) Python Avro library can't read Avros made with builtin AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852441#comment-13852441 ] Dmitriy V. Ryaboy commented on PIG-3621:

Sorry, that was a no to the assignment. Cheolsoo, does that var get set elsewhere? Why remove the logic for checking empty string, etc, and using a default?

Python Avro library can't read Avros made with builtin AvroStorage
Key: PIG-3621
URL: https://issues.apache.org/jira/browse/PIG-3621
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3621) Python Avro library can't read Avros made with builtin AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852500#comment-13852500 ] Dmitriy V. Ryaboy commented on PIG-3621:

+1

Python Avro library can't read Avros made with builtin AvroStorage
Key: PIG-3621
URL: https://issues.apache.org/jira/browse/PIG-3621
Attachments: PIG-3621-3.patch, PIG-3631-2.patch, PIG-3631.patch
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851239#comment-13851239 ] Dmitriy V. Ryaboy commented on PIG-3630:

Could you link to the code directly, rather than the book? The Safari website is giving me interstitials and other unpleasant things. Have you investigated the schemas of relations referred to in the error message, and checked if your field references make sense?

{noformat}
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: at expanding macro 'tf_idf' (per_business.pig:9) file per_business.pig, line 35, column 17 Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
{noformat}

Macros that work in Pig 0.11 fail in Pig 0.12 :(
Key: PIG-3630
URL: https://issues.apache.org/jira/browse/PIG-3630
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851412#comment-13851412 ] Dmitriy V. Ryaboy commented on PIG-3630:

That macro does not refer to a field called tf_idf. Could you post a fully reproducible test case?

Macros that work in Pig 0.11 fail in Pig 0.12 :(
Key: PIG-3630
URL: https://issues.apache.org/jira/browse/PIG-3630
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813080#comment-13813080 ] Dmitriy V. Ryaboy commented on PIG-3453:

Mridul: In our experience at Twitter, Trident introduces pretty high overhead; in Summingbird, we relax the data delivery guarantees to get better throughput, and use Storm directly. Perhaps you want to try putting pig on top of Summingbird? If you did that, we might even be able to help :). In any case, interested in seeing how all of this will turn out.

Cheolsoo: No real objections to an svn branch. In the past I've found it far easier to cooperate on significant branches on github than to maintain an svn branch (you can easily have multiple branches, reviews are easier, etc). That's how Bill Graham and I did the HBaseStorage rewrite a few years back. But really that's up to the developers doing the work.

Implement a Storm backend to Pig
Key: PIG-3453
URL: https://issues.apache.org/jira/browse/PIG-3453
Project: Pig
Issue Type: New Feature
Affects Versions: 0.13.0
Reporter: Pradeep Gollakota
Assignee: Jacob Perkins
Labels: storm
Fix For: 0.13.0
Attachments: storm-integration.patch

There is a lot of interest around implementing a Storm backend to Pig for streaming processing. The proposal and initial discussions can be found at https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813150#comment-13813150 ] Dmitriy V. Ryaboy commented on PIG-3453:

Oh, I absolutely just meant collaboration on the initial contribution to happen on github, for expediency and fast iteration. Of course once this work is in a committable/mergeable state, it should go into Apache.

Implement a Storm backend to Pig
Key: PIG-3453
URL: https://issues.apache.org/jira/browse/PIG-3453
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811726#comment-13811726 ] Dmitriy V. Ryaboy commented on PIG-3453:

I don't see why Jacob can't keep working in a github branch... easier to look at what's changing, and he can keep merging the (read-only) git mirror from apache to keep up with changes. Jacob, I see you are using Trident. Have you looked at your throughput numbers, vs going directly to storm?

Implement a Storm backend to Pig
Key: PIG-3453
URL: https://issues.apache.org/jira/browse/PIG-3453
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3549) Print hadoop jobids for failed, killed job
[ https://issues.apache.org/jira/browse/PIG-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808007#comment-13808007 ] Dmitriy V. Ryaboy commented on PIG-3549:

OMG. Thanks. +1.

Print hadoop jobids for failed, killed job
Key: PIG-3549
URL: https://issues.apache.org/jira/browse/PIG-3549
Project: Pig
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
Fix For: 0.12.1
Attachments: PIG-3549.patch

It would be better if we dumped the hadoop job ids for failed and killed jobs in the pig log. Right now, the log looks like the following:
{noformat}
ERROR org.apache.pig.tools.grunt.Grunt: ERROR 6017: Job failed! Error - NA
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher: Job job_pigexec_1 killed
{noformat}
From that it's hard to say which hadoop job failed if there are multiple jobs running in parallel.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807458#comment-13807458 ] Dmitriy V. Ryaboy commented on PIG-3453:

[~azaroth]: may I suggest https://github.com/twitter/algebird for this and many other approximate counting use cases? :-) Already in use by scalding, summingbird, and spark.

Implement a Storm backend to Pig
Key: PIG-3453
URL: https://issues.apache.org/jira/browse/PIG-3453
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784627#comment-13784627 ] Dmitriy V. Ryaboy commented on PIG-3445:

That's a great addition, thanks Lorand. The code looks really tidy now. Looks like ParquetUtil is actually a general util? Maybe add that functionality to org.apache.pig.impl.util.JarManager or something along those lines? [~julienledem] do we need to publish a new artifact version so fastutil isn't required for dictionary encoding?

Make Parquet format available out of the box in Pig
Key: PIG-3445
URL: https://issues.apache.org/jira/browse/PIG-3445
Project: Pig
Issue Type: Improvement
Reporter: Julien Le Dem
Fix For: 0.12.0
Attachments: PIG-3445-2.patch, PIG-3445-3.patch, PIG-3445.patch

We would add the Parquet jar to the Pig packages to make it available out of the box to pig users. On top of that we could add the parquet.pig package to the list of packages to search for UDFs. (Alternatively, the parquet jar could contain classes named org.apache.pig.builtin.ParquetLoader and ParquetStorer.) This way users can use Parquet simply by typing:
{code}
A = LOAD 'foo' USING ParquetLoader();
STORE A INTO 'bar' USING ParquetStorer();
{code}
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3082) outputSchema of a UDF allows two usages when describing a Tuple schema
[ https://issues.apache.org/jira/browse/PIG-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784720#comment-13784720 ] Dmitriy V. Ryaboy commented on PIG-3082:

So... that's a breaking change; a bunch of UDFs will fail under 12. Intended?

outputSchema of a UDF allows two usages when describing a Tuple schema
Key: PIG-3082
URL: https://issues.apache.org/jira/browse/PIG-3082
Project: Pig
Issue Type: Bug
Reporter: Julien Le Dem
Assignee: Jonathan Coveney
Fix For: 0.12.0
Attachments: PIG-3082-0.patch, PIG-3082-1.patch

When defining an evalfunc that returns a Tuple, there are two ways you can implement outputSchema():
- The right way: return a schema that contains one Field, which carries the type and schema of the return type of the UDF.
- The unreliable way: return a schema that contains more than one field; it will be understood as a tuple schema even though there is no type (which lives in the Field class) to specify that.

This is particularly deceitful when the output schema is derived from the input schema and the outputted Tuple sometimes contains only one field. In such cases Pig understands the output schema as a tuple only if there is more than one field, so sometimes it works and sometimes it does not. We should at least issue a warning (for backward compatibility), if not plainly throw an exception, when the output schema contains more than one Field.
-- This message was sent by Atlassian JIRA (v6.1#6144)
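The ambiguity described in PIG-3082 can be illustrated with a small toy model. This is a hypothetical, simplified sketch in Python, not Pig's actual Schema/FieldSchema API: a consumer that guesses "more than one field means a tuple schema" necessarily treats one-field and two-field outputs of the same UDF differently, which is exactly the unreliable path.

```python
# Toy model of the ambiguity (not Pig's real Schema API).
# A schema is a list of fields; a field is (name, type, inner_fields).
TUPLE = "tuple"
CHARARRAY = "chararray"

def interpret_output_schema(fields):
    """Mimics the unreliable heuristic: more than one field means
    'treat the whole list as a tuple schema'; exactly one field means
    'treat it as a plain field'."""
    if len(fields) > 1:
        return ("tuple-of", [f[0] for f in fields])
    return ("field", fields[0][0])

def right_way(inner_fields):
    """The unambiguous form: a single TUPLE-typed field that nests the
    real fields, so the shape no longer depends on the field count."""
    return [("t", TUPLE, inner_fields)]

# Same UDF, two different output widths:
two = interpret_output_schema([("a", CHARARRAY, None), ("b", CHARARRAY, None)])
one = interpret_output_schema([("a", CHARARRAY, None)])
# 'two' is read as a tuple, 'one' as a bare field -- the inconsistency
# the issue describes.
```

With `right_way`, both the one-field and two-field cases produce a single field of type TUPLE, which is why the description calls it the right way.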
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783614#comment-13783614 ] Dmitriy V. Ryaboy commented on PIG-3445:

[~lbendig] might be more succinct to use StoreFuncWrapper?

Make Parquet format available out of the box in Pig
Key: PIG-3445
URL: https://issues.apache.org/jira/browse/PIG-3445
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13782112#comment-13782112 ] Dmitriy V. Ryaboy commented on PIG-3480: That is fine with me, let's make SequenceFile optional. It will let people avoid the bug I am encountering, and also do things like use snappy compression. TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12.0 Attachments: PIG-3480.patch When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Affects Version/s: 0.12 Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.12, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Dmitriy V. Ryaboy Priority: Critical Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
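The slowdown traced to PIG-2923 comes from doing a memory estimate on every add. A hedged sketch of the general mitigation (hypothetical code, not the actual PIG-3325 patch): estimate only every Nth added tuple and extrapolate, which keeps the spill accounting while removing most of the per-add cost.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of cheap bag-size accounting: instead of estimating the
// memory footprint of every tuple added (the slow path PIG-3325 describes),
// estimate only every Nth tuple and extrapolate from the sampled average.
public class SampledSizeBag {
    private static final int SAMPLE_EVERY = 100;
    private final List<Object> tuples = new ArrayList<>();
    private long sampledBytes = 0;
    private int sampleCount = 0;

    public void add(Object tuple) {
        tuples.add(tuple);
        if (tuples.size() % SAMPLE_EVERY == 1) {  // sample the 1st, 101st, 201st, ...
            sampledBytes += estimateBytes(tuple); // the expensive call, now rare
            sampleCount++;
        }
    }

    // Average sampled size times total tuple count.
    public long estimatedTotalBytes() {
        if (sampleCount == 0) return 0;
        return (sampledBytes / sampleCount) * tuples.size();
    }

    // Stand-in for a real per-object memory estimator.
    private static long estimateBytes(Object tuple) {
        return 64; // pretend every tuple costs 64 bytes
    }

    public int size() { return tuples.size(); }

    public static void main(String[] args) {
        SampledSizeBag bag = new SampledSizeBag();
        for (int i = 0; i < 250; i++) bag.add("tuple-" + i);
        System.out.println(bag.estimatedTotalBytes()); // 16000 (= 64 bytes * 250 tuples)
    }
}
```

The tradeoff is accuracy of the spill threshold versus add() throughput; sampling trades a little of the former for a lot of the latter.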
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Fix Version/s: 0.12 Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization -- Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.12 Attachments: PIG-3479.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Fix Version/s: 0.12 Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.12, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Dmitriy V. Ryaboy Priority: Critical Fix For: 0.12 Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776172#comment-13776172 ] Dmitriy V. Ryaboy commented on PIG-3445: The size of the dependency introduced by this is orders of magnitude smaller than the HBase (or Avro) one, since everything comes from a single project (unlike HBase's liberal use of guava, metrics, ZK, and everything else under the sun). The total size is less than 1 meg. Can we add parquet.pig to the UDF import list in the same patch? Make Parquet format available out of the box in Pig --- Key: PIG-3445 URL: https://issues.apache.org/jira/browse/PIG-3445 Project: Pig Issue Type: Improvement Reporter: Julien Le Dem Attachments: PIG-3445.patch We would add the Parquet jar in the Pig packages to make it available out of the box to Pig users. On top of that we could add the parquet.pig package to the list of packages to search for UDFs. (Alternatively, the Parquet jar could contain classes named org.apache.pig.builtin.ParquetLoader and ParquetStorer.) This way users can use Parquet simply by typing: A = LOAD 'foo' USING ParquetLoader(); STORE A INTO 'bar' USING ParquetStorer(); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3445: --- Fix Version/s: 0.12 Make Parquet format available out of the box in Pig --- Key: PIG-3445 URL: https://issues.apache.org/jira/browse/PIG-3445 Project: Pig Issue Type: Improvement Reporter: Julien Le Dem Fix For: 0.12 Attachments: PIG-3445.patch We would add the Parquet jar in the Pig packages to make it available out of the box to Pig users. On top of that we could add the parquet.pig package to the list of packages to search for UDFs. (Alternatively, the Parquet jar could contain classes named org.apache.pig.builtin.ParquetLoader and ParquetStorer.) This way users can use Parquet simply by typing: A = LOAD 'foo' USING ParquetLoader(); STORE A INTO 'bar' USING ParquetStorer(); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3480) TFile-based tmpfile compression crashes in some cases
Dmitriy V. Ryaboy created PIG-3480: -- Summary: TFile-based tmpfile compression crashes in some cases Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776602#comment-13776602 ] Dmitriy V. Ryaboy commented on PIG-3480: For most of the tasks that fail, no stack trace is available on Hadoop 1 (they just die with nonzero status 134). I did catch one task with a stack trace: {code} java.io.IOException: Error while reading compressed data at org.apache.hadoop.io.IOUtils.wrappedReadForCompressedData(IOUtils.java:205) at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:342) at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:373) at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:357) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:389) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(Map {code} No idea if this is relevant. This problem does happen consistently -- 100% of the time on my script that shows this problem. Anecdotally, about 1/10 of our production scripts encounter this; I have not been able to establish a pattern yet. TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776602#comment-13776602 ] Dmitriy V. Ryaboy edited comment on PIG-3480 at 9/24/13 6:36 PM: - For most of the tasks that fail, no stack trace is available on Hadoop 1 (they just die with nonzero status 134). I did catch one task with a stack trace: {code} java.io.IOException: Error while reading compressed data at org.apache.hadoop.io.IOUtils.wrappedReadForCompressedData(IOUtils.java:205) at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:342) at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:373) at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:357) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:389) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(Map {code} No idea if this is relevant. This problem does happen consistently -- 100% of the time on my script that shows this problem. Anecdotally, about 1/10 of our production scripts encounter this; I have not been able to establish a pattern yet. was (Author: dvryaboy): For most of the tasks that fail, no stack trace is available on Hadoop 1 (they just die with nonzero status 134). 
I did catch one task with a stack trace: {code} java.io.IOException: Error while reading compressed data at org.apache.hadoop.io.IOUtils.wrappedReadForCompressedData(IOUtils.java:205) at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:342) at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:373) at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:357) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:389) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(Map {code} No idea if this is relevant. This problem does happen consistently -- 100% of the time on my script that shows this problem. Anecdotally, about 1/10 of our production scripts encounter this; I have not been able to establish a pattern yet. TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3480: --- Attachment: PIG-3480.patch Attaching a rough patch which replaces use of TFile with SequenceFile. Next steps: - evaluate effect on size of compressed data for TFile vs SeqFile when TFile does work - add tests, make TFile tests pass (in this file they fail, because of course TFile is not being used) - make SeqFile the default method, since it doesn't break - allow TFile use by a switch, since current users may want to keep it. I would prefer to not do that, but might if the first step shows significant differences. Thoughts? Especially from folks using TFile-based compression in production ([~rohini]?) TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 Attachments: PIG-3480.patch When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
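For reference, Pig's temp-file compression is driven by properties along these lines. pig.tmpfilecompression and pig.tmpfilecompression.codec are documented Pig settings; the storage switch is the optional toggle discussed in this ticket's patch, so treat its name and values as tentative:

```properties
# Enable compression of Pig's intermediate (tmp) files between MR jobs.
pig.tmpfilecompression=true
# Codec: gz works out of the box; lzo requires the codec to be installed.
pig.tmpfilecompression.codec=gz
# Hypothetical switch from the PIG-3480 discussion: use SequenceFile
# instead of TFile as the container for compressed tmp files.
pig.tmpfilecompression.storage=seqfile
```

With the switch defaulting to SequenceFile, existing TFile users could opt back in while everyone else avoids the crash described above.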
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1377#comment-1377 ] Dmitriy V. Ryaboy commented on PIG-3480: [~knoguchi] yeah, I'm not sure the stack trace is relevant -- it's the only part that's not consistent about this. The problem goes away when I set pig.tmpfilecompression to false, or when I replace TFile with SequenceFile. I've also seen stack traces that were inside TFile, and had to do with some LZO decoding issues.. the actual error is really hard to capture, other than the fact that mappers fail consistently. TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 Attachments: PIG-3480.patch When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776732#comment-13776732 ] Dmitriy V. Ryaboy commented on PIG-3480: Rohini, do you guys use lzo or gz compression? Maybe it's just lzo that's breaking. I can test gz. That never actually occurred to me, I just assumed this is completely busted because I could never get it to work (since 2010..) TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 Attachments: PIG-3480.patch When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776728#comment-13776728 ] Dmitriy V. Ryaboy commented on PIG-3480: Rohini I suspect this might be something about complex data types, which afaik are pretty rare at Y! and extremely common at Twitter. TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 Attachments: PIG-3480.patch When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Attachment: PIG-3479.whitespace.patch Same patch, but with whitespace changes. Committing this. Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization -- Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.12.0 Attachments: PIG-3479.patch, PIG-3479.whitespace.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Resolution: Fixed Release Note: Skewed join internals improved to get 10% or better improvement on reducers by eliminating unnecessary reflection. Status: Resolved (was: Patch Available) Committed to trunk and 0.12 Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization -- Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.12.0 Attachments: PIG-3479.patch, PIG-3479.whitespace.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
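The release note above credits the reducer speedup to "eliminating unnecessary reflection." A hedged sketch of that kind of change (hypothetical classes, not the actual PIG-3479 patch): replace reflective instantiation on the per-record deserialization path with a direct dispatch on the serialized type byte.

```java
// Hypothetical sketch: the NullableInt/NullableLong classes stand in for
// Pig's NullableXWritable family; names and type bytes are illustrative only.
public class WritableFactory {
    public static class NullableInt  { Integer value; }
    public static class NullableLong { Long value; }

    static final byte INT = 0, LONG = 1;

    // Slow path: reflective instantiation on every deserialized record.
    static Object createByReflection(String className) throws Exception {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }

    // Fast path: dispatch directly on the type byte read from the stream,
    // no class lookup or access checks per record.
    static Object createByType(byte type) {
        switch (type) {
            case INT:  return new NullableInt();
            case LONG: return new NullableLong();
            default:   throw new IllegalArgumentException("unknown type: " + type);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(createByType(INT).getClass().getSimpleName());  // NullableInt
        System.out.println(
            createByReflection(NullableLong.class.getName())
                .getClass().getSimpleName());                              // NullableLong
    }
}
```

Both paths build the same objects; the win is purely in moving the class-resolution work out of the per-record loop.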
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776879#comment-13776879 ] Dmitriy V. Ryaboy commented on PIG-3445: Other loaders like csv, avro, json, xml, etc. (even RC, though it's in piggybank due to heavy dependencies and lack of support) are all in already, so I don't see this as unfair, but as consistent. Not packaging the Parquet jars into the Pig monojar and instead adding them, the way we add guava et al. for HBase, sounds like a good idea. [~julienledem] should we do that by providing a simple wrapper in Pig builtins, or by messing with the job conf in Parquet's own loader/storer? Make Parquet format available out of the box in Pig --- Key: PIG-3445 URL: https://issues.apache.org/jira/browse/PIG-3445 Project: Pig Issue Type: Improvement Reporter: Julien Le Dem Fix For: 0.12.0 Attachments: PIG-3445.patch We would add the Parquet jar in the Pig packages to make it available out of the box to Pig users. On top of that we could add the parquet.pig package to the list of packages to search for UDFs. (Alternatively, the Parquet jar could contain classes named org.apache.pig.builtin.ParquetLoader and ParquetStorer.) This way users can use Parquet simply by typing: A = LOAD 'foo' USING ParquetLoader(); STORE A INTO 'bar' USING ParquetStorer(); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3479: -- Assignee: Dmitriy V. Ryaboy Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization -- Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: PIG-3479.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Attachment: PIG-3479.patch Attaching a patch. I extended an existing test to test the serialization... it's the only place we test Nullables at all :(. Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization -- Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: PIG-3479.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
Dmitriy V. Ryaboy created PIG-3479: -- Summary: Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: PIG-3479.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache
[ https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13773636#comment-13773636 ] Dmitriy V. Ryaboy commented on PIG-2672: Aniket, can we prefix the properties with pig.? That way we won't conflict with potential properties from Hadoop, and it's a little easier to analyze stuff when looking at the jobconf. Optimize the use of DistributedCache Key: PIG-2672 URL: https://issues.apache.org/jira/browse/PIG-2672 Project: Pig Issue Type: Improvement Reporter: Rohini Palaniswamy Assignee: Aniket Mokashi Fix For: 0.12 Attachments: PIG-2672.patch Pig currently copies jar files to a temporary location in hdfs and then adds them to DistributedCache for each job launched. This is inefficient in terms of * Space - The jars are distributed to task trackers for every job taking up lot of local temporary space in tasktrackers. * Performance - The jar distribution impacts the job launch time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
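The naming convention requested above is easy to sketch: Pig-specific settings carry a "pig." prefix so they cannot collide with Hadoop's own configuration keys and are easy to spot in a jobconf dump. The property names below are illustrative examples in the spirit of the proposal, not necessarily the final PIG-2672 names.

```java
import java.util.Properties;

// Sketch of the "pig."-prefix convention for jar-cache settings
// (hypothetical property names, for illustration).
public class PrefixedProps {
    public static Properties pigProps() {
        Properties p = new Properties();
        p.setProperty("pig.user.cache.enabled", "true");
        p.setProperty("pig.user.cache.location", "/tmp/pig_cache");
        return p;
    }

    // True when every key in p carries the given prefix -- the invariant
    // that keeps Pig settings from shadowing Hadoop ones.
    public static boolean allPrefixed(Properties p, String prefix) {
        return p.stringPropertyNames().stream().allMatch(k -> k.startsWith(prefix));
    }

    public static void main(String[] args) {
        System.out.println(allPrefixed(pigProps(), "pig.")); // true
    }
}
```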
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768649#comment-13768649 ] Dmitriy V. Ryaboy commented on PIG-3419: +1 to marking the interfaces as evolving. Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Assignee: Achal Soni Priority: Minor Fix For: 0.12 Attachments: execengine.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_failures.txt, test_suite.patch, updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch, updated-8-29-2013-exec-engine.patch In an effort to adapt Pig to work using Apache Tez (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for a cleaner ExecutionEngine abstraction than existed before. The changes are not that major as Pig was already relatively abstracted out between the frontend and backend. The changes in the attached commit are essentially the barebones changes -- I tried to not change the structure of Pig's different components too much. I think it will be interesting to see in the future how we can refactor more areas of Pig to really honor this abstraction between the frontend and backend. Some of the changes were to reinstate an ExecutionEngine interface to tie together the frontend and backend, making the changes in Pig to delegate to the EE when necessary, and creating an MRExecutionEngine that implements this interface. Other work included changing ExecType to cycle through the ExecutionEngines on the classpath and select the appropriate one (this is done using Java ServiceLoader, exactly how MapReduce does for choosing the framework to use between local and distributed mode). Also I tried to make ScriptState, JobStats, and PigStats as abstract as possible in their current state. 
I think in the future some work will need to be done here to perhaps re-evaluate the usage of ScriptState and the responsibilities of the different statistics classes. I haven't touched the PPNL, but I think more abstraction is needed here, perhaps in a separate patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767220#comment-13767220 ] Dmitriy V. Ryaboy commented on PIG-3419: This is not just for Tez. The point is to enable POC work (in branches, forks, etc.) and not have each such attempt redo all the work in this ticket. It's the same reason we provide things like pluggable LoadFuncs: to let people load things we didn't think of loading. We should certainly work to stabilize 0.12 and fix issues like PIG-3457 Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Assignee: Achal Soni Priority: Minor Fix For: 0.12 Attachments: execengine.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_failures.txt, test_suite.patch, updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch, updated-8-29-2013-exec-engine.patch In an effort to adapt Pig to work using Apache Tez (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for a cleaner ExecutionEngine abstraction than existed before. The changes are not that major as Pig was already relatively abstracted out between the frontend and backend. The changes in the attached commit are essentially the barebones changes -- I tried to not change the structure of Pig's different components too much. I think it will be interesting to see in the future how we can refactor more areas of Pig to really honor this abstraction between the frontend and backend. Some of the changes were to reinstate an ExecutionEngine interface to tie together the frontend and backend, making the changes in Pig to delegate to the EE when necessary, and creating an MRExecutionEngine that implements this interface. 
Other work included changing ExecType to cycle through the ExecutionEngines on the classpath and select the appropriate one (this is done using Java ServiceLoader, exactly how MapReduce does for choosing the framework to use between local and distributed mode). Also I tried to make ScriptState, JobStats, and PigStats as abstract as possible in its current state. I think in the future some work will need to be done here to perhaps re-evaluate the usage of ScriptState and the responsibilities of the different statistics classes. I haven't touched the PPNL, but I think more abstraction is needed here, perhaps in a separate patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2965) RANDOM should allow seed initialization for ease of testing
[ https://issues.apache.org/jira/browse/PIG-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13760313#comment-13760313 ] Dmitriy V. Ryaboy commented on PIG-2965: A UDF essentially has a constructor and an exec method. foreach lines generate udf(foo) calls the exec method and passes it the foo parameter. define udfinstance udf(foo) passes foo to the constructor and binds an instance of the UDF, initialized that way, to udfinstance (so you can have many differently initialized UDFs in the same script). You can read more about all this in the docs for the define keyword and in the UDF author's guide. RANDOM should allow seed initialization for ease of testing --- Key: PIG-2965 URL: https://issues.apache.org/jira/browse/PIG-2965 Project: Pig Issue Type: Bug Reporter: Aneesh Sharma Assignee: Jonathan Coveney Labels: newbie Attachments: PIG-2965-0.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2965) RANDOM should allow seed initialization for ease of testing
[ https://issues.apache.org/jira/browse/PIG-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759609#comment-13759609 ] Dmitriy V. Ryaboy commented on PIG-2965: [~sdeneefe] are you sure you are using it right? I just tested and it works. Here's a test script you can run a few times:
{code}
define rand RANDOM('12345');
lines = load 'random.pig';
r = foreach lines generate rand();
dump r;
{code}
Run using `pig -x local random.pig 2>/dev/null`. RANDOM should allow seed initialization for ease of testing --- Key: PIG-2965 URL: https://issues.apache.org/jira/browse/PIG-2965 Project: Pig Issue Type: Bug Reporter: Aneesh Sharma Assignee: Jonathan Coveney Labels: newbie Attachments: PIG-2965-0.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
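Why the seeded form is reproducible can be sketched in plain Java. `SeededRandom` is a hypothetical analogue of `RANDOM('12345')`, not Pig's implementation: the seed string reaches the constructor once, so every run draws the same sequence.

```java
import java.util.Random;

public class SeedDemo {
    // Hypothetical analogue of: define rand RANDOM('12345');
    // The seed is parsed in the constructor, not per call, so repeated
    // runs of the same script produce identical output -- which is the
    // whole point for testing.
    public static class SeededRandom {
        private final Random rng;
        public SeededRandom(String seed) { rng = new Random(Long.parseLong(seed)); }
        public double exec() { return rng.nextDouble(); }
    }

    public static void main(String[] args) {
        SeededRandom r1 = new SeededRandom("12345");
        SeededRandom r2 = new SeededRandom("12345");
        // Same constructor seed, same sequence.
        System.out.println(r1.exec() == r2.exec()); // true
    }
}
```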
[jira] [Commented] (PIG-3048) Add mapreduce workflow information to job configuration
[ https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13753942#comment-13753942 ] Dmitriy V. Ryaboy commented on PIG-3048: no objections. after all, usage of the config info is purely optional. We've run into trouble before with information of this sort becoming very big and triggering JobConf too large errors. Might want to look at compression at some point. Add mapreduce workflow information to job configuration --- Key: PIG-3048 URL: https://issues.apache.org/jira/browse/PIG-3048 Project: Pig Issue Type: Improvement Reporter: Billie Rinaldi Assignee: Billie Rinaldi Fix For: 0.11.2 Attachments: PIG-3048.patch, PIG-3048.patch, PIG-3048.patch Adding workflow properties to the job configuration would enable logging and analysis of workflows in addition to individual MapReduce jobs. Suggested properties include a workflow ID, workflow name, adjacency list connecting nodes in the workflow, and the name of the current node in the workflow. mapreduce.workflow.id - a unique ID for the workflow, ideally prepended with the application name e.g. pig_pigScriptId mapreduce.workflow.name - a name for the workflow, to distinguish this workflow from other workflows and to group different runs of the same workflow e.g. pig command line mapreduce.workflow.adjacency - an adjacency list for the workflow graph, encoded as mapreduce.workflow.adjacency.source node = comma-separated list of target nodes mapreduce.workflow.node.name - the name of the node corresponding to this MapReduce job in the workflow adjacency list -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
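The proposed properties can be sketched with `java.util.Properties` standing in for the Hadoop job configuration. The `scope-N` node names and the script name below are made up for illustration; the key names follow the ticket's description.

```java
import java.util.Properties;

public class WorkflowConfDemo {
    // Sketch of the proposed mapreduce.workflow.* keys; Properties stands
    // in for the Hadoop JobConf, and the scope-N node names are invented.
    public static Properties build() {
        Properties conf = new Properties();
        conf.setProperty("mapreduce.workflow.id", "pig_scriptId123");        // app-prefixed ID
        conf.setProperty("mapreduce.workflow.name", "pig daily_report.pig"); // e.g. command line
        // Adjacency list: one property per source node, comma-separated targets.
        conf.setProperty("mapreduce.workflow.adjacency.scope-1", "scope-2,scope-3");
        conf.setProperty("mapreduce.workflow.adjacency.scope-2", "scope-4");
        conf.setProperty("mapreduce.workflow.node.name", "scope-1");         // this MR job's node
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("mapreduce.workflow.adjacency.scope-1"));
    }
}
```

The adjacency encoding keeps each property small, though as the comment notes, large workflows could still push the JobConf toward its size limits.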
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754235#comment-13754235 ] Dmitriy V. Ryaboy commented on PIG-3419: [~billgraham] looping you in for Ambrose. Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Assignee: Achal Soni Priority: Minor Attachments: execengine.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_failures.txt, test_suite.patch, updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch, updated-8-29-2013-exec-engine.patch In an effort to adapt Pig to work using Apache Tez (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for a cleaner ExecutionEngine abstraction than existed before. The changes are not that major as Pig was already relatively abstracted out between the frontend and backend. The changes in the attached commit are essentially the barebones changes -- I tried to not change the structure of Pig's different components too much. I think it will be interesting to see in the future how we can refactor more areas of Pig to really honor this abstraction between the frontend and backend. Some of the changes was to reinstate an ExecutionEngine interface to tie together the front end and backend, and making the changes in Pig to delegate to the EE when necessary, and creating an MRExecutionEngine that implements this interface. Other work included changing ExecType to cycle through the ExecutionEngines on the classpath and select the appropriate one (this is done using Java ServiceLoader, exactly how MapReduce does for choosing the framework to use between local and distributed mode). Also I tried to make ScriptState, JobStats, and PigStats as abstract as possible in its current state. 
I think in the future some work will need to be done here to perhaps re-evaluate the usage of ScriptState and the responsibilities of the different statistics classes. I haven't touched the PPNL, but I think more abstraction is needed here, perhaps in a separate patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749014#comment-13749014 ] Dmitriy V. Ryaboy commented on PIG-3419: Rohini, I want to reiterate that this patch has NO tez dependencies (if it does, that's a bug). The intention is not to make Tez possible. It's to make pluggable execution engines possible; and I do not want that functionality to be tied to a tez branch that will be unstable and in heavy development for the foreseeable future. This work will be immediately useful for the Spork (pig on spark) branch, for example. Also, it allows people to work with new runtimes *without modifying Pig*. So Tez-on-Pig doesn't even have to be done as a branch of this project, someone can go and experiment completely independently. For these reasons, I would like it in trunk. You make a great point about the danger of changing exceptions, public methods, etc. I believe that most of these are project-public, and annotated as such. Do you have specific methods you are concerned about? Ideally we would change as little as possible for the end user. Dmitriy Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Assignee: Achal Soni Priority: Minor Attachments: execengine.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_failures.txt, test_suite.patch, updated-8-22-2013-exec-engine.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13747065#comment-13747065 ] Dmitriy V. Ryaboy commented on PIG-3419: I'd like this patch in trunk since it's not Tez-specific, and allows people to experiment with other runtimes (for example, Spark or Drill). Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Assignee: Achal Soni Priority: Minor Attachments: execengine.patch, finalpatch.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_suite.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738285#comment-13738285 ] Dmitriy V. Ryaboy commented on PIG-3419: Hi Achal, That's a large patch. Can you give us a roadmap for reading it -- what are the changes, at a high level? It looks like you had to change a bunch of stuff that's not (at first glance) directly related to exec mode. Procedurally:
- please generate the patch using 'git diff --no-prefix' since the apache pig master is on svn
- please post the complete patch to Review Board, for ease of commenting
- please make sure that all new files have the apache license headers at the top
Thanks -D Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Priority: Minor Attachments: pluggable_execengine.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738288#comment-13738288 ] Dmitriy V. Ryaboy commented on PIG-3419: oh, 3 more things :)
- I thought you found your way around the -y argument? I still see that in there.
- Don't comment out blocks of code, just delete them.
- Add some documentation about creating new Exec Engines to the xml-based docs, or at least post it here. Just having it in javadocs is not sufficient.
Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Priority: Minor Attachments: pluggable_execengine.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13723065#comment-13723065 ] Dmitriy V. Ryaboy commented on PIG-3325: Urgh, you are right of course. I can move the .next() call into the for loop... but I wonder if that will slow us down again. Will check. Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Dmitriy V. Ryaboy Priority: Critical Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Assignee: Dmitriy V. Ryaboy (was: Mark Wagner) Status: Patch Available (was: Open) marking as patch available. please review. Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11.1, 0.11, 0.11.2 Reporter: Mark Wagner Assignee: Dmitriy V. Ryaboy Priority: Critical Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13696209#comment-13696209 ] Dmitriy V. Ryaboy commented on PIG-3325: Ok I started looking at this, will update with a patch shortly. In the meantime -- my benchmark shows Mark's patch improves perf on small bags of 20-100 elements, but causes extremely poor performance for large bags. I created a benchmark that does 100 rounds of creating a bag of N elements, for values of N in [1,20,100,1000]. These sets of 100 rounds are run 15 times each, performance of the first 5 is thrown out to account for system warmup / jit optimizations. Results:
||Num Tuples in Bag || Trunk avg || Patch 1 avg ||
| 1 | round: 0.00 | round: 0.00 |
| 20 | round: 0.01 | round: 0.00 |
| 100 | round: 0.13 | round: 0.00 |
| 1000 | round: 0.19 | round: 1.20 |
Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Attachment: PIG-3325.2.patch Updating with a patch. Results:
||Num Tuples in Bag || Trunk avg || Patch 1 avg || Patch 2 avg ||
| 1 | round: 0.00 | round: 0.00 | round: 0.00 |
| 20 | round: 0.01 | round: 0.00 | round: 0.00 |
| 100 | round: 0.13 | round: 0.00 | round: 0.00 |
| 1000 | round: 0.19 | round: 1.20 | round: 0.03 |
I also ran Mark's bench test in a loop 10 times (again, to account for jit effects). Results are as follows:
My Patch, Mark's test:
7050 ns 450 ns 440 ns 550 ns 440 ns 440 ns 440 ns 440 ns 440 ns 540 ns 410 ns 440 ns 440 ns 430 ns 460 ns
Trunk, Mark's test:
243240 ns 156640 ns 25440 ns 23470 ns 18930 ns 20710 ns 16890 ns 20210 ns 17630 ns 17900 ns 21420 ns 22550 ns 22900 ns 19800 ns 16770 ns
Mark's patch, Mark's Test:
8480 ns 2750 ns 2690 ns 2760 ns 3270 ns 3590 ns 6530 ns 5900 ns 6340 ns 5410 ns 5400 ns 5420 ns 5670 ns 5410 ns 5420 ns
Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.2.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Attachment: PIG-3325.3.patch Slight update -- resetting all counters on clear(), and getting rid of an unnecessarily long 10K tuple test. Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13695279#comment-13695279 ] Dmitriy V. Ryaboy commented on PIG-3015: +1 if we find more stuff, we can open other jiras. Let's get this into trunk. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-11.patch, PIG-3015-12.patch, PIG-3015-20May2013.diff, PIG-3015-22June2013.diff, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13688911#comment-13688911 ] Dmitriy V. Ryaboy commented on PIG-3325: [~mwagner] I was loading complex thrift structures that had bags in them. With old code (all bags register with SMM) this led to tons of weak references that needed to be cleaned out by the SMM; new code fixed that, but apparently created this other problem (which in practice on our workloads is not significant.. but your workloads may be different). Looking forward to Rohini's patch. Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689465#comment-13689465 ] Dmitriy V. Ryaboy commented on PIG-3325: What if instead of figuring out size based on the first 100 elements, we sampled first, 11th, 21st, etc until we get 100 samples? Would help with small bags (where accuracy of estimate doesn't matter as much). Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
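The "first, 11th, 21st, ..." suggestion is stride sampling. A minimal sketch, with plain longs standing in for per-tuple memory estimates (the array and the 64-byte figure are invented for illustration, not taken from Pig's bag code):

```java
public class StrideSampleDemo {
    // Instead of averaging only the first 100 tuples, sample the 1st, 11th,
    // 21st, ... element until 100 samples are taken, so the sample is spread
    // across the bag rather than concentrated at its start.
    public static double estimateAvg(long[] sizes) {
        long agg = 0;
        int samples = 0;
        for (int i = 0; i < sizes.length && samples < 100; i += 10) {
            agg += sizes[i];
            samples++;
        }
        return samples == 0 ? 0.0 : (double) agg / samples;
    }

    public static void main(String[] args) {
        long[] sizes = new long[1000];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = 64; // pretend every tuple is 64 bytes
        }
        System.out.println(estimateAvg(sizes)); // 64.0
    }
}
```

For a 1000-element bag this visits 100 elements spread over the whole bag; for a 20-element bag it visits only 2, which is why accuracy on small bags matters less here.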
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686100#comment-13686100 ] Dmitriy V. Ryaboy commented on PIG-3325: The previous behavior (having SMM check all bags) was pretty bad, it caused significant sudden delays if the data you were loading had bags in it. We observed pretty good speed gains for those use cases once we got rid of mandatory bag registration. Also got rid of a few memory leaks while we were in there, and the linked list maintenance overhead in SMM. Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679862#comment-13679862 ] Dmitriy V. Ryaboy commented on PIG-3325: [~mwagner] thanks for catching this perf regression. I only had time for a cursory look today -- why is the existing code O(n)? Seems like it sampled up to 100 elements and no more, so it's constant (once n=100). Seems to me like all that materially changed was that you added the sampling bit to add(). Unfortunately, a number of Bags override add() (see my notes in PIG-2923), which makes doing this in the default add() of the abstract function unreliable. Seems to me like a better approach would be to tackle the fact that for every time that getMemorySize() is called while there are fewer than 100 elements, we iterate over the whole bag (which is what you mean by O(n)?). We can do this by jumping directly to the mLastContentsSize'th element in the Bag, if we know the structure, or at least iterate to it without calling getMemorySize(), and then add to our running avg, rather than recomputing it. So, no resetting aggSampleTupleSize in your version, or avgTupleSize in mine, to 0 when sampling, just ignoring the first mLastContentsSize in the iterator. Thoughts? Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
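The proposal above, skip ahead past already-sampled elements and fold new ones into a running average instead of recomputing from index 0, can be sketched like this. The field names echo the comment's mLastContentsSize/avgTupleSize idea, but the class is illustrative, not Pig's bag implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class RunningAvgDemo {
    public static class SampledBag {
        private final List<Long> sizes = new ArrayList<>();
        private int mSampled = 0;       // elements already folded into the average
        private double avgTupleSize = 0;

        public void add(long size) { sizes.add(size); }

        // On each call, resume from mSampled instead of re-walking (and
        // re-measuring) the whole bag; cap sampling at 100 elements.
        public double getAvgSize() {
            int limit = Math.min(sizes.size(), 100);
            for (; mSampled < limit; mSampled++) {
                // Incremental mean: fold in one new sample without resetting.
                avgTupleSize += (sizes.get(mSampled) - avgTupleSize) / (mSampled + 1);
            }
            return avgTupleSize;
        }
    }

    public static void main(String[] args) {
        SampledBag bag = new SampledBag();
        bag.add(10); bag.add(30);
        System.out.println(bag.getAvgSize()); // 20.0
        bag.add(50);                          // only the new element is visited
        System.out.println(bag.getAvgSize()); // 30.0
    }
}
```

This makes repeated getMemorySize() calls amortized O(1) per added element, rather than O(n) over the bag each time.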
[jira] [Comment Edited] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679862#comment-13679862 ] Dmitriy V. Ryaboy edited comment on PIG-3325 at 6/10/13 8:23 PM: - [~mwagner] thanks for catching this perf regression. I only had time for a cursory look today -- why is the existing code O(N)? Seems like it sampled up to 100 elements and no more, so it's constant (once n=100). Seems to me like all that materially changed was that you added the sampling bit to add(). Unfortunately, a number of Bags override add() (see my notes in PIG-2923), which makes doing this in the default add() of the abstract class unreliable. Seems to me like a better approach would be to tackle the fact that every time getMemorySize() is called while there are fewer than 100 elements, we iterate over the whole bag (which is what you mean by O(N)?). We can do this by jumping directly to the mLastContentsSize'th element in the Bag, if we know the structure, or at least iterate to it without calling getMemorySize(), and then add to our running avg, rather than recomputing it. So, no resetting aggSampleTupleSize in your version, or avgTupleSize in mine, to 0 when sampling, just ignoring the first mLastContentsSize in the iterator. Thoughts?
Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation).
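The running-average idea sketched in the comment above can be illustrated with a hypothetical plain-Java sketch (the class and field names here are borrowed from the discussion, not taken from Pig's actual implementation): each newly sampled tuple size is folded into an incremental mean in O(1), instead of re-walking the sampled prefix of the bag and recomputing the average from scratch.

```java
public class RunningAvg {
    private long sampled = 0;        // tuples sampled so far (mLastContentsSize analog)
    private double avgTupleSize = 0; // running mean of sampled tuple sizes

    // Fold one new sample into the mean in O(1); no reset to 0,
    // no iteration over previously sampled elements.
    public void addSample(long tupleSizeBytes) {
        sampled++;
        avgTupleSize += (tupleSizeBytes - avgTupleSize) / sampled;
    }

    public double getAvgTupleSize() {
        return avgTupleSize;
    }
}
```

With this shape, getMemorySize() only ever pays for tuples it has not seen before, which is the behavior the comment argues for.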
[jira] [Commented] (PIG-3341) Improving performance of loading datetime values
[ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673780#comment-13673780 ] Dmitriy V. Ryaboy commented on PIG-3341: I don't think we are completely consistent, but turning invalid into null has been pretty standard. My personal preference is also to increment a counter for # of such conversions, and to log the first N occurrences (when N errors are encountered, log something to the effect of not logging this error any more because there's so much of it.) Improving performance of loading datetime values Key: PIG-3341 URL: https://issues.apache.org/jira/browse/PIG-3341 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.11.1 Reporter: pat chan Priority: Minor Fix For: 0.12, 0.11.2 The performance of loading datetime values can be improved by about 25% by moving a single line in ToDate.java: public static DateTimeZone extractDateTimeZone(String dtStr) { Pattern pattern = Pattern.compile("(Z|(?=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$"); should become: static Pattern pattern = Pattern.compile("(Z|(?=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$"); public static DateTimeZone extractDateTimeZone(String dtStr) { There is no need to recompile the regular expression for every value. I'm not sure if this function is ever called concurrently, but Pattern objects are thread-safe anyway. As a test, I created a file of 10M timestamps: for i in 0..1000 puts '2000-01-01T00:00:00+23' end I then ran this script: grunt A = load 'data' as (a:datetime); B = filter A by a is null; dump B; Before the change it took 160s. After the change, the script took 120s. Another performance improvement can be made for invalid datetime values. If a datetime value is invalid, an exception is created and thrown, which is a costly way to fail a validity check.
To test the performance impact, I created 10M invalid datetime values: for i in 0..1000 puts '2000-99-01T00:00:00+23' end In this test, the regex pattern was always recompiled. I then ran this script: grunt A = load 'data' as (a:datetime); B = filter A by a is not null; dump B; The script took 190s. I understand this could be considered an edge case and might not be worth changing. However, if there are use cases where invalid dates are part of normal processing, then you might consider fixing this.
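The precompilation proposed in the ticket can be sketched as follows. This is an illustrative example, not ToDate.java itself: the class name is invented and the pattern is a simplified timezone-suffix matcher rather than the exact regex quoted above. The key point stands regardless: java.util.regex.Pattern objects are immutable and thread-safe, so compiling once at class-load time is safe even under concurrent calls, while Matcher instances must stay method-local.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeZoneSuffix {
    // Compiled once at class-load time rather than on every call.
    // Simplified illustrative pattern: matches a trailing "Z" or a
    // +hh / -hh / +hh:mm style offset at the end of the string.
    private static final Pattern TZ =
        Pattern.compile("(Z|[+-]\\d{2}(:?\\d{2})?)$");

    // Returns the timezone suffix, or null when none is present.
    public static String extract(String dtStr) {
        Matcher m = TZ.matcher(dtStr); // Matcher is NOT thread-safe; keep it local
        return m.find() ? m.group() : null;
    }
}
```

Per-value Pattern.compile is where the reported ~25% load-time overhead comes from; hoisting it to a static field removes that cost without changing behavior.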
[jira] [Commented] (PIG-3198) Let users use any function from PigType - PigType as if it were builtin
[ https://issues.apache.org/jira/browse/PIG-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13636681#comment-13636681 ] Dmitriy V. Ryaboy commented on PIG-3198: Please add docs! Let users use any function from PigType - PigType as if it were builtin - Key: PIG-3198 URL: https://issues.apache.org/jira/browse/PIG-3198 Project: Pig Issue Type: Bug Reporter: Jonathan Coveney Assignee: Jonathan Coveney Fix For: 0.12 Attachments: PIG-3198-0.patch, PIG-3198-1.patch, PIG-3198-apache_header.patch This idea is an extension of PIG-2643. Ideally, someone should be able to call any function currently registered in Pig as if it were builtin.
[jira] [Assigned] (PIG-3284) Document PIG-3198 and PIG-2643
[ https://issues.apache.org/jira/browse/PIG-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3284: -- Assignee: Jonathan Coveney :-) Document PIG-3198 and PIG-2643 -- Key: PIG-3284 URL: https://issues.apache.org/jira/browse/PIG-3284 Project: Pig Issue Type: Task Reporter: Jonathan Coveney Assignee: Jonathan Coveney These improvements are quite useful, but only if people know that they exist.
[jira] [Commented] (PIG-3267) HCatStorer fail in limit query
[ https://issues.apache.org/jira/browse/PIG-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629601#comment-13629601 ] Dmitriy V. Ryaboy commented on PIG-3267: Should we apply this to 0.11 too? HCatStorer fail in limit query -- Key: PIG-3267 URL: https://issues.apache.org/jira/browse/PIG-3267 Project: Pig Issue Type: Bug Affects Versions: 0.9.2, 0.10.1, 0.11.1 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.12 Attachments: PIG-3267-1.patch The following query fail: {code} data = LOAD 'student.txt' as (name:chararray, age:int, gpa:double); data_limited = limit data 10; samples = foreach data_limited generate age as number; store samples into 'samples' using org.apache.hcatalog.pig.HCatStorer('part_dt=20130101T01T36'); {code} Error happens before launching the second job. Error message: {code} Message: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:8020/user/hive/warehouse/samples/part_dt=20130101T01T36 already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) at org.apache.hcatalog.mapreduce.FileOutputFormatContainer.checkOutputSpecs(FileOutputFormatContainer.java:135) at org.apache.hcatalog.mapreduce.HCatBaseOutputFormat.checkOutputSpecs(HCatBaseOutputFormat.java:72) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecsHelper(PigOutputFormat.java:207) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecs(PigOutputFormat.java:188) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) at 
org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.pig.backend.hadoop20.PigJobControl.mainLoopAction(PigJobControl.java:157) at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:134) at java.lang.Thread.run(Thread.java:680) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:257) {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3267) HCatStorer fail in limit query
[ https://issues.apache.org/jira/browse/PIG-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629603#comment-13629603 ] Dmitriy V. Ryaboy commented on PIG-3267: (+1) HCatStorer fail in limit query -- Key: PIG-3267 URL: https://issues.apache.org/jira/browse/PIG-3267 Project: Pig Issue Type: Bug Affects Versions: 0.9.2, 0.10.1, 0.11.1 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.12 Attachments: PIG-3267-1.patch
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621897#comment-13621897 ] Dmitriy V. Ryaboy commented on PIG-2769: Didn't see earlier that this only went into trunk (thanks [~knoguchi] for pointing this out!). We should put this into the 0.11 branch; maybe there will be an 0.11.2 before 0.12 comes out. a simple logic causes very long compiling time on pig 0.10.0 Key: PIG-2769 URL: https://issues.apache.org/jira/browse/PIG-2769 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.10.0 Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported) Reporter: Dan Li Assignee: Nick White Fix For: 0.12 Attachments: case1.tar, PIG-2769.0.patch, PIG-2769.1.patch, PIG-2769.2.patch, TEST-org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TestInputSizeReducerEstimator.txt We found the following simple logic will cause very long compiling time for pig 0.10.0, while using pig 0.8.1, everything is fine. A = load 'A.txt' using PigStorage() AS (m: int); B = FOREACH A { days_str = (chararray) (m == 1 ? 31: (m == 2 ? 28: (m == 3 ? 31: (m == 4 ? 30: (m == 5 ? 31: (m == 6 ? 30: (m == 7 ? 31: (m == 8 ? 31: (m == 9 ? 30: (m == 10 ? 31: (m == 11 ? 30:31))))))))))); GENERATE days_str as days_str; } store B into 'B'; and here's a simple input file example: A.txt 1 2 3 The pig version we used in the test: Apache Pig version 0.10.0-SNAPSHOT (rexported)
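Separately from the parser fix in the attached patches, a script-level way to sidestep this kind of pathological bincond nesting is to replace the 11-level ternary chain with a table lookup, for example inside a small UDF. The sketch below is illustrative only: the class and method names are invented, and it shows just the lookup logic in plain Java rather than a full Pig EvalFunc wrapper.

```java
public class DaysInMonth {
    // Days per month (non-leap year), replacing the nested
    // m == 1 ? 31 : (m == 2 ? 28 : ...) chain with a constant-time lookup.
    private static final int[] DAYS =
        {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

    // Mirrors the script's behavior: any month outside 1..12 falls
    // through to 31, like the final else-branch of the ternary chain.
    public static String daysStr(int m) {
        return (m >= 1 && m <= 12) ? Integer.toString(DAYS[m - 1]) : "31";
    }
}
```

A flat lookup keeps the logical plan shallow, so the compiler never has to walk a deeply nested expression tree at all.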
[jira] [Resolved] (PIG-3151) No documentation for Pig 0.10.1
[ https://issues.apache.org/jira/browse/PIG-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy resolved PIG-3151. Resolution: Won't Fix Release Note: we're past this now.. resolving so I can release the release in jira No documentation for Pig 0.10.1 --- Key: PIG-3151 URL: https://issues.apache.org/jira/browse/PIG-3151 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.10.1 Reporter: Russell Jurney Assignee: Daniel Dai Priority: Critical Fix For: 0.10.1 http://pig.apache.org/docs/r0.10.1/start.html is missing! http://pig.apache.org/docs/r0.10.0/start.html is there. Are there no docs for 0.10.1? Arg! :)
[jira] [Updated] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-2769: --- Fix Version/s: 0.11.2 a simple logic causes very long compiling time on pig 0.10.0 Key: PIG-2769 URL: https://issues.apache.org/jira/browse/PIG-2769 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.10.0 Reporter: Dan Li Assignee: Nick White Fix For: 0.12, 0.11.2
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622756#comment-13622756 ] Dmitriy V. Ryaboy commented on PIG-2769: in 0.11 branch now. a simple logic causes very long compiling time on pig 0.10.0 Key: PIG-2769 URL: https://issues.apache.org/jira/browse/PIG-2769 Project: Pig Issue Type: Bug Components: build Fix For: 0.12, 0.11.2
[jira] [Updated] (PIG-3264) mvn signanddeploy target broken for pigunit, pigsmoke and piggybank
[ https://issues.apache.org/jira/browse/PIG-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3264: --- Fix Version/s: 0.11.2 mvn signanddeploy target broken for pigunit, pigsmoke and piggybank --- Key: PIG-3264 URL: https://issues.apache.org/jira/browse/PIG-3264 Project: Pig Issue Type: Bug Reporter: Bill Graham Assignee: Bill Graham Fix For: 0.11.2 Attachments: PIG_3264.1.patch, PIG_3264_branch11.1.patch Build fails with: {noformat} [artifact:deploy] Invalid reference: 'pigunit' {noformat} Patch on the way.
[jira] [Commented] (PIG-3222) New UDFContextSignature assignments in Pig 0.11 breaks HCatalog.HCatStorer
[ https://issues.apache.org/jira/browse/PIG-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616999#comment-13616999 ] Dmitriy V. Ryaboy commented on PIG-3222: This is pretty confusing. Any ideas on how to fix this? Can we get away from the whole instantiation thing, and maybe keep an object registry? New UDFContextSignature assignments in Pig 0.11 breaks HCatalog.HCatStorer --- Key: PIG-3222 URL: https://issues.apache.org/jira/browse/PIG-3222 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.11 Reporter: Feng Peng Labels: hcatalog Attachments: PigStorerDemo.java Pig 0.11 assigns a different UDFContextSignature for different invocations of the same load/store statement. This change breaks HCatStorer, which assumes all front-end and back-end invocations of the same store statement have the same UDFContextSignature so that it can read the previously stored information correctly. The related HCatalog code is in https://svn.apache.org/repos/asf/incubator/hcatalog/branches/branch-0.5/hcatalog-pig-adapter/src/main/java/org/apache/hcatalog/pig/HCatStorer.java (the setStoreLocation() function).
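The failure mode described above can be made concrete with a toy model of a per-signature property store. This is an illustrative sketch in plain Java, not Pig's UDFContext implementation, and all names are invented: the front end stashes data under one signature, and the back end can only find it if it is handed the same signature.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class SignatureStore {
    // One Properties bag per signature, standing in for the
    // per-UDFContextSignature storage a storer reads and writes.
    private final Map<String, Properties> bySignature = new HashMap<>();

    // Returns the bag for this signature, creating an empty one on first use.
    public Properties forSignature(String signature) {
        return bySignature.computeIfAbsent(signature, s -> new Properties());
    }
}
```

If the back end is handed a different signature than the front end used (say "store_2" instead of "store_1"), the lookup yields a fresh empty Properties and the schema saved at plan time is silently missing, which is the breakage HCatStorer's setStoreLocation() hits.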
[jira] [Commented] (PIG-3258) Patch to allow MultiStorage to use more than one index to generate output tree
[ https://issues.apache.org/jira/browse/PIG-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611467#comment-13611467 ] Dmitriy V. Ryaboy commented on PIG-3258: please generate patch against the project root. Patch to allow MultiStorage to use more than one index to generate output tree -- Key: PIG-3258 URL: https://issues.apache.org/jira/browse/PIG-3258 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joel Fouse Priority: Minor Labels: piggybank I have made a patch to enable MultiStorage to handle multiple tuple indexes, rather than only one, for generating the output directory structure. Before I submit it, though, I need to know if I should generate the patch from /contrib/piggybank/java where I've been compiling and unit testing, or back at the project root.
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611472#comment-13611472 ] Dmitriy V. Ryaboy commented on PIG-2586: Do we need this given Ambrose (and from what I hear, Ambari)? What is the difference between what this proposes and what Ambrose does? https://github.com/twitter/ambrose There is an Ambrose patch to add inner plans, too: https://github.com/twitter/ambrose/issues/62 A better plan/data flow visualizer -- Key: PIG-2586 URL: https://issues.apache.org/jira/browse/PIG-2586 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Labels: gsoc2013 Pig supports a dot graph style plan to visualize the logical/physical/mapreduce plan (explain with -dot option, see http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). However, dot graph takes an extra step to generate the plan graph and the quality of the output is not good. It would be better to implement a better visualizer for Pig. It should: 1. show operator type and alias 2. turn on/off output schema 3. dive into foreach inner plan on demand 4. provide a way to show operator source code, eg, tooltip of an operator (plans don't currently have this information, but you can assume this is in place) 5. besides visualizing the logical/physical/mapreduce plan, visualizing the script itself is also useful 6. may rely on some java graphic library such as Swing This is a candidate project for Google summer of code 2013. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2013
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611478#comment-13611478 ] Dmitriy V. Ryaboy commented on PIG-2586: It does with the linked patch (it also visualizes the MR plan, without details of what's happening inside the map or reduce stage, without the patch). A better plan/data flow visualizer -- Key: PIG-2586 URL: https://issues.apache.org/jira/browse/PIG-2586 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Labels: gsoc2013
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611490#comment-13611490 ] Dmitriy V. Ryaboy commented on PIG-2586: Hm I guess we can add logical plan if we want -- just need to feed it to the PPNL somehow. Ambrose is pretty separate from Pig specifics, if you give it a dag, it'll draw it. Do people use the logical plan to diagnose issues? I don't think I have had to do that yet. A better plan/data flow visualizer -- Key: PIG-2586 URL: https://issues.apache.org/jira/browse/PIG-2586 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Labels: gsoc2013
[jira] [Commented] (PIG-3254) Fail a failed Pig script quicker
[ https://issues.apache.org/jira/browse/PIG-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608070#comment-13608070 ] Dmitriy V. Ryaboy commented on PIG-3254: Can I add a request for whoever will work on this ticket? Right now we die with MR Job Failed but don't say which job. In cases when multiple jobs are launched, one of them fails, the other ones are killed, and users find it hard to figure out which job was the cause of all badness. It would be nice to print out the job id of the failed job. Fail a failed Pig script quicker Key: PIG-3254 URL: https://issues.apache.org/jira/browse/PIG-3254 Project: Pig Issue Type: Improvement Reporter: Daniel Dai Fix For: 0.12 Credit to [~asitecn]. Currently Pig can launch several mapreduce jobs simultaneously. When one mapreduce job fails, we need to wait for the simultaneous mapreduce jobs to finish. In addition, we could potentially launch additional jobs which are doomed to fail. However, this is unnecessary in some cases: * If stop.on.failure==true, we can kill parallel jobs and fail the whole script * If stop.on.failure==false, and no store could succeed, we can also kill parallel jobs and fail the whole script Considering that simultaneous jobs may take a long time to finish, this could significantly improve the turnaround in some cases.
[jira] [Commented] (PIG-3132) NPE when illustrating a relation with HCatLoader
[ https://issues.apache.org/jira/browse/PIG-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604929#comment-13604929 ] Dmitriy V. Ryaboy commented on PIG-3132: +1 NPE when illustrating a relation with HCatLoader - Key: PIG-3132 URL: https://issues.apache.org/jira/browse/PIG-3132 Project: Pig Issue Type: Bug Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.12 Attachments: PIG-3132-1.patch Get NPE exception when illustrate a relation with HCatLoader: {code} A = LOAD 'studenttab10k' USING org.apache.hcatalog.pig.HCatLoader(); illustrate A; {code} Exception: {code} java.lang.NullPointerException at org.apache.hcatalog.pig.PigHCatUtil.transformToTuple(PigHCatUtil.java:274) at org.apache.hcatalog.pig.PigHCatUtil.transformToTuple(PigHCatUtil.java:238) at org.apache.hcatalog.pig.HCatBaseLoader.getNext(HCatBaseLoader.java:61) at org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:210) at org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:190) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:129) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:194) at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257) at org.apache.pig.pen.ExampleGenerator.readBaseData(ExampleGenerator.java:222) at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:154) at org.apache.pig.PigServer.getExamples(PigServer.java:1245) at 
org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:698) at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:591) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:306) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:67) {code} HCatalog side is tracked with HCATALOG-163.
[jira] [Commented] (PIG-3208) [zebra] TFile should not set io.compression.codec.lzo.buffersize
[ https://issues.apache.org/jira/browse/PIG-3208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13605765#comment-13605765 ] Dmitriy V. Ryaboy commented on PIG-3208: [~daijy] why wouldn't we commit fixes provided by the community? [zebra] TFile should not set io.compression.codec.lzo.buffersize Key: PIG-3208 URL: https://issues.apache.org/jira/browse/PIG-3208 Project: Pig Issue Type: Bug Reporter: Eugene Koontz Assignee: Eugene Koontz Attachments: PIG-3208.patch In contrib/zebra/src/java/org/apache/hadoop/zebra/tfile/Compression.java, the following occurs: {code} conf.setInt("io.compression.codec.lzo.buffersize", 64 * 1024); {code} This can cause the LZO decompressor, if called within the context of reading TFiles, to return with an error code when trying to uncompress LZO-compressed data, if the data's compressed size is too large to fit in 64 * 1024 bytes. For example, the Hadoop-LZO code uses a different default value (256 * 1024): https://github.com/twitter/hadoop-lzo/blob/master/src/java/com/hadoop/compression/lzo/LzoCodec.java#L185 This can lead to a case where, if data is compressed on a cluster where the default {{io.compression.codec.lzo.buffersize}} = 256*1024 is used, and code then tries to read this data through Pig's zebra, the Mapper will exit with code 134 because the LZO decompressor returns -4 (which encodes the LZO C library error LZO_E_INPUT_OVERRUN) when trying to uncompress the data.
The stack trace of such a case is shown below: {code} 2013-02-17 14:47:50,709 INFO com.hadoop.compression.lzo.LzoCodec: Creating stream for compressor: com.hadoop.compression.lzo.LzoCompressor@6818c458 with bufferSize: 262144 2013-02-17 14:47:50,849 INFO org.apache.hadoop.io.compress.CodecPool: Paying back codec: com.hadoop.compression.lzo.LzoCompressor@6818c458 2013-02-17 14:47:50,849 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3 2013-02-17 14:47:50,857 INFO org.apache.hadoop.io.compress.CodecPool: Borrowing codec: com.hadoop.compression.lzo.LzoCompressor@6818c458 2013-02-17 14:47:50,866 INFO com.hadoop.compression.lzo.LzoCodec: Creating stream for compressor: com.hadoop.compression.lzo.LzoCompressor@6818c458 with bufferSize: 262144 2013-02-17 14:47:50,879 INFO org.apache.hadoop.io.compress.CodecPool: Paying back codec: com.hadoop.compression.lzo.LzoCompressor@6818c458 2013-02-17 14:47:50,879 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4 2013-02-17 14:47:50,887 INFO org.apache.hadoop.mapred.Merger: Merging 5 sorted segments 2013-02-17 14:47:50,890 INFO org.apache.hadoop.io.compress.CodecPool: Borrowing codec: com.hadoop.compression.lzo.LzoDecompressor@66a23610 2013-02-17 14:47:50,891 INFO com.hadoop.compression.lzo.LzoDecompressor: calling decompressBytesDirect with buffer with: position: 0 and limit: 262144 2013-02-17 14:47:50,891 INFO com.hadoop.compression.lzo.LzoDecompressor: read: 245688 bytes from decompressor. 
2013-02-17 14:47:50,891 INFO org.apache.hadoop.io.compress.CodecPool: Borrowing codec: com.hadoop.compression.lzo.LzoDecompressor@43684706 2013-02-17 14:47:50,892 INFO com.hadoop.compression.lzo.LzoDecompressor: calling decompressBytesDirect with buffer with: position: 0 and limit: 65536 2013-02-17 14:47:50,895 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2013-02-17 14:47:50,897 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.InternalError: lzo1x_decompress returned: -4 at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method) at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:307) at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82) at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75) at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:341) at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:371) at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:355) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:387) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at
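The root cause is an invariant that both sides of the pipeline must share: a block produced against the writer's buffer size has to fit in the reader's buffer. A minimal, dependency-free sketch of that invariant (the specific sizes come from this report; the helper is illustrative, not zebra code):

```java
public class LzoBufferCheck {
    /** True when a block of the given size fits the reader-side buffer. */
    static boolean fits(int blockSize, int readerBufferBytes) {
        return blockSize <= readerBufferBytes;
    }

    public static void main(String[] args) {
        int writerBuffer = 256 * 1024; // hadoop-lzo default on the writing cluster
        int readerBuffer = 64 * 1024;  // value hard-coded by zebra's Compression.java
        int block = 245688;            // a block size observed in the log above

        System.out.println(fits(block, writerBuffer)); // fits the writer's buffer
        System.out.println(fits(block, readerBuffer)); // overruns the reader's buffer
    }
}
```

The fix direction suggested by the issue title follows directly: zebra should not override the buffer size at all, so the reader picks up the same cluster-wide value the writer used.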
[jira] [Commented] (PIG-2388) Make shim for Hadoop 0.20 and 0.23 support dynamic
[ https://issues.apache.org/jira/browse/PIG-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605849#comment-13605849 ] Dmitriy V. Ryaboy commented on PIG-2388: Hive does this, and back in the day there was a patch that did this for Pig and hadoop 18 vs hadoop 20. Should be doable, though it'll take work. Make shim for Hadoop 0.20 and 0.23 support dynamic -- Key: PIG-2388 URL: https://issues.apache.org/jira/browse/PIG-2388 Project: Pig Issue Type: Improvement Affects Versions: 0.9.2, 0.10.0 Reporter: Thomas Weise Fix For: 0.9.2, 0.10.0 Attachments: PIG-2388_branch-0.9.patch We need a single Pig installation that works with both Hadoop versions. The current shim implementation assumes different builds for each version. We can solve this statically through an internal build/installation system, or by making the shim dynamic so that pig.jar will work on both versions with runtime detection. The attached patch converts the static shims into a shim interface with two implementations, each of which is compiled against the respective Hadoop version and included in a single pig.jar (similar to what Hive does). The default build behavior remains unchanged: only the shim for ${hadoopversion} will be compiled. Both shims can be built via: ant -Dbuild-all-shims=true
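The interface-plus-runtime-detection pattern described above can be sketched as follows. All names here are hypothetical illustrations of the pattern, not the types in the attached patch; the counter group names are the usual 0.20/0.23 ones but are included only as example shim behavior:

```java
// One shim interface, one implementation per Hadoop line, selected at runtime.
interface HadoopShim {
    String counterGroupName();
}

class Hadoop20Shim implements HadoopShim {
    public String counterGroupName() { return "org.apache.hadoop.mapred.Task$Counter"; }
}

class Hadoop23Shim implements HadoopShim {
    public String counterGroupName() { return "org.apache.hadoop.mapreduce.TaskCounter"; }
}

public class ShimLoader {
    /**
     * Picks the shim for the Hadoop version detected at runtime. In a real
     * build, the version string would come from the Hadoop jars on the
     * classpath (e.g. VersionInfo.getVersion()).
     */
    public static HadoopShim loadShim(String hadoopVersion) {
        return hadoopVersion.startsWith("0.20")
                ? new Hadoop20Shim()
                : new Hadoop23Shim();
    }

    public static void main(String[] args) {
        System.out.println(loadShim("0.20.2").getClass().getSimpleName());
        System.out.println(loadShim("0.23.1").getClass().getSimpleName());
    }
}
```

Since both implementations ship in the same pig.jar, each must be compiled against its own Hadoop version at build time, which is exactly what the -Dbuild-all-shims=true target above provides.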
[jira] [Commented] (PIG-3194) Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
[ https://issues.apache.org/jira/browse/PIG-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603984#comment-13603984 ] Dmitriy V. Ryaboy commented on PIG-3194: +1 Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2 --- Key: PIG-3194 URL: https://issues.apache.org/jira/browse/PIG-3194 Project: Pig Issue Type: Bug Affects Versions: 0.11 Reporter: Kai Londenberg Assignee: Prashant Kommireddi Fix For: 0.11.1 Attachments: PIG-3194_2.patch, PIG-3194.patch The changes to ObjectSerializer.java in the following commit http://svn.apache.org/viewvc?view=revision&revision=1403934 break compatibility with Hadoop 0.20.2 clusters. The reason is that the code uses methods from Apache Commons Codec 1.4 which are not available in Apache Commons Codec 1.3, which ships with Hadoop 0.20.2. The offending methods are Base64.decodeBase64(String) and Base64.encodeBase64URLSafeString(byte[]). If I revert these changes, Pig 0.11.0 candidate 2 works well with our Hadoop 0.20.2 clusters.
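One codec-1.3-compatible direction is to call only the byte[] overloads that exist in both versions and perform the URL-safe alphabet translation by hand. The sketch below demonstrates that translation itself; it uses the JDK's java.util.Base64 purely for illustration (commons-codec is not assumed on the classpath), and the helper names are hypothetical:

```java
import java.util.Base64;

public class Base64Compat {
    /**
     * URL-safe encoding as codec 1.4's encodeBase64URLSafeString produces it:
     * the standard alphabet with '+' -> '-', '/' -> '_', and no trailing
     * padding. On codec 1.3 the same result is obtainable from the
     * encodeBase64(byte[]) overload plus this translation.
     */
    static String encodeUrlSafe(byte[] data) {
        String standard = Base64.getEncoder().encodeToString(data);
        return standard.replace('+', '-').replace('/', '_').replace("=", "");
    }

    /** Decoding that tolerates both alphabets, as decodeBase64(String) does in 1.4. */
    static byte[] decode(String encoded) {
        String standard = encoded.replace('-', '+').replace('_', '/');
        return Base64.getDecoder().decode(padded(standard));
    }

    /** Restores the '=' padding stripped by URL-safe encoding. */
    private static String padded(String s) {
        int rem = s.length() % 4;
        return rem == 0 ? s : s + "====".substring(rem);
    }

    public static void main(String[] args) {
        byte[] payload = new byte[] { (byte) 0xfb, (byte) 0xff, 0x3e };
        // These three bytes hit both special characters of the standard alphabet.
        System.out.println(encodeUrlSafe(payload)); // prints -__-
    }
}
```

The committed fix took the simpler route of bundling a newer commons-codec, but the round-trip above shows why the two 1.4 methods are convenience wrappers rather than new functionality.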
[jira] [Updated] (PIG-3245) Documentation about HBaseStorage
[ https://issues.apache.org/jira/browse/PIG-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3245: --- Status: Patch Available (was: Open) Documentation about HBaseStorage Key: PIG-3245 URL: https://issues.apache.org/jira/browse/PIG-3245 Project: Pig Issue Type: Improvement Components: documentation Affects Versions: 0.11 Reporter: Daisuke Kobayashi Attachments: PIG-3245.patch HBaseStorage always disables split combination. This should be documented explicitly.
[jira] [Commented] (PIG-3241) ConcurrentModificationException in POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600602#comment-13600602 ] Dmitriy V. Ryaboy commented on PIG-3241: I think I have a clean fix; Lohit and I are testing. ConcurrentModificationException in POPartialAgg --- Key: PIG-3241 URL: https://issues.apache.org/jira/browse/PIG-3241 Project: Pig Issue Type: Bug Affects Versions: 0.11 Reporter: Lohit Vijayarenu Priority: Blocker Fix For: 0.12, 0.11.1 While running a few Pig scripts against Hadoop 2.0, I consistently see a ConcurrentModificationException {noformat} at java.util.HashMap$HashIterator.remove(HashMap.java:811) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregate(POPartialAgg.java:365) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregateSecondLevel(POPartialAgg.java:379) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.getNext(POPartialAgg.java:203) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:729) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153) {noformat} It looks like rawInputMap is being modified while elements are being removed from it.
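HashMap iterators are fail-fast: any structural modification that happens behind the iterator's back, whether from another thread (as in POPartialAgg, where the SpillableMemoryManager thread touches the map) or from the same thread, makes the next iterator operation throw the exception in the trace. A minimal single-threaded repro of the mechanism:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class CmeRepro {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        Iterator<String> it = map.keySet().iterator();
        it.next();
        map.put("c", 3); // structural modification not done through the iterator

        try {
            it.next(); // fail-fast check detects the modification
        } catch (ConcurrentModificationException e) {
            System.out.println("CME, as in the trace above");
        }
    }
}
```

Note that the check is best-effort: across threads it may fire, corrupt the map silently, or loop, which is why the real fix (below in this thread) confines all map access to one thread rather than relying on catching the exception.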
[jira] [Assigned] (PIG-3241) ConcurrentModificationException in POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3241: -- Assignee: Dmitriy V. Ryaboy ConcurrentModificationException in POPartialAgg --- Key: PIG-3241 URL: https://issues.apache.org/jira/browse/PIG-3241 Project: Pig Issue Type: Bug Affects Versions: 0.11 Reporter: Lohit Vijayarenu Assignee: Dmitriy V. Ryaboy Priority: Blocker Fix For: 0.12, 0.11.1 While running a few Pig scripts against Hadoop 2.0, I consistently see a ConcurrentModificationException {noformat} at java.util.HashMap$HashIterator.remove(HashMap.java:811) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregate(POPartialAgg.java:365) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregateSecondLevel(POPartialAgg.java:379) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.getNext(POPartialAgg.java:203) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:729) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153) {noformat} It looks like rawInputMap is being modified while elements are being removed from it.
[jira] [Updated] (PIG-3241) ConcurrentModificationException in POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3241: --- Attachment: PIG-3241.patch Attaching patch. Rather than synchronize all memory access, I decided to simply avoid concurrent access altogether. spill(), called by the Spillable Memory Manager, used to set up the iterator used for spilling - that involved looking at the primary and secondary maps, applying the combiner to them, doing all kinds of things -- all in the SMM thread. Instead, we now only set the doSpill flag in spill(), and do the work in the main thread, which is now the only thread that can modify iterators and hashmaps. Most of this patch is just whitespace changes :). ConcurrentModificationException in POPartialAgg --- Key: PIG-3241 URL: https://issues.apache.org/jira/browse/PIG-3241 Project: Pig Issue Type: Bug Affects Versions: 0.11 Reporter: Lohit Vijayarenu Assignee: Dmitriy V. Ryaboy Priority: Blocker Fix For: 0.12, 0.11.1 Attachments: PIG-3241.patch While running a few Pig scripts against Hadoop 2.0, I consistently see a ConcurrentModificationException {noformat} at java.util.HashMap$HashIterator.remove(HashMap.java:811) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregate(POPartialAgg.java:365) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregateSecondLevel(POPartialAgg.java:379) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.getNext(POPartialAgg.java:203) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:729) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153) {noformat} It looks like rawInputMap is being modified while elements are being removed from it.
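The flag-handoff described in the patch comment above can be sketched as follows: the memory-manager thread only raises a request flag, and the operator's single processing thread does the actual spill work, so maps and iterators are only ever touched from one thread. Names are illustrative, not POPartialAgg's actual fields:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpillFlagSketch {
    // Written by the SMM thread, read-and-cleared by the processing thread.
    private final AtomicBoolean doSpill = new AtomicBoolean(false);
    private int spillCount = 0;

    /** Called from the SpillableMemoryManager thread: only request a spill. */
    public void spill() {
        doSpill.set(true);
    }

    /** Called from the main processing thread between records. */
    public void getNext() {
        if (doSpill.getAndSet(false)) {
            // Safe: only this thread mutates the aggregation maps and iterators.
            spillCount++;
        }
    }

    public int getSpillCount() {
        return spillCount;
    }

    public static void main(String[] args) {
        SpillFlagSketch op = new SpillFlagSketch();
        op.spill();   // memory manager asks for a spill
        op.getNext(); // processing thread notices the flag and performs it
        op.getNext(); // flag already cleared; no second spill
        System.out.println(op.getSpillCount()); // prints 1
    }
}
```

The trade-off of this design is latency: memory is not freed at the instant the SMM asks, only when the processing thread next checks the flag, in exchange for eliminating the shared-state race entirely.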
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600645#comment-13600645 ] Dmitriy V. Ryaboy commented on PIG-3015: Serious question: is there a reason to put this in Pig rather than keep it elsewhere, where you can iterate without being tied to Pig's release cycle? Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-11.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.