[jira] [Updated] (PIG-4059) Pig on Spark
[ https://issues.apache.org/jira/browse/PIG-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-4059: --- Labels: spork (was: ) Pig on Spark Key: PIG-4059 URL: https://issues.apache.org/jira/browse/PIG-4059 Project: Pig Issue Type: New Feature Reporter: Rohini Palaniswamy Assignee: Praveen Rachabattuni Labels: spork Attachments: Pig-on-Spark-Design-Doc.pdf There is a lot of interest in adding Spark as a backend execution engine for Pig. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067289#comment-14067289 ] Dmitriy V. Ryaboy commented on PIG-3558: Nice. How much does this increase the weight of the pig build, and what packages does it pull in? I assume this won't get pushed to trunk until hive 0.14.0-SNAPSHOT becomes available as a stable version? ORC support for Pig --- Key: PIG-3558 URL: https://issues.apache.org/jira/browse/PIG-3558 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Assignee: Daniel Dai Labels: porc Fix For: 0.14.0 Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch, PIG-3558-4.patch, PIG-3558-5.patch, PIG-3558-6.patch Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3558: --- Labels: porc (was: ) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904574#comment-13904574 ] Dmitriy V. Ryaboy commented on PIG-3558: I am pro adding ORC support in Pig, but against introducing massive dependencies. According to http://mvnrepository.com/artifact/org.apache.hive/hive-exec/0.12.0 the hive-exec jar for 0.12 is 9 megs, and hides within it specific versions of jackson, snappy, org.json, chunks of thrift, hadoop.io (?!), avro, commons, protobuf, and guava. If ORC authors are not interested in reducing their dependency hygiene, they have to live with the fact that their project is unlikely to get integrated into other projects. This is self-inflicted jar hell. Please don't do this. When ORC cleans up their dependencies, let's revisit. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904574#comment-13904574 ] Dmitriy V. Ryaboy edited comment on PIG-3558 at 2/18/14 8:55 PM: I am pro adding ORC support in Pig, but against introducing massive dependencies. According to http://mvnrepository.com/artifact/org.apache.hive/hive-exec/0.12.0 the hive-exec jar for 0.12 is 9 megs, and hides within it specific versions of jackson, snappy, org.json, chunks of thrift, hadoop.io (?!), avro, commons, protobuf, and guava. If ORC authors are not interested in improving their dependency hygiene, they have to live with the fact that their project is unlikely to get integrated into other projects. This is self-inflicted jar hell. Please don't do this. When ORC cleans up their dependencies, let's revisit. was (Author: dvryaboy): the same comment, with "reducing their dependency hygiene" in place of "improving". -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904590#comment-13904590 ] Dmitriy V. Ryaboy commented on PIG-3558: So that's a -1. I would +1 this if it were going into piggybank. Since this depends on unpublished changes, I'd rather we unlink it from the 0.13 release (as that would tie us to Hive's release schedule -- obviously we can't make a release that depends on a snapshot). -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904627#comment-13904627 ] Dmitriy V. Ryaboy commented on PIG-3558: [~daijy] not quite:
{code}
- conf="test->master"/>
+ conf="compile->master"/>
{code}
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904663#comment-13904663 ] Dmitriy V. Ryaboy commented on PIG-3558: Help me understand this. My understanding is as follows: Compile is minimum required to compile main code. Test is minimum required to compile main code + stuff needed to test (hence, extends). Pushing a dependency up to compile means everything, not just test, needs the dependency. Also, the bump from 0.8 to 0.12 is 6 megs worth of code. That's a pretty big version bump. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
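The compile/test distinction in the comment above maps onto standard Ivy configuration inheritance. A hypothetical ivy.xml fragment (names and revisions illustrative, not Pig's actual build file) sketching those semantics:

```xml
<!-- "test" extends "compile", so everything on the compile path is also on the
     test path, but a dependency declared only under test never reaches the
     main build artifacts. -->
<configurations>
  <conf name="compile" description="minimum needed to build main code"/>
  <conf name="test" extends="compile" description="compile deps plus test-only deps"/>
</configurations>
<dependencies>
  <!-- conf="test->master" keeps hive-exec off the compile path;
       changing it to compile->master makes every consumer pull it in -->
  <dependency org="org.apache.hive" name="hive-exec" rev="0.12.0" conf="test->master"/>
</dependencies>
```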
[jira] [Commented] (PIG-3456) Reduce threadlocal conf access in backend for each record
[ https://issues.apache.org/jira/browse/PIG-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13896199#comment-13896199 ] Dmitriy V. Ryaboy commented on PIG-3456: Added a couple minor comments. Good change overall. BTW not sure if you saw, but PIG-3325 addressed the bag insertion regression you saw as a side effect of PIG-2923 without sacrificing the memory and gc benefits 2923 provides, so if you still have that reverted in your build, consider un-reverting. Reduce threadlocal conf access in backend for each record - Key: PIG-3456 URL: https://issues.apache.org/jira/browse/PIG-3456 Project: Pig Issue Type: Improvement Affects Versions: 0.11.1 Reporter: Rohini Palaniswamy Assignee: Rohini Palaniswamy Fix For: 0.13.0 Attachments: PIG-3456-1-no-whitespace.patch, PIG-3456-1.patch Noticed a few things while browsing code: 1) DefaultTuple has a protected boolean isNull = false; which is never used. Removing this gives ~3-5% improvement for big jobs. 2) Config checking with ThreadLocal conf is repeatedly done for each record, e.g. createDataBag in POCombinerPackage, but is initialized only once in other places like POPackage, POJoinPackage, etc. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3456) Reduce threadlocal conf access in backend for each record
[ https://issues.apache.org/jira/browse/PIG-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887098#comment-13887098 ] Dmitriy V. Ryaboy commented on PIG-3456: Could you post a patch without the whitespace changes (for ease of review) and some microbenchmark results? I had some microbenchmark code in PIG-3325, that might help bootstrap you here. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
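The per-record cost this ticket targets is easy to see in isolation. Below is a minimal, self-contained microbenchmark sketch (not the actual PIG-3325 harness mentioned above) contrasting a ThreadLocal lookup on every record with reading the value once and caching it, which is the pattern the ticket proposes:

```java
// Contrast a ThreadLocal lookup per record against a cached value.
// The ThreadLocal stands in for Pig's thread-local conf; 42 is a dummy value.
public class ThreadLocalBench {
    static final ThreadLocal<Integer> CONF = ThreadLocal.withInitial(() -> 42);

    // The pattern the ticket wants to remove: a lookup for every record.
    static long sumWithLookupPerRecord(int records) {
        long sum = 0;
        for (int i = 0; i < records; i++) sum += CONF.get();
        return sum;
    }

    // The proposed pattern: read once at init, reuse for every record.
    static long sumWithCachedValue(int records) {
        long sum = 0;
        int cached = CONF.get();
        for (int i = 0; i < records; i++) sum += cached;
        return sum;
    }

    public static void main(String[] args) {
        int records = 10_000_000;
        long t0 = System.nanoTime();
        long a = sumWithLookupPerRecord(records);
        long t1 = System.nanoTime();
        long b = sumWithCachedValue(records);
        long t2 = System.nanoTime();
        System.out.printf("lookup-per-record: %d ms, cached: %d ms (sums %d/%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, b);
    }
}
```

Both loops compute the same sum, so the only difference being timed is the lookup; a real harness would also warm up the JIT before measuring.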
[jira] [Updated] (PIG-3672) pig should not hardcode hdfs:// path in code, should be configurable to other file system implementations
[ https://issues.apache.org/jira/browse/PIG-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3672: --- Status: Open (was: Patch Available) cancelling patch available status given Rohini's comments -- please make patch available again when a new patch is submitted pig should not hardcode hdfs:// path in code, should be configurable to other file system implementations --- Key: PIG-3672 URL: https://issues.apache.org/jira/browse/PIG-3672 Project: Pig Issue Type: Bug Components: data, parser Affects Versions: 0.11.1, 0.12.0, 0.10.0 Reporter: Suhas Satish Assignee: Suhas Satish Attachments: PIG-3672-1.patch, PIG-3672-2.patch, PIG-3672.patch QueryParserUtils.java has the code - result.add("hdfs://" + thisHost + ":" + uri.getPort()); I propose to make it generic like - result.add(uri.getScheme() + "://" + thisHost + ":" + uri.getPort()); Similarly JobControlCompiler.java has - if (!outputPathString.contains("://") || outputPathString.startsWith("hdfs://")) { I have a patch version which I ran passing unit tests on. Will be uploading it shortly. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3299) Provide support for LazyOutputFormat to avoid creating empty files
[ https://issues.apache.org/jira/browse/PIG-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887109#comment-13887109 ] Dmitriy V. Ryaboy commented on PIG-3299: [~daijy] shall we commit this? Provide support for LazyOutputFormat to avoid creating empty files -- Key: PIG-3299 URL: https://issues.apache.org/jira/browse/PIG-3299 Project: Pig Issue Type: Improvement Affects Versions: 0.11.1 Reporter: Rohini Palaniswamy Assignee: Lorand Bendig Attachments: PIG-3299.patch LazyOutputFormat (HADOOP-4927) in hadoop is a wrapper to avoid creating part files if there is no records output. It would be good to add support for that by having a configuration in pig which wraps storeFunc.getOutputFormat() with LazyOutputFormat. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3347: --- Priority: Critical (was: Major) Store invocation brings side effect --- Key: PIG-3347 URL: https://issues.apache.org/jira/browse/PIG-3347 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.11 Environment: local mode Reporter: Sergey Assignee: Daniel Dai Priority: Critical Fix For: 0.12.1 Attachments: PIG-3347-1.patch The problem is that an intermediate 'store' invocation changes the final store output. Looks like it brings some kind of side effect. We used 'local' mode to run the script. Here is the input data: 1 1 Here is the script:
{code}
a = load 'test';
a_group = group a by $0;
b = foreach a_group {
  a_distinct = distinct a.$0;
  generate group, a_distinct;
}
--store b into 'b';
c = filter b by SIZE(a_distinct) == 1;
store c into 'out';
{code}
We expect the output to be: 1 1 The output is an empty file. Uncomment the {code}--store b into 'b';{code} line and see the difference. You would get the expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887114#comment-13887114 ] Dmitriy V. Ryaboy commented on PIG-3347: Yikes. [~aniket486] [~julienledem] this seems like a critical bug to look at. Julien, you investigated this UID situation before, right? -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache
[ https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13883478#comment-13883478 ] Dmitriy V. Ryaboy commented on PIG-2672: [~knoguchi] in the spirit of keeping things moving -- can we commit this? You can feel free to turn the behavior off on your cluster if you are worried about the 1 week boundary. If that's the case, feel free to open another ticket to follow up, or to make sure that YARN-1492 fixes your issue. Optimize the use of DistributedCache Key: PIG-2672 URL: https://issues.apache.org/jira/browse/PIG-2672 Project: Pig Issue Type: Improvement Reporter: Rohini Palaniswamy Fix For: 0.13.0 Attachments: PIG-2672-5.patch, PIG-2672.patch Pig currently copies jar files to a temporary location in hdfs and then adds them to DistributedCache for each job launched. This is inefficient in terms of:
* Space - The jars are distributed to task trackers for every job, taking up a lot of local temporary space on tasktrackers.
* Performance - The jar distribution impacts the job launch time.
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache
[ https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13880084#comment-13880084 ] Dmitriy V. Ryaboy commented on PIG-2672: Seems like there is a lot of effort being spent here reinventing what is already designed for the general use case in the YARN ticket Aniket linked. Let's not let the best be the enemy of the good, and just get something in that will be decent for most cases; if people don't like it, they can turn it off. This is an intermediate solution until that YARN patch goes in, at which point all of this becomes moot. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852133#comment-13852133 ] Dmitriy V. Ryaboy commented on PIG-3630: Is this an AvroStorage or data issue?
grunt> import '/Users/dmitriy/tmp/tf_idf.macro';
grunt> register build/ivy/lib/Pig/avro-1.7.4.jar
grunt> register build/ivy/lib/Pig/json-simple-1.1.jar
grunt> register contrib/piggybank/java/piggybank.jar
grunt> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
grunt> emails = load '/Users/dmitriy/Downloads/enron.avro';
grunt> describe emails
Schema for emails unknown.
(this is the same in both pig 0.11 and pig 0.12). Can you provide a simple reproducible use case that doesn't involve Avro, etc? Can you share what debugging you've done so far? Macros that work in Pig 0.11 fail in Pig 0.12 :( Key: PIG-3630 URL: https://issues.apache.org/jira/browse/PIG-3630 Project: Pig Issue Type: Bug Components: parser Affects Versions: 0.12.0 Reporter: Russell Jurney http://my.safaribooksonline.com/book/databases/9781449326890/7dot-exploring-data-with-reports/i_sect13_id196600_html The ntf-idf macro listed there works under 0.11. Under 0.12, it results in this:
13/12/16 22:09:19 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2013-12-16 22:09:19,159 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0-SNAPSHOT (rUnversioned directory) compiled Dec 09 2013, 14:37:29
2013-12-16 22:09:19,159 [main] INFO org.apache.pig.Main - Logging error messages to: /private/tmp/pig_1387260559120.log
2013-12-16 22:09:19.268 java[38060:1903] Unable to load realm info from SCDynamicStore
2013-12-16 22:09:19,528 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: at expanding macro 'tf_idf' (per_business.pig:9) file per_business.pig, line 35, column 17 Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.plan.PlanValidationException: ERROR 1025: at expanding macro 'tf_idf' (per_business.pig:9) file per_business.pig, line 35, column 17 Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:191)
at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:174)
at org.apache.pig.newplan.logical.visitor.ColumnAliasConversionVisitor$1.visit(ColumnAliasConversionVisitor.java:53)
at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:215)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:142)
at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:128)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:124)
at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:76)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1694)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1686)
at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1387)
at org.apache.pig.PigServer.execute(PigServer.java:1302)
at org.apache.pig.PigServer.executeBatch(PigServer.java:391)
at org.apache.pig.PigServer.executeBatch(PigServer.java:369)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:133)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:195)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:600)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852221#comment-13852221 ] Dmitriy V. Ryaboy commented on PIG-3630: Sure enough. Once I add that, everything works in 0.12 and now I can't reproduce the bug you are reporting. My pig is:
[tw-mbp13-dryaboy-2 pig-0.12]$ ./bin/pig -version
Apache Pig version 0.12.0-SNAPSHOT (r1526044) compiled Dec 18 2013, 12:15:04
same with more recent:
[tw-mbp13-dryaboy-2 pig-0.12]$ ./bin/pig -version
Apache Pig version 0.12.1-SNAPSHOT (r1552124) compiled Dec 18 2013, 14:00:21
Back to you to get a reproducible test case. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852272#comment-13852272 ] Dmitriy V. Ryaboy commented on PIG-3630: That one fails in both 0.11 and 0.12. Do you have something that works in 11 but fails in 12? -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852281#comment-13852281 ] Dmitriy V. Ryaboy commented on PIG-3630:

Actually that failed in 11 due to missing register statements. It does work in 11 if you work around the Avro stuff. Ok, now we have something to look at...

Macros that work in Pig 0.11 fail in Pig 0.12 :(
Key: PIG-3630
URL: https://issues.apache.org/jira/browse/PIG-3630
Project: Pig
Issue Type: Bug
Components: parser
Affects Versions: 0.12.0
Reporter: Russell Jurney

http://my.safaribooksonline.com/book/databases/9781449326890/7dot-exploring-data-with-reports/i_sect13_id196600_html

The ntf-idf macro listed there works under 0.11. Under 0.12, it results in this:

{noformat}
13/12/16 22:09:19 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
2013-12-16 22:09:19,159 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0-SNAPSHOT (rUnversioned directory) compiled Dec 09 2013, 14:37:29
2013-12-16 22:09:19,159 [main] INFO org.apache.pig.Main - Logging error messages to: /private/tmp/pig_1387260559120.log
2013-12-16 22:09:19.268 java[38060:1903] Unable to load realm info from SCDynamicStore
2013-12-16 22:09:19,528 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: at expanding macro 'tf_idf' (per_business.pig:9) file per_business.pig, line 35, column 17 Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.plan.PlanValidationException: ERROR 1025: at expanding macro 'tf_idf' (per_business.pig:9) file per_business.pig, line 35, column 17 Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
    at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:191)
    at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:174)
    at org.apache.pig.newplan.logical.visitor.ColumnAliasConversionVisitor$1.visit(ColumnAliasConversionVisitor.java:53)
    at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:215)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:142)
    at org.apache.pig.newplan.logical.relational.LOInnerLoad.accept(LOInnerLoad.java:128)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:124)
    at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:76)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.PigServer$Graph.compile(PigServer.java:1694)
    at org.apache.pig.PigServer$Graph.compile(PigServer.java:1686)
    at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1387)
    at org.apache.pig.PigServer.execute(PigServer.java:1302)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:391)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:369)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:133)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:195)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:600)
    at org.apache.pig.Main.main(Main.java:156)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
{noformat}
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852299#comment-13852299 ] Dmitriy V. Ryaboy commented on PIG-3630:

Now that registers are in place, it works in 12 as well:
{code}
Input(s):
Successfully read records from: /Users/dmitriy/Downloads/trimmed_reviews.avro

Output(s):
Successfully stored records in: file:///Users/dmitriy/src/pig-0.12/tmp/pig_12_ntf_idf_scores

Job DAG:
job_local_0001 -> job_local_0003,job_local_0002,
job_local_0003 -> job_local_0005,
job_local_0002 -> job_local_0004,
job_local_0004 -> job_local_0005,
job_local_0005

2013-12-18 15:22:02,012 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
{code}
Back to you...

Macros that work in Pig 0.11 fail in Pig 0.12 :(
Key: PIG-3630
URL: https://issues.apache.org/jira/browse/PIG-3630
[jira] [Assigned] (PIG-3621) Python Avro library can't read Avros made with builtin AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3621:

Assignee: (was: Dmitriy V. Ryaboy)

Uh, no thanks :)

Python Avro library can't read Avros made with builtin AvroStorage
Key: PIG-3621
URL: https://issues.apache.org/jira/browse/PIG-3621
Project: Pig
Issue Type: Bug
Components: internal-udfs
Affects Versions: 0.12.0
Reporter: Russell Jurney
Fix For: 0.12.1, 0.13.0
Attachments: PIG-3631-2.patch, PIG-3631.patch

Using this script:

{code}
from avro import schema, datafile, io
import pprint
import sys
import json

field_id = None  # Optional key to print
if (len(sys.argv) > 2):
    field_id = sys.argv[2]

# Test reading avros
rec_reader = io.DatumReader()

# Create a 'data file' (avro file) reader
df_reader = datafile.DataFileReader(
    open(sys.argv[1]),
    rec_reader
)
{code}

the last line fails with:

{noformat}
Traceback (most recent call last):
  File "/Users/rjurney/bin/cat_avro", line 22, in <module>
    rec_reader
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/datafile.py", line 247, in __init__
    self.datum_reader.writers_schema = schema.parse(self.get_meta(SCHEMA_KEY))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", line 784, in parse
    return make_avsc_object(json_data, names)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", line 740, in make_avsc_object
    return RecordSchema(name, namespace, fields, names, type, doc, other_props)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", line 653, in __init__
    other_props)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", line 294, in __init__
    new_name = names.add_name(name, namespace, self)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/avro/schema.py", line 268, in add_name
    raise SchemaParseException(fail_msg)
avro.schema.SchemaParseException: record is a reserved type name.
{noformat}
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
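The traceback shows the failure happening while parsing the writer's schema: "record" is one of Avro's reserved type names, so a schema whose top-level record is literally named "record" is rejected by the Python avro library. The check can be sketched in a few lines of plain Python; the helper name, the reserved-name set (taken from the Avro specification), and the example schema are illustrative assumptions, not code from Pig or the avro library:

```python
# Hypothetical sketch of why a record named "record" is rejected.
# AVRO_RESERVED_NAMES follows the reserved type names in the Avro spec.
import json

AVRO_RESERVED_NAMES = {
    "null", "boolean", "int", "long", "float", "double", "bytes", "string",
    "record", "enum", "array", "map", "union", "fixed",
}

def validate_record_name(schema_json):
    """Raise ValueError when a record schema's name collides with a
    reserved type name, mirroring avro's SchemaParseException."""
    parsed = json.loads(schema_json)
    name = parsed.get("name")
    if name in AVRO_RESERVED_NAMES:
        raise ValueError("%s is a reserved type name." % name)
    return name

# A top-level schema shaped like the one AvroStorage reportedly writes:
bad_schema = '{"type": "record", "name": "record", "fields": []}'
```

Renaming the record (e.g. `"name": "pig_schema"`) makes the same schema parse cleanly, which is why the fix is on the writing side.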
[jira] [Commented] (PIG-3621) Python Avro library can't read Avros made with builtin AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852441#comment-13852441 ] Dmitriy V. Ryaboy commented on PIG-3621:

Sorry, that was a no to the assignment. Cheolsoo, does that var get set elsewhere? Why remove the logic for checking empty string, etc, and using a default?

Python Avro library can't read Avros made with builtin AvroStorage
Key: PIG-3621
URL: https://issues.apache.org/jira/browse/PIG-3621
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3621) Python Avro library can't read Avros made with builtin AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852500#comment-13852500 ] Dmitriy V. Ryaboy commented on PIG-3621:

+1

Python Avro library can't read Avros made with builtin AvroStorage
Key: PIG-3621
URL: https://issues.apache.org/jira/browse/PIG-3621
Attachments: PIG-3621-3.patch, PIG-3631-2.patch, PIG-3631.patch
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851239#comment-13851239 ] Dmitriy V. Ryaboy commented on PIG-3630:

Could you link to the code directly, rather than the book? The Safari website is giving me interstitials and other unpleasant things. Have you investigated the schemas of relations referred to in the error message, and checked if your field references make sense?

{noformat}
2013-12-16 22:09:20,189 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: at expanding macro 'tf_idf' (per_business.pig:9) file per_business.pig, line 35, column 17 Invalid field projection. Projected field [tf_idf] does not exist in schema: business_id:chararray,token:chararray,term_freq:double,num_docs_with_token:long.
{noformat}

Macros that work in Pig 0.11 fail in Pig 0.12 :(
Key: PIG-3630
URL: https://issues.apache.org/jira/browse/PIG-3630
[jira] [Commented] (PIG-3630) Macros that work in Pig 0.11 fail in Pig 0.12 :(
[ https://issues.apache.org/jira/browse/PIG-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851412#comment-13851412 ] Dmitriy V. Ryaboy commented on PIG-3630:

That macro does not refer to a field called tf_idf. Could you post a fully reproducible test case?

Macros that work in Pig 0.11 fail in Pig 0.12 :(
Key: PIG-3630
URL: https://issues.apache.org/jira/browse/PIG-3630
-- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813080#comment-13813080 ] Dmitriy V. Ryaboy commented on PIG-3453:

Mridul: In our experience at Twitter, Trident introduces pretty high overhead; in Summingbird, we relax the data delivery guarantees to get better throughput, and use Storm directly. Perhaps you want to try putting pig on top of Summingbird? If you did that, we might even be able to help :). In any case, interested in seeing how all of this will turn out.

Cheolsoo: No real objections to an svn branch. In the past I've found it far easier to cooperate on significant branches on github than to maintain an svn branch (you can easily have multiple branches, reviews are easier, etc). That's how Bill Graham and I did the HBaseStorage rewrite a few years back. But really that's up to the developers doing the work.

Implement a Storm backend to Pig
Key: PIG-3453
URL: https://issues.apache.org/jira/browse/PIG-3453
Project: Pig
Issue Type: New Feature
Affects Versions: 0.13.0
Reporter: Pradeep Gollakota
Assignee: Jacob Perkins
Labels: storm
Fix For: 0.13.0
Attachments: storm-integration.patch

There is a lot of interest around implementing a Storm backend to Pig for streaming processing. The proposal and initial discussions can be found at https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813150#comment-13813150 ] Dmitriy V. Ryaboy commented on PIG-3453:

Oh, I absolutely just meant collaboration on the initial contribution to happen on github, for expediency and fast iteration. Of course once this work is in a committable/mergeable state, it should go into Apache.

Implement a Storm backend to Pig
Key: PIG-3453
URL: https://issues.apache.org/jira/browse/PIG-3453
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811726#comment-13811726 ] Dmitriy V. Ryaboy commented on PIG-3453:

I don't see why Jacob can't keep working in a github branch... easier to look at what's changing, and he can keep merging the (read-only) git mirror from apache to keep up with changes. Jacob, I see you are using Trident. Have you looked at your throughput numbers, vs going directly to storm?

Implement a Storm backend to Pig
Key: PIG-3453
URL: https://issues.apache.org/jira/browse/PIG-3453
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3549) Print hadoop jobids for failed, killed job
[ https://issues.apache.org/jira/browse/PIG-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808007#comment-13808007 ] Dmitriy V. Ryaboy commented on PIG-3549:

OMG. Thanks. +1.

Print hadoop jobids for failed, killed job
Key: PIG-3549
URL: https://issues.apache.org/jira/browse/PIG-3549
Project: Pig
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
Fix For: 0.12.1
Attachments: PIG-3549.patch

It would be better if we dumped the hadoop job ids for failed and killed jobs in the pig log. Right now, the log looks like the following:
{noformat}
ERROR org.apache.pig.tools.grunt.Grunt: ERROR 6017: Job failed! Error - NA
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher: Job job_pigexec_1 killed
{noformat}
From that it's hard to say which hadoop job failed if there are multiple jobs running in parallel.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807458#comment-13807458 ] Dmitriy V. Ryaboy commented on PIG-3453:

[~azaroth]: may I suggest https://github.com/twitter/algebird for this and many other approximate counting use cases? :-) Already in use by scalding, summingbird, and spark.

Implement a Storm backend to Pig
Key: PIG-3453
URL: https://issues.apache.org/jira/browse/PIG-3453
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784627#comment-13784627 ] Dmitriy V. Ryaboy commented on PIG-3445:

That's a great addition, thanks Lorand. The code looks really tidy now. Looks like ParquetUtil is actually a general util? Maybe add that functionality to org.apache.pig.impl.util.JarManager or something along those lines? [~julienledem] do we need to publish a new artifact version so fastutil isn't required for dictionary encoding?

Make Parquet format available out of the box in Pig
Key: PIG-3445
URL: https://issues.apache.org/jira/browse/PIG-3445
Project: Pig
Issue Type: Improvement
Reporter: Julien Le Dem
Fix For: 0.12.0
Attachments: PIG-3445-2.patch, PIG-3445-3.patch, PIG-3445.patch

We would add the Parquet jar to the Pig packages to make it available out of the box to pig users. On top of that we could add the parquet.pig package to the list of packages to search for UDFs. (Alternatively, the parquet jar could contain classes named org.apache.pig.builtin.ParquetLoader and ParquetStorer.) This way users can use Parquet simply by typing:
{code}
A = LOAD 'foo' USING ParquetLoader();
STORE A INTO 'bar' USING ParquetStorer();
{code}
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3082) outputSchema of a UDF allows two usages when describing a Tuple schema
[ https://issues.apache.org/jira/browse/PIG-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784720#comment-13784720 ] Dmitriy V. Ryaboy commented on PIG-3082:

So... that's a breaking change; a bunch of UDFs will fail under 12. Intended?

outputSchema of a UDF allows two usages when describing a Tuple schema
Key: PIG-3082
URL: https://issues.apache.org/jira/browse/PIG-3082
Project: Pig
Issue Type: Bug
Reporter: Julien Le Dem
Assignee: Jonathan Coveney
Fix For: 0.12.0
Attachments: PIG-3082-0.patch, PIG-3082-1.patch

When defining an evalfunc that returns a Tuple, there are two ways you can implement outputSchema():
- The right way: return a schema that contains one Field, which carries the type and schema of the return type of the UDF.
- The unreliable way: return a schema that contains more than one field; it will be understood as a tuple schema even though there is no type (which lives in the Field class) to specify that.

This is particularly deceitful when the output schema is derived from the input schema and the outputted Tuple sometimes contains only one field. In such cases Pig understands the output schema as a tuple only if there is more than one field, so sometimes it works and sometimes it does not. We should at least issue a warning (for backward compatibility), if not plainly throw an exception, when the output schema contains more than one Field.
-- This message was sent by Atlassian JIRA (v6.1#6144)
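The ambiguity described in PIG-3082 can be illustrated with a small toy model. This is a hypothetical, simplified sketch in Python, not Pig's actual Schema/FieldSchema API: a consumer that guesses "more than one field means a tuple schema" necessarily treats one-field and two-field outputs of the same UDF differently, which is exactly the unreliable path.

```python
# Toy model of the ambiguity (not Pig's real Schema API).
# A schema is a list of fields; a field is (name, type, inner_fields).
TUPLE = "tuple"
CHARARRAY = "chararray"

def interpret_output_schema(fields):
    """Mimics the unreliable heuristic: more than one field means
    'treat the whole list as a tuple schema'; exactly one field means
    'treat it as a plain field'."""
    if len(fields) > 1:
        return ("tuple-of", [f[0] for f in fields])
    return ("field", fields[0][0])

def right_way(inner_fields):
    """The unambiguous form: a single TUPLE-typed field that nests the
    real fields, so the shape no longer depends on the field count."""
    return [("t", TUPLE, inner_fields)]

# Same UDF, two different output widths:
two = interpret_output_schema([("a", CHARARRAY, None), ("b", CHARARRAY, None)])
one = interpret_output_schema([("a", CHARARRAY, None)])
# 'two' is read as a tuple, 'one' as a bare field -- the inconsistency
# the issue describes.
```

With `right_way`, both the one-field and two-field cases produce a single field of type TUPLE, which is why the description calls it the right way.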
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783614#comment-13783614 ] Dmitriy V. Ryaboy commented on PIG-3445:

[~lbendig] might be more succinct to use StoreFuncWrapper?

Make Parquet format available out of the box in Pig
Key: PIG-3445
URL: https://issues.apache.org/jira/browse/PIG-3445
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13782112#comment-13782112 ] Dmitriy V. Ryaboy commented on PIG-3480: That is fine with me, let's make SequenceFile optional. It will let people avoid the bug I am encountering, and also do things like use snappy compression. TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12.0 Attachments: PIG-3480.patch When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Affects Version/s: 0.12 Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.12, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Dmitriy V. Ryaboy Priority: Critical Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
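The slowdown traced to PIG-2923 comes from doing a memory estimate on every add. A hedged sketch of the general mitigation (hypothetical code, not the actual PIG-3325 patch): estimate only every Nth added tuple and extrapolate, which keeps the spill accounting while removing most of the per-add cost.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of cheap bag-size accounting: instead of estimating the
// memory footprint of every tuple added (the slow path PIG-3325 describes),
// estimate only every Nth tuple and extrapolate from the sampled average.
public class SampledSizeBag {
    private static final int SAMPLE_EVERY = 100;
    private final List<Object> tuples = new ArrayList<>();
    private long sampledBytes = 0;
    private int sampleCount = 0;

    public void add(Object tuple) {
        tuples.add(tuple);
        if (tuples.size() % SAMPLE_EVERY == 1) {  // sample the 1st, 101st, 201st, ...
            sampledBytes += estimateBytes(tuple); // the expensive call, now rare
            sampleCount++;
        }
    }

    // Average sampled size times total tuple count.
    public long estimatedTotalBytes() {
        if (sampleCount == 0) return 0;
        return (sampledBytes / sampleCount) * tuples.size();
    }

    // Stand-in for a real per-object memory estimator.
    private static long estimateBytes(Object tuple) {
        return 64; // pretend every tuple costs 64 bytes
    }

    public int size() { return tuples.size(); }

    public static void main(String[] args) {
        SampledSizeBag bag = new SampledSizeBag();
        for (int i = 0; i < 250; i++) bag.add("tuple-" + i);
        System.out.println(bag.estimatedTotalBytes()); // 16000 (= 64 bytes * 250 tuples)
    }
}
```

The tradeoff is accuracy of the spill threshold versus add() throughput; sampling trades a little of the former for a lot of the latter.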
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Fix Version/s: 0.12 Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization -- Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.12 Attachments: PIG-3479.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Fix Version/s: 0.12 Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.12, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Dmitriy V. Ryaboy Priority: Critical Fix For: 0.12 Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776172#comment-13776172 ] Dmitriy V. Ryaboy commented on PIG-3445: The size of the dependency introduced by this is orders of magnitude smaller than the HBase (or Avro) one, since everything comes from a single project (unlike HBase's liberal use of guava, metrics, ZK, and everything else under the sun). The total size is less than 1 meg. Can we add parquet.pig to the UDF import list in the same patch? Make Parquet format available out of the box in Pig --- Key: PIG-3445 URL: https://issues.apache.org/jira/browse/PIG-3445 Project: Pig Issue Type: Improvement Reporter: Julien Le Dem Attachments: PIG-3445.patch We would add the Parquet jar in the Pig packages to make it available out of the box to Pig users. On top of that we could add the parquet.pig package to the list of packages to search for UDFs. (Alternatively, the Parquet jar could contain classes named org.apache.pig.builtin.ParquetLoader and ParquetStorer.) This way users can use Parquet simply by typing: A = LOAD 'foo' USING ParquetLoader(); STORE A INTO 'bar' USING ParquetStorer(); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3445: --- Fix Version/s: 0.12 Make Parquet format available out of the box in Pig --- Key: PIG-3445 URL: https://issues.apache.org/jira/browse/PIG-3445 Project: Pig Issue Type: Improvement Reporter: Julien Le Dem Fix For: 0.12 Attachments: PIG-3445.patch We would add the Parquet jar in the Pig packages to make it available out of the box to Pig users. On top of that we could add the parquet.pig package to the list of packages to search for UDFs. (Alternatively, the Parquet jar could contain classes named org.apache.pig.builtin.ParquetLoader and ParquetStorer.) This way users can use Parquet simply by typing: A = LOAD 'foo' USING ParquetLoader(); STORE A INTO 'bar' USING ParquetStorer(); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3480) TFile-based tmpfile compression crashes in some cases
Dmitriy V. Ryaboy created PIG-3480: -- Summary: TFile-based tmpfile compression crashes in some cases Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776602#comment-13776602 ] Dmitriy V. Ryaboy commented on PIG-3480: For most of the tasks that fail, no stack trace is available on Hadoop 1 (they just die with nonzero status 134). I did catch one task with a stack trace: {code} java.io.IOException: Error while reading compressed data at org.apache.hadoop.io.IOUtils.wrappedReadForCompressedData(IOUtils.java:205) at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:342) at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:373) at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:357) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:389) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(Map {code} No idea if this is relevant. This problem does happen consistently -- 100% of the time on my script that shows this problem. Anecdotally, about 1/10 of our production scripts encounter this; I have not been able to establish a pattern yet. TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776602#comment-13776602 ] Dmitriy V. Ryaboy edited comment on PIG-3480 at 9/24/13 6:36 PM: - For most of the tasks that fail, no stack trace is available on Hadoop 1 (they just die with nonzero status 134). I did catch one task with a stack trace: {code} java.io.IOException: Error while reading compressed data at org.apache.hadoop.io.IOUtils.wrappedReadForCompressedData(IOUtils.java:205) at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:342) at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:373) at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:357) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:389) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(Map {code} No idea if this is relevant. This problem does happen consistently -- 100% of the time on my script that shows this problem. Anecdotally, about 1/10 of our production scripts encounter this; I have not been able to establish a pattern yet. was (Author: dvryaboy): For most of the tasks that fail, no stack trace is available on Hadoop 1 (they just die with nonzero status 134). 
I did catch one task with a stack trace: {code} java.io.IOException: Error while reading compressed data at org.apache.hadoop.io.IOUtils.wrappedReadForCompressedData(IOUtils.java:205) at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:342) at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:373) at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:357) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:389) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(Map {code} No idea if this is relevant. This problem does happen consistently -- 100% of the time on my script that shows this problem. Anecdotally, about 1/10 of our production scripts encounter this; I have not been able to establish a pattern yet. TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3480: --- Attachment: PIG-3480.patch Attaching a rough patch which replaces use of TFile with SequenceFile. Next steps: - evaluate effect on size of compressed data for TFile vs SeqFile when TFile does work - add tests, make TFile tests pass (in this file they fail, because of course TFile is not being used) - make SeqFile the default method, since it doesn't break - allow TFile use by a switch, since current users may want to keep it. I would prefer to not do that, but might if the first step shows significant differences. Thoughts? Especially from folks using TFile-based compression in production ([~rohini]?) TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 Attachments: PIG-3480.patch When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
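For reference, Pig's temp-file compression is driven by properties along these lines. pig.tmpfilecompression and pig.tmpfilecompression.codec are documented Pig settings; the storage switch is the optional toggle discussed in this ticket's patch, so treat its name and values as tentative:

```properties
# Enable compression of Pig's intermediate (tmp) files between MR jobs.
pig.tmpfilecompression=true
# Codec: gz works out of the box; lzo requires the codec to be installed.
pig.tmpfilecompression.codec=gz
# Hypothetical switch from the PIG-3480 discussion: use SequenceFile
# instead of TFile as the container for compressed tmp files.
pig.tmpfilecompression.storage=seqfile
```

With the switch defaulting to SequenceFile, existing TFile users could opt back in while everyone else avoids the crash described above.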
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1377#comment-1377 ] Dmitriy V. Ryaboy commented on PIG-3480: [~knoguchi] yeah, I'm not sure the stack trace is relevant -- it's the only part that's not consistent about this. The problem goes away when I set pig.tmpfilecompression to false, or when I replace TFile with SequenceFile. I've also seen stack traces that were inside TFile, and had to do with some LZO decoding issues.. the actual error is really hard to capture, other than the fact that mappers fail consistently. TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 Attachments: PIG-3480.patch When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776732#comment-13776732 ] Dmitriy V. Ryaboy commented on PIG-3480: Rohini, do you guys use lzo or gz compression? Maybe it's just lzo that's breaking. I can test gz. That never actually occurred to me, I just assumed this is completely busted because I could never get it to work (since 2010..) TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 Attachments: PIG-3480.patch When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3480) TFile-based tmpfile compression crashes in some cases
[ https://issues.apache.org/jira/browse/PIG-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776728#comment-13776728 ] Dmitriy V. Ryaboy commented on PIG-3480: Rohini I suspect this might be something about complex data types, which afaik are pretty rare at Y! and extremely common at Twitter. TFile-based tmpfile compression crashes in some cases - Key: PIG-3480 URL: https://issues.apache.org/jira/browse/PIG-3480 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Fix For: 0.12 Attachments: PIG-3480.patch When pig tmpfile compression is on, some jobs fail inside core hadoop internals. Suspect TFile is the problem, because an experiment in replacing TFile with SequenceFile succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Attachment: PIG-3479.whitespace.patch Same patch, but with whitespace changes. Committing this. Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization -- Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.12.0 Attachments: PIG-3479.patch, PIG-3479.whitespace.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Resolution: Fixed Release Note: Skewed join internals improved to get 10% or better improvement on reducers by eliminating unnecessary reflection. Status: Resolved (was: Patch Available) Committed to trunk and 0.12 Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization -- Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.12.0 Attachments: PIG-3479.patch, PIG-3479.whitespace.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
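The release note above credits the reducer speedup to "eliminating unnecessary reflection." A hedged sketch of that kind of change (hypothetical classes, not the actual PIG-3479 patch): replace reflective instantiation on the per-record deserialization path with a direct dispatch on the serialized type byte.

```java
// Hypothetical sketch: the NullableInt/NullableLong classes stand in for
// Pig's NullableXWritable family; names and type bytes are illustrative only.
public class WritableFactory {
    public static class NullableInt  { Integer value; }
    public static class NullableLong { Long value; }

    static final byte INT = 0, LONG = 1;

    // Slow path: reflective instantiation on every deserialized record.
    static Object createByReflection(String className) throws Exception {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }

    // Fast path: dispatch directly on the type byte read from the stream,
    // no class lookup or access checks per record.
    static Object createByType(byte type) {
        switch (type) {
            case INT:  return new NullableInt();
            case LONG: return new NullableLong();
            default:   throw new IllegalArgumentException("unknown type: " + type);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(createByType(INT).getClass().getSimpleName());  // NullableInt
        System.out.println(
            createByReflection(NullableLong.class.getName())
                .getClass().getSimpleName());                              // NullableLong
    }
}
```

Both paths build the same objects; the win is purely in moving the class-resolution work out of the per-record loop.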
[jira] [Commented] (PIG-3445) Make Parquet format available out of the box in Pig
[ https://issues.apache.org/jira/browse/PIG-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776879#comment-13776879 ] Dmitriy V. Ryaboy commented on PIG-3445: Other loaders like csv, avro, json, xml, etc. (even RC, though it's in piggybank due to heavy dependencies and lack of support) are all in already, so I don't see this as unfair, but as consistent. Not packaging the Parquet jars into the Pig monojar and instead adding them, the way we add guava et al. for HBase, sounds like a good idea. [~julienledem] should we do that by providing a simple wrapper in Pig builtins, or by messing with the job conf in Parquet's own loader/storer? Make Parquet format available out of the box in Pig --- Key: PIG-3445 URL: https://issues.apache.org/jira/browse/PIG-3445 Project: Pig Issue Type: Improvement Reporter: Julien Le Dem Fix For: 0.12.0 Attachments: PIG-3445.patch We would add the Parquet jar in the Pig packages to make it available out of the box to Pig users. On top of that we could add the parquet.pig package to the list of packages to search for UDFs. (Alternatively, the Parquet jar could contain classes named org.apache.pig.builtin.ParquetLoader and ParquetStorer.) This way users can use Parquet simply by typing: A = LOAD 'foo' USING ParquetLoader(); STORE A INTO 'bar' USING ParquetStorer(); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3479: -- Assignee: Dmitriy V. Ryaboy Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization -- Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: PIG-3479.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
[ https://issues.apache.org/jira/browse/PIG-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3479: --- Attachment: PIG-3479.patch Attaching a patch. I extended an existing test to test the serialization... it's the only place we test Nullables at all :(. Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization -- Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: PIG-3479.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3479) Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization
Dmitriy V. Ryaboy created PIG-3479: -- Summary: Fix BigInt, BigDec, Date serialization. Improve perf of PigNullableWritable deserilization Key: PIG-3479 URL: https://issues.apache.org/jira/browse/PIG-3479 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Attachments: PIG-3479.patch While working on something unrelated I discovered some serialization errors with recently added data types, and a heavy use of reflection slowing down PigNullableWritable deserialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache
[ https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13773636#comment-13773636 ] Dmitriy V. Ryaboy commented on PIG-2672: Aniket, can we prefix the properties with pig.? That way we won't conflict with potential properties from Hadoop, and it's a little easier to analyze stuff when looking at the jobconf. Optimize the use of DistributedCache Key: PIG-2672 URL: https://issues.apache.org/jira/browse/PIG-2672 Project: Pig Issue Type: Improvement Reporter: Rohini Palaniswamy Assignee: Aniket Mokashi Fix For: 0.12 Attachments: PIG-2672.patch Pig currently copies jar files to a temporary location in hdfs and then adds them to DistributedCache for each job launched. This is inefficient in terms of * Space - The jars are distributed to task trackers for every job taking up lot of local temporary space in tasktrackers. * Performance - The jar distribution impacts the job launch time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
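The naming convention requested above is easy to sketch: Pig-specific settings carry a "pig." prefix so they cannot collide with Hadoop's own configuration keys and are easy to spot in a jobconf dump. The property names below are illustrative examples in the spirit of the proposal, not necessarily the final PIG-2672 names.

```java
import java.util.Properties;

// Sketch of the "pig."-prefix convention for jar-cache settings
// (hypothetical property names, for illustration).
public class PrefixedProps {
    public static Properties pigProps() {
        Properties p = new Properties();
        p.setProperty("pig.user.cache.enabled", "true");
        p.setProperty("pig.user.cache.location", "/tmp/pig_cache");
        return p;
    }

    // True when every key in p carries the given prefix -- the invariant
    // that keeps Pig settings from shadowing Hadoop ones.
    public static boolean allPrefixed(Properties p, String prefix) {
        return p.stringPropertyNames().stream().allMatch(k -> k.startsWith(prefix));
    }

    public static void main(String[] args) {
        System.out.println(allPrefixed(pigProps(), "pig.")); // true
    }
}
```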
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768649#comment-13768649 ] Dmitriy V. Ryaboy commented on PIG-3419: +1 to marking the interfaces as evolving. Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Assignee: Achal Soni Priority: Minor Fix For: 0.12 Attachments: execengine.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_failures.txt, test_suite.patch, updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch, updated-8-29-2013-exec-engine.patch In an effort to adapt Pig to work using Apache Tez (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for a cleaner ExecutionEngine abstraction than existed before. The changes are not that major as Pig was already relatively abstracted out between the frontend and backend. The changes in the attached commit are essentially the barebones changes -- I tried to not change the structure of Pig's different components too much. I think it will be interesting to see in the future how we can refactor more areas of Pig to really honor this abstraction between the frontend and backend. Some of the changes were to reinstate an ExecutionEngine interface to tie together the frontend and backend, making the changes in Pig to delegate to the EE when necessary, and creating an MRExecutionEngine that implements this interface. Other work included changing ExecType to cycle through the ExecutionEngines on the classpath and select the appropriate one (this is done using Java ServiceLoader, exactly how MapReduce does for choosing the framework to use between local and distributed mode). Also I tried to make ScriptState, JobStats, and PigStats as abstract as possible in their current state. 
I think in the future some work will need to be done here to perhaps re-evaluate the usage of ScriptState and the responsibilities of the different statistics classes. I haven't touched the PPNL, but I think more abstraction is needed here, perhaps in a separate patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767220#comment-13767220 ] Dmitriy V. Ryaboy commented on PIG-3419: This is not just for Tez. The point is to enable POC work (in branches, forks, etc.) and not have each such attempt redo all the work in this ticket. It's the same reason we provide things like pluggable LoadFuncs: to let people load things we didn't think of loading. We should certainly work to stabilize 0.12 and fix issues like PIG-3457 Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Assignee: Achal Soni Priority: Minor Fix For: 0.12 Attachments: execengine.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_failures.txt, test_suite.patch, updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch, updated-8-29-2013-exec-engine.patch In an effort to adapt Pig to work using Apache Tez (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for a cleaner ExecutionEngine abstraction than existed before. The changes are not that major as Pig was already relatively abstracted out between the frontend and backend. The changes in the attached commit are essentially the barebones changes -- I tried to not change the structure of Pig's different components too much. I think it will be interesting to see in the future how we can refactor more areas of Pig to really honor this abstraction between the frontend and backend. Some of the changes were to reinstate an ExecutionEngine interface to tie together the frontend and backend, making the changes in Pig to delegate to the EE when necessary, and creating an MRExecutionEngine that implements this interface. 
Other work included changing ExecType to cycle through the ExecutionEngines on the classpath and select the appropriate one (this is done using Java ServiceLoader, exactly how MapReduce does for choosing the framework to use between local and distributed mode). Also I tried to make ScriptState, JobStats, and PigStats as abstract as possible in its current state. I think in the future some work will need to be done here to perhaps re-evaluate the usage of ScriptState and the responsibilities of the different statistics classes. I haven't touched the PPNL, but I think more abstraction is needed here, perhaps in a separate patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2965) RANDOM should allow seed initialization for ease of testing
[ https://issues.apache.org/jira/browse/PIG-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13760313#comment-13760313 ] Dmitriy V. Ryaboy commented on PIG-2965: A UDF essentially has a constructor and an exec method. foreach lines generate udf(foo) calls the exec method and passes it the foo parameter. define udfinstance udf(foo) passes foo to the constructor and binds an instance of the UDF, initialized that way, to udfinstance (so you can have many differently initialized UDFs in the same script). You can read more about all this in the docs for the define keyword and in the UDF author's guide. RANDOM should allow seed initialization for ease of testing --- Key: PIG-2965 URL: https://issues.apache.org/jira/browse/PIG-2965 Project: Pig Issue Type: Bug Reporter: Aneesh Sharma Assignee: Jonathan Coveney Labels: newbie Attachments: PIG-2965-0.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2965) RANDOM should allow seed initialization for ease of testing
[ https://issues.apache.org/jira/browse/PIG-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759609#comment-13759609 ] Dmitriy V. Ryaboy commented on PIG-2965: [~sdeneefe] are you sure you are using it right? I just tested and it works. Here's a test script you can run a few times:
{code}
define rand RANDOM('12345');
lines = load 'random.pig';
r = foreach lines generate rand();
dump r;
{code}
Run using `pig -x local random.pig 2>/dev/null`. RANDOM should allow seed initialization for ease of testing --- Key: PIG-2965 URL: https://issues.apache.org/jira/browse/PIG-2965 Project: Pig Issue Type: Bug Reporter: Aneesh Sharma Assignee: Jonathan Coveney Labels: newbie Attachments: PIG-2965-0.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
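Why the seeded form is reproducible can be sketched in plain Java. `SeededRandom` is a hypothetical analogue of `RANDOM('12345')`, not Pig's implementation: the seed string reaches the constructor once, so every run draws the same sequence.

```java
import java.util.Random;

public class SeedDemo {
    // Hypothetical analogue of: define rand RANDOM('12345');
    // The seed is parsed in the constructor, not per call, so repeated
    // runs of the same script produce identical output -- which is the
    // whole point for testing.
    public static class SeededRandom {
        private final Random rng;
        public SeededRandom(String seed) { rng = new Random(Long.parseLong(seed)); }
        public double exec() { return rng.nextDouble(); }
    }

    public static void main(String[] args) {
        SeededRandom r1 = new SeededRandom("12345");
        SeededRandom r2 = new SeededRandom("12345");
        // Same constructor seed, same sequence.
        System.out.println(r1.exec() == r2.exec()); // true
    }
}
```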
[jira] [Commented] (PIG-3048) Add mapreduce workflow information to job configuration
[ https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13753942#comment-13753942 ] Dmitriy V. Ryaboy commented on PIG-3048: no objections. after all, usage of the config info is purely optional. We've run into trouble before with information of this sort becoming very big and triggering JobConf too large errors. Might want to look at compression at some point. Add mapreduce workflow information to job configuration --- Key: PIG-3048 URL: https://issues.apache.org/jira/browse/PIG-3048 Project: Pig Issue Type: Improvement Reporter: Billie Rinaldi Assignee: Billie Rinaldi Fix For: 0.11.2 Attachments: PIG-3048.patch, PIG-3048.patch, PIG-3048.patch Adding workflow properties to the job configuration would enable logging and analysis of workflows in addition to individual MapReduce jobs. Suggested properties include a workflow ID, workflow name, adjacency list connecting nodes in the workflow, and the name of the current node in the workflow. mapreduce.workflow.id - a unique ID for the workflow, ideally prepended with the application name e.g. pig_pigScriptId mapreduce.workflow.name - a name for the workflow, to distinguish this workflow from other workflows and to group different runs of the same workflow e.g. pig command line mapreduce.workflow.adjacency - an adjacency list for the workflow graph, encoded as mapreduce.workflow.adjacency.source node = comma-separated list of target nodes mapreduce.workflow.node.name - the name of the node corresponding to this MapReduce job in the workflow adjacency list -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
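The proposed properties can be sketched with `java.util.Properties` standing in for the Hadoop job configuration. The `scope-N` node names and the script name below are made up for illustration; the key names follow the ticket's description.

```java
import java.util.Properties;

public class WorkflowConfDemo {
    // Sketch of the proposed mapreduce.workflow.* keys; Properties stands
    // in for the Hadoop JobConf, and the scope-N node names are invented.
    public static Properties build() {
        Properties conf = new Properties();
        conf.setProperty("mapreduce.workflow.id", "pig_scriptId123");        // app-prefixed ID
        conf.setProperty("mapreduce.workflow.name", "pig daily_report.pig"); // e.g. command line
        // Adjacency list: one property per source node, comma-separated targets.
        conf.setProperty("mapreduce.workflow.adjacency.scope-1", "scope-2,scope-3");
        conf.setProperty("mapreduce.workflow.adjacency.scope-2", "scope-4");
        conf.setProperty("mapreduce.workflow.node.name", "scope-1");         // this MR job's node
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("mapreduce.workflow.adjacency.scope-1"));
    }
}
```

The adjacency encoding keeps each property small, though as the comment notes, large workflows could still push the JobConf toward its size limits.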
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754235#comment-13754235 ] Dmitriy V. Ryaboy commented on PIG-3419: [~billgraham] looping you in for Ambrose. Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Assignee: Achal Soni Priority: Minor Attachments: execengine.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_failures.txt, test_suite.patch, updated-8-22-2013-exec-engine.patch, updated-8-23-2013-exec-engine.patch, updated-8-27-2013-exec-engine.patch, updated-8-28-2013-exec-engine.patch, updated-8-29-2013-exec-engine.patch In an effort to adapt Pig to work using Apache Tez (https://issues.apache.org/jira/browse/TEZ), I made some changes to allow for a cleaner ExecutionEngine abstraction than existed before. The changes are not that major as Pig was already relatively abstracted out between the frontend and backend. The changes in the attached commit are essentially the barebones changes -- I tried to not change the structure of Pig's different components too much. I think it will be interesting to see in the future how we can refactor more areas of Pig to really honor this abstraction between the frontend and backend. Some of the changes was to reinstate an ExecutionEngine interface to tie together the front end and backend, and making the changes in Pig to delegate to the EE when necessary, and creating an MRExecutionEngine that implements this interface. Other work included changing ExecType to cycle through the ExecutionEngines on the classpath and select the appropriate one (this is done using Java ServiceLoader, exactly how MapReduce does for choosing the framework to use between local and distributed mode). Also I tried to make ScriptState, JobStats, and PigStats as abstract as possible in its current state. 
I think in the future some work will need to be done here to perhaps re-evaluate the usage of ScriptState and the responsibilities of the different statistics classes. I haven't touched the PPNL, but I think more abstraction is needed here, perhaps in a separate patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749014#comment-13749014 ] Dmitriy V. Ryaboy commented on PIG-3419: Rohini, I want to reiterate that this patch has NO tez dependencies (if it does, that's a bug). The intention is not to make Tez possible. It's to make pluggable execution engines possible; and I do not want that functionality to be tied to a tez branch that will be unstable and in heavy development for the foreseeable future. This work will be immediately useful for the Spork (pig on spark) branch, for example. Also, it allows people to work with new runtimes *without modifying Pig*. So Tez-on-Pig doesn't even have to be done as a branch of this project, someone can go and experiment completely independently. For these reasons, I would like it in trunk. You make a great point about the danger of changing exceptions, public methods, etc. I believe that most of these are project-public, and annotated as such. Do you have specific methods you are concerned about? Ideally we would change as little as possible for the end user. Dmitriy Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Assignee: Achal Soni Priority: Minor Attachments: execengine.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_failures.txt, test_suite.patch, updated-8-22-2013-exec-engine.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13747065#comment-13747065 ] Dmitriy V. Ryaboy commented on PIG-3419: I'd like this patch in trunk since it's not Tez-specific, and allows people to experiment with other runtimes (for example, Spark or Drill). Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Assignee: Achal Soni Priority: Minor Attachments: execengine.patch, finalpatch.patch, mapreduce_execengine.patch, stats_scriptstate.patch, test_suite.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738285#comment-13738285 ] Dmitriy V. Ryaboy commented on PIG-3419: Hi Achal, That's a large patch. Can you give us a roadmap for reading it -- what are the changes, at a high level? It looks like you had to change a bunch of stuff that's not (at first glance) directly related to exec mode. Procedurally:
- please generate the patch using 'git diff --no-prefix' since the apache pig master is on svn
- please post the complete patch to Review Board, for ease of commenting
- please make sure that all new files have the apache license headers at the top
Thanks -D Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Priority: Minor Attachments: pluggable_execengine.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3419) Pluggable Execution Engine
[ https://issues.apache.org/jira/browse/PIG-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738288#comment-13738288 ] Dmitriy V. Ryaboy commented on PIG-3419: oh, 3 more things :)
- I thought you found your way around the -y argument? I still see that in there.
- Don't comment out blocks of code, just delete them.
- Add some documentation about creating new Exec Engines to the xml-based docs, or at least post it here. Just having it in javadocs is not sufficient.
Pluggable Execution Engine --- Key: PIG-3419 URL: https://issues.apache.org/jira/browse/PIG-3419 Project: Pig Issue Type: New Feature Affects Versions: 0.12 Reporter: Achal Soni Priority: Minor Attachments: pluggable_execengine.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13723065#comment-13723065 ] Dmitriy V. Ryaboy commented on PIG-3325: Urgh, you are right of course. I can move the .next() call into the for loop... but I wonder if that will slow us down again. Will check. Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Dmitriy V. Ryaboy Priority: Critical Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Assignee: Dmitriy V. Ryaboy (was: Mark Wagner) Status: Patch Available (was: Open) marking as patch available. please review. Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11.1, 0.11, 0.11.2 Reporter: Mark Wagner Assignee: Dmitriy V. Ryaboy Priority: Critical Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13696209#comment-13696209 ] Dmitriy V. Ryaboy commented on PIG-3325: Ok I started looking at this, will update with a patch shortly. In the meantime -- my benchmark shows Mark's patch improves perf on small bags of 20-100 elements, but causes extremely poor performance for large bags. I created a benchmark that does 100 rounds of creating a bag of N elements, for values of N in [1,20,100,1000]. These sets of 100 rounds are run 15 times each, performance of the first 5 is thrown out to account for system warmup / jit optimizations. Results:
||Num Tuples in Bag || Trunk avg || Patch 1 avg ||
| 1 | round: 0.00 | round: 0.00 |
| 20 | round: 0.01 | round: 0.00 |
| 100 | round: 0.13 | round: 0.00 |
| 1000 | round: 0.19 | round: 1.20 |
Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Attachment: PIG-3325.2.patch Updating with a patch. Results:
||Num Tuples in Bag || Trunk avg || Patch 1 avg || Patch 2 avg ||
| 1 | round: 0.00 | round: 0.00 | round: 0.00 |
| 20 | round: 0.01 | round: 0.00 | round: 0.00 |
| 100 | round: 0.13 | round: 0.00 | round: 0.00 |
| 1000 | round: 0.19 | round: 1.20 | round: 0.03 |
I also ran Mark's bench test in a loop 10 times (again, to account for jit effects). Results are as follows:
My Patch, Mark's test:
7050 ns 450 ns 440 ns 550 ns 440 ns 440 ns 440 ns 440 ns 440 ns 540 ns 410 ns 440 ns 440 ns 430 ns 460 ns
Trunk, Mark's test:
243240 ns 156640 ns 25440 ns 23470 ns 18930 ns 20710 ns 16890 ns 20210 ns 17630 ns 17900 ns 21420 ns 22550 ns 22900 ns 19800 ns 16770 ns
Mark's patch, Mark's Test:
8480 ns 2750 ns 2690 ns 2760 ns 3270 ns 3590 ns 6530 ns 5900 ns 6340 ns 5410 ns 5400 ns 5420 ns 5670 ns 5410 ns 5420 ns
Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.2.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3325: --- Attachment: PIG-3325.3.patch Slight update -- resetting all counters on clear(), and getting rid of an unnecessarily long 10K tuple test. Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13695279#comment-13695279 ] Dmitriy V. Ryaboy commented on PIG-3015: +1 if we find more stuff, we can open other jiras. Let's get this into trunk. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-11.patch, PIG-3015-12.patch, PIG-3015-20May2013.diff, PIG-3015-22June2013.diff, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13688911#comment-13688911 ] Dmitriy V. Ryaboy commented on PIG-3325: [~mwagner] I was loading complex thrift structures that had bags in them. With old code (all bags register with SMM) this led to tons of weak references that needed to be cleaned out by the SMM; new code fixed that, but apparently created this other problem (which in practice on our workloads is not significant.. but your workloads may be different). Looking forward to Rohini's patch. Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689465#comment-13689465 ] Dmitriy V. Ryaboy commented on PIG-3325: What if instead of figuring out size based on the first 100 elements, we sampled first, 11th, 21st, etc until we get 100 samples? Would help with small bags (where accuracy of estimate doesn't matter as much). Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
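The "first, 11th, 21st, ..." suggestion is stride sampling. A minimal sketch, with plain longs standing in for per-tuple memory estimates (the array and the 64-byte figure are invented for illustration, not taken from Pig's bag code):

```java
public class StrideSampleDemo {
    // Instead of averaging only the first 100 tuples, sample the 1st, 11th,
    // 21st, ... element until 100 samples are taken, so the sample is spread
    // across the bag rather than concentrated at its start.
    public static double estimateAvg(long[] sizes) {
        long agg = 0;
        int samples = 0;
        for (int i = 0; i < sizes.length && samples < 100; i += 10) {
            agg += sizes[i];
            samples++;
        }
        return samples == 0 ? 0.0 : (double) agg / samples;
    }

    public static void main(String[] args) {
        long[] sizes = new long[1000];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = 64; // pretend every tuple is 64 bytes
        }
        System.out.println(estimateAvg(sizes)); // 64.0
    }
}
```

For a 1000-element bag this visits 100 elements spread over the whole bag; for a 20-element bag it visits only 2, which is why accuracy on small bags matters less here.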
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13686100#comment-13686100 ] Dmitriy V. Ryaboy commented on PIG-3325: The previous behavior (having SMM check all bags) was pretty bad, it caused significant sudden delays if the data you were loading had bags in it. We observed pretty good speed gains for those use cases once we got rid of mandatory bag registration. Also got rid of a few memory leaks while we were in there, and the linked list maintenance overhead in SMM. Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679862#comment-13679862 ] Dmitriy V. Ryaboy commented on PIG-3325: [~mwagner] thanks for catching this perf regression. I only had time for a cursory look today -- why is the existing code O(n)? Seems like it sampled up to 100 elements and no more, so it's constant (once n=100). Seems to me like all that materially changed was that you added the sampling bit to add(). Unfortunately, a number of Bags override add() (see my notes in PIG-2923), which makes doing this in the default add() of the abstract function unreliable. Seems to me like a better approach would be to tackle the fact that for every time that getMemorySize() is called while there are fewer than 100 elements, we iterate over the whole bag (which is what you mean by O(n)?). We can do this by jumping directly to the mLastContentsSize'th element in the Bag, if we know the structure, or at least iterate to it without calling getMemorySize(), and then add to our running avg, rather than recomputing it. So, no resetting aggSampleTupleSize in your version, or avgTupleSize in mine, to 0 when sampling, just ignoring the first mLastContentsSize in the iterator. Thoughts? Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation). -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
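The proposal above, skip ahead past already-sampled elements and fold new ones into a running average instead of recomputing from index 0, can be sketched like this. The field names echo the comment's mLastContentsSize/avgTupleSize idea, but the class is illustrative, not Pig's bag implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class RunningAvgDemo {
    public static class SampledBag {
        private final List<Long> sizes = new ArrayList<>();
        private int mSampled = 0;       // elements already folded into the average
        private double avgTupleSize = 0;

        public void add(long size) { sizes.add(size); }

        // On each call, resume from mSampled instead of re-walking (and
        // re-measuring) the whole bag; cap sampling at 100 elements.
        public double getAvgSize() {
            int limit = Math.min(sizes.size(), 100);
            for (; mSampled < limit; mSampled++) {
                // Incremental mean: fold in one new sample without resetting.
                avgTupleSize += (sizes.get(mSampled) - avgTupleSize) / (mSampled + 1);
            }
            return avgTupleSize;
        }
    }

    public static void main(String[] args) {
        SampledBag bag = new SampledBag();
        bag.add(10); bag.add(30);
        System.out.println(bag.getAvgSize()); // 20.0
        bag.add(50);                          // only the new element is visited
        System.out.println(bag.getAvgSize()); // 30.0
    }
}
```

This makes repeated getMemorySize() calls amortized O(1) per added element, rather than O(n) over the bag each time.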
[jira] [Comment Edited] (PIG-3325) Adding a tuple to a bag is slow
[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679862#comment-13679862 ] Dmitriy V. Ryaboy edited comment on PIG-3325 at 6/10/13 8:23 PM: - [~mwagner] thanks for catching this perf regression. I only had time for a cursory look today -- why is the existing code O(N)? Seems like it sampled up to 100 elements and no more, so it's constant (once n=100). Seems to me like all that materially changed was that you added the sampling bit to add(). Unfortunately, a number of Bags override add() (see my notes in PIG-2923), which makes doing this in the default add() of the abstract class unreliable. Seems to me like a better approach would be to tackle the fact that every time getMemorySize() is called while there are fewer than 100 elements, we iterate over the whole bag (which is what you mean by O(N)?). We can do this by jumping directly to the mLastContentsSize'th element in the Bag, if we know the structure, or at least iterate to it without calling getMemorySize(), and then add to our running avg, rather than recomputing it. So, no resetting aggSampleTupleSize in your version, or avgTupleSize in mine, to 0 when sampling, just ignoring the first mLastContentsSize in the iterator. Thoughts?
Adding a tuple to a bag is slow --- Key: PIG-3325 URL: https://issues.apache.org/jira/browse/PIG-3325 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.11.1, 0.11.2 Reporter: Mark Wagner Assignee: Mark Wagner Priority: Critical Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch The time it takes to add a tuple to a bag has increased significantly, causing some jobs to take about 50x longer compared to 0.10.1. I've tracked this down to PIG-2923, which has made adding a tuple heavier weight (it now includes some memory estimation).
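The running-average idea sketched in the comment above can be illustrated with a hypothetical plain-Java sketch (the class and field names here are borrowed from the discussion, not taken from Pig's actual implementation): each newly sampled tuple size is folded into an incremental mean in O(1), instead of re-walking the sampled prefix of the bag and recomputing the average from scratch.

```java
public class RunningAvg {
    private long sampled = 0;        // tuples sampled so far (mLastContentsSize analog)
    private double avgTupleSize = 0; // running mean of sampled tuple sizes

    // Fold one new sample into the mean in O(1); no reset to 0,
    // no iteration over previously sampled elements.
    public void addSample(long tupleSizeBytes) {
        sampled++;
        avgTupleSize += (tupleSizeBytes - avgTupleSize) / sampled;
    }

    public double getAvgTupleSize() {
        return avgTupleSize;
    }
}
```

With this shape, getMemorySize() only ever pays for tuples it has not seen before, which is the behavior the comment argues for.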
[jira] [Commented] (PIG-3341) Improving performance of loading datetime values
[ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673780#comment-13673780 ] Dmitriy V. Ryaboy commented on PIG-3341: I don't think we are completely consistent, but turning invalid into null has been pretty standard. My personal preference is also to increment a counter for # of such conversions, and to log the first N occurrences (when N errors are encountered, log something to the effect of not logging this error any more because there's so much of it.) Improving performance of loading datetime values Key: PIG-3341 URL: https://issues.apache.org/jira/browse/PIG-3341 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.11.1 Reporter: pat chan Priority: Minor Fix For: 0.12, 0.11.2 The performance of loading datetime values can be improved by about 25% by moving a single line in ToDate.java: public static DateTimeZone extractDateTimeZone(String dtStr) { Pattern pattern = Pattern.compile("(Z|(?=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$"); should become: static Pattern pattern = Pattern.compile("(Z|(?=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$"); public static DateTimeZone extractDateTimeZone(String dtStr) { There is no need to recompile the regular expression for every value. I'm not sure if this function is ever called concurrently, but Pattern objects are thread-safe anyway. As a test, I created a file of 10M timestamps: for i in 0..1000 puts '2000-01-01T00:00:00+23' end I then ran this script: grunt A = load 'data' as (a:datetime); B = filter A by a is null; dump B; Before the change it took 160s. After the change, the script took 120s. Another performance improvement can be made for invalid datetime values. If a datetime value is invalid, an exception is created and thrown, which is a costly way to fail a validity check.
To test the performance impact, I created 10M invalid datetime values: for i in 0..1000 puts '2000-99-01T00:00:00+23' end In this test, the regex pattern was always recompiled. I then ran this script: grunt A = load 'data' as (a:datetime); B = filter A by a is not null; dump B; The script took 190s. I understand this could be considered an edge case and might not be worth changing. However, if there are use cases where invalid dates are part of normal processing, then you might consider fixing this.
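The precompilation proposed in the ticket can be sketched as follows. This is an illustrative example, not ToDate.java itself: the class name is invented and the pattern is a simplified timezone-suffix matcher rather than the exact regex quoted above. The key point stands regardless: java.util.regex.Pattern objects are immutable and thread-safe, so compiling once at class-load time is safe even under concurrent calls, while Matcher instances must stay method-local.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeZoneSuffix {
    // Compiled once at class-load time rather than on every call.
    // Simplified illustrative pattern: matches a trailing "Z" or a
    // +hh / -hh / +hh:mm style offset at the end of the string.
    private static final Pattern TZ =
        Pattern.compile("(Z|[+-]\\d{2}(:?\\d{2})?)$");

    // Returns the timezone suffix, or null when none is present.
    public static String extract(String dtStr) {
        Matcher m = TZ.matcher(dtStr); // Matcher is NOT thread-safe; keep it local
        return m.find() ? m.group() : null;
    }
}
```

Per-value Pattern.compile is where the reported ~25% load-time overhead comes from; hoisting it to a static field removes that cost without changing behavior.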
[jira] [Commented] (PIG-3198) Let users use any function from PigType - PigType as if it were builtin
[ https://issues.apache.org/jira/browse/PIG-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13636681#comment-13636681 ] Dmitriy V. Ryaboy commented on PIG-3198: Please add docs! Let users use any function from PigType - PigType as if it were builtin - Key: PIG-3198 URL: https://issues.apache.org/jira/browse/PIG-3198 Project: Pig Issue Type: Bug Reporter: Jonathan Coveney Assignee: Jonathan Coveney Fix For: 0.12 Attachments: PIG-3198-0.patch, PIG-3198-1.patch, PIG-3198-apache_header.patch This idea is an extension of PIG-2643. Ideally, someone should be able to call any function currently registered in Pig as if it were builtin.
[jira] [Assigned] (PIG-3284) Document PIG-3198 and PIG-2643
[ https://issues.apache.org/jira/browse/PIG-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3284: -- Assignee: Jonathan Coveney :-) Document PIG-3198 and PIG-2643 -- Key: PIG-3284 URL: https://issues.apache.org/jira/browse/PIG-3284 Project: Pig Issue Type: Task Reporter: Jonathan Coveney Assignee: Jonathan Coveney These improvements are quite useful, but only if people know that they exist.
[jira] [Commented] (PIG-3267) HCatStorer fail in limit query
[ https://issues.apache.org/jira/browse/PIG-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629601#comment-13629601 ] Dmitriy V. Ryaboy commented on PIG-3267: Should we apply this to 0.11 too? HCatStorer fail in limit query -- Key: PIG-3267 URL: https://issues.apache.org/jira/browse/PIG-3267 Project: Pig Issue Type: Bug Affects Versions: 0.9.2, 0.10.1, 0.11.1 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.12 Attachments: PIG-3267-1.patch The following query fail: {code} data = LOAD 'student.txt' as (name:chararray, age:int, gpa:double); data_limited = limit data 10; samples = foreach data_limited generate age as number; store samples into 'samples' using org.apache.hcatalog.pig.HCatStorer('part_dt=20130101T01T36'); {code} Error happens before launching the second job. Error message: {code} Message: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:8020/user/hive/warehouse/samples/part_dt=20130101T01T36 already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) at org.apache.hcatalog.mapreduce.FileOutputFormatContainer.checkOutputSpecs(FileOutputFormatContainer.java:135) at org.apache.hcatalog.mapreduce.HCatBaseOutputFormat.checkOutputSpecs(HCatBaseOutputFormat.java:72) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecsHelper(PigOutputFormat.java:207) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecs(PigOutputFormat.java:188) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) at 
org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.pig.backend.hadoop20.PigJobControl.mainLoopAction(PigJobControl.java:157) at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:134) at java.lang.Thread.run(Thread.java:680) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:257) {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3267) HCatStorer fail in limit query
[ https://issues.apache.org/jira/browse/PIG-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629603#comment-13629603 ] Dmitriy V. Ryaboy commented on PIG-3267: (+1) HCatStorer fail in limit query -- Key: PIG-3267 URL: https://issues.apache.org/jira/browse/PIG-3267 Project: Pig Issue Type: Bug Affects Versions: 0.9.2, 0.10.1, 0.11.1 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.12 Attachments: PIG-3267-1.patch
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621897#comment-13621897 ] Dmitriy V. Ryaboy commented on PIG-2769: Didn't see earlier that this only went into trunk (thanks [~knoguchi] for pointing this out!). We should put this into the 0.11 branch; maybe there will be an 0.11.2 before 0.12 comes out. a simple logic causes very long compiling time on pig 0.10.0 Key: PIG-2769 URL: https://issues.apache.org/jira/browse/PIG-2769 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.10.0 Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported) Reporter: Dan Li Assignee: Nick White Fix For: 0.12 Attachments: case1.tar, PIG-2769.0.patch, PIG-2769.1.patch, PIG-2769.2.patch, TEST-org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TestInputSizeReducerEstimator.txt We found the following simple logic will cause very long compiling time for pig 0.10.0, while using pig 0.8.1, everything is fine. A = load 'A.txt' using PigStorage() AS (m: int); B = FOREACH A { days_str = (chararray) (m == 1 ? 31: (m == 2 ? 28: (m == 3 ? 31: (m == 4 ? 30: (m == 5 ? 31: (m == 6 ? 30: (m == 7 ? 31: (m == 8 ? 31: (m == 9 ? 30: (m == 10 ? 31: (m == 11 ? 30:31))))))))))); GENERATE days_str as days_str; } store B into 'B'; and here's a simple input file example: A.txt 1 2 3 The pig version we used in the test: Apache Pig version 0.10.0-SNAPSHOT (rexported)
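Separately from the parser fix in the attached patches, a script-level way to sidestep this kind of pathological bincond nesting is to replace the 11-level ternary chain with a table lookup, for example inside a small UDF. The sketch below is illustrative only: the class and method names are invented, and it shows just the lookup logic in plain Java rather than a full Pig EvalFunc wrapper.

```java
public class DaysInMonth {
    // Days per month (non-leap year), replacing the nested
    // m == 1 ? 31 : (m == 2 ? 28 : ...) chain with a constant-time lookup.
    private static final int[] DAYS =
        {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

    // Mirrors the script's behavior: any month outside 1..12 falls
    // through to 31, like the final else-branch of the ternary chain.
    public static String daysStr(int m) {
        return (m >= 1 && m <= 12) ? Integer.toString(DAYS[m - 1]) : "31";
    }
}
```

A flat lookup keeps the logical plan shallow, so the compiler never has to walk a deeply nested expression tree at all.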
[jira] [Resolved] (PIG-3151) No documentation for Pig 0.10.1
[ https://issues.apache.org/jira/browse/PIG-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy resolved PIG-3151. Resolution: Won't Fix Release Note: we're past this now.. resolving so I can release the release in jira No documentation for Pig 0.10.1 --- Key: PIG-3151 URL: https://issues.apache.org/jira/browse/PIG-3151 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.10.1 Reporter: Russell Jurney Assignee: Daniel Dai Priority: Critical Fix For: 0.10.1 http://pig.apache.org/docs/r0.10.1/start.html is missing! http://pig.apache.org/docs/r0.10.0/start.html is there. Are there no docs for 0.10.1? Arg! :)
[jira] [Updated] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-2769: --- Fix Version/s: 0.11.2 a simple logic causes very long compiling time on pig 0.10.0 Key: PIG-2769 URL: https://issues.apache.org/jira/browse/PIG-2769 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.10.0 Reporter: Dan Li Assignee: Nick White Fix For: 0.12, 0.11.2
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622756#comment-13622756 ] Dmitriy V. Ryaboy commented on PIG-2769: in 0.11 branch now. a simple logic causes very long compiling time on pig 0.10.0 Key: PIG-2769 URL: https://issues.apache.org/jira/browse/PIG-2769 Project: Pig Issue Type: Bug Components: build Fix For: 0.12, 0.11.2
[jira] [Updated] (PIG-3264) mvn signanddeploy target broken for pigunit, pigsmoke and piggybank
[ https://issues.apache.org/jira/browse/PIG-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3264: --- Fix Version/s: 0.11.2 mvn signanddeploy target broken for pigunit, pigsmoke and piggybank --- Key: PIG-3264 URL: https://issues.apache.org/jira/browse/PIG-3264 Project: Pig Issue Type: Bug Reporter: Bill Graham Assignee: Bill Graham Fix For: 0.11.2 Attachments: PIG_3264.1.patch, PIG_3264_branch11.1.patch Build fails with: {noformat} [artifact:deploy] Invalid reference: 'pigunit' {noformat} Patch on the way.
[jira] [Commented] (PIG-3222) New UDFContextSignature assignments in Pig 0.11 breaks HCatalog.HCatStorer
[ https://issues.apache.org/jira/browse/PIG-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616999#comment-13616999 ] Dmitriy V. Ryaboy commented on PIG-3222: This is pretty confusing. Any ideas on how to fix this? Can we get away from the whole instantiation thing, and maybe keep an object registry? New UDFContextSignature assignments in Pig 0.11 breaks HCatalog.HCatStorer --- Key: PIG-3222 URL: https://issues.apache.org/jira/browse/PIG-3222 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.11 Reporter: Feng Peng Labels: hcatalog Attachments: PigStorerDemo.java Pig 0.11 assigns a different UDFContextSignature for different invocations of the same load/store statement. This change breaks HCatStorer, which assumes all front-end and back-end invocations of the same store statement have the same UDFContextSignature so that it can read the previously stored information correctly. The related HCatalog code is in https://svn.apache.org/repos/asf/incubator/hcatalog/branches/branch-0.5/hcatalog-pig-adapter/src/main/java/org/apache/hcatalog/pig/HCatStorer.java (the setStoreLocation() function).
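The failure mode described above can be made concrete with a toy model of a per-signature property store. This is an illustrative sketch in plain Java, not Pig's UDFContext implementation, and all names are invented: the front end stashes data under one signature, and the back end can only find it if it is handed the same signature.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class SignatureStore {
    // One Properties bag per signature, standing in for the
    // per-UDFContextSignature storage a storer reads and writes.
    private final Map<String, Properties> bySignature = new HashMap<>();

    // Returns the bag for this signature, creating an empty one on first use.
    public Properties forSignature(String signature) {
        return bySignature.computeIfAbsent(signature, s -> new Properties());
    }
}
```

If the back end is handed a different signature than the front end used (say "store_2" instead of "store_1"), the lookup yields a fresh empty Properties and the schema saved at plan time is silently missing, which is the breakage HCatStorer's setStoreLocation() hits.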
[jira] [Commented] (PIG-3258) Patch to allow MultiStorage to use more than one index to generate output tree
[ https://issues.apache.org/jira/browse/PIG-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611467#comment-13611467 ] Dmitriy V. Ryaboy commented on PIG-3258: please generate patch against the project root. Patch to allow MultiStorage to use more than one index to generate output tree -- Key: PIG-3258 URL: https://issues.apache.org/jira/browse/PIG-3258 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joel Fouse Priority: Minor Labels: piggybank I have made a patch to enable MultiStorage to handle multiple tuple indexes, rather than only one, for generating the output directory structure. Before I submit it, though, I need to know if I should generate the patch from /contrib/piggybank/java where I've been compiling and unit testing, or back at the project root.
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611472#comment-13611472 ] Dmitriy V. Ryaboy commented on PIG-2586: Do we need this given Ambrose (and from what I hear, Ambari)? What is the difference between what this proposes and what Ambrose does? https://github.com/twitter/ambrose There is an Ambrose patch to add inner plans, too: https://github.com/twitter/ambrose/issues/62 A better plan/data flow visualizer -- Key: PIG-2586 URL: https://issues.apache.org/jira/browse/PIG-2586 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Labels: gsoc2013 Pig supports a dot graph style plan to visualize the logical/physical/mapreduce plan (explain with -dot option, see http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). However, dot graph takes an extra step to generate the plan graph and the quality of the output is not good. It would be better to implement a better visualizer for Pig. It should: 1. show operator type and alias 2. turn on/off output schema 3. dive into foreach inner plan on demand 4. provide a way to show operator source code, eg, tooltip of an operator (plans don't currently have this information, but you can assume this is in place) 5. besides visualizing the logical/physical/mapreduce plan, visualizing the script itself is also useful 6. may rely on some java graphic library such as Swing This is a candidate project for Google summer of code 2013. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2013
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611478#comment-13611478 ] Dmitriy V. Ryaboy commented on PIG-2586: It does with the linked patch (it also visualizes the MR plan, without details of what's happening inside the map or reduce stage, without the patch). A better plan/data flow visualizer -- Key: PIG-2586 URL: https://issues.apache.org/jira/browse/PIG-2586 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Labels: gsoc2013
[jira] [Commented] (PIG-2586) A better plan/data flow visualizer
[ https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611490#comment-13611490 ] Dmitriy V. Ryaboy commented on PIG-2586: Hm I guess we can add logical plan if we want -- just need to feed it to the PPNL somehow. Ambrose is pretty separate from Pig specifics, if you give it a dag, it'll draw it. Do people use the logical plan to diagnose issues? I don't think I have had to do that yet. A better plan/data flow visualizer -- Key: PIG-2586 URL: https://issues.apache.org/jira/browse/PIG-2586 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Labels: gsoc2013
[jira] [Commented] (PIG-3254) Fail a failed Pig script quicker
[ https://issues.apache.org/jira/browse/PIG-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608070#comment-13608070 ] Dmitriy V. Ryaboy commented on PIG-3254: Can I add a request for whoever will work on this ticket? Right now we die with MR Job Failed but don't say which job. In cases when multiple jobs are launched, one of them fails, the other ones are killed, and users find it hard to figure out which job was the cause of all badness. It would be nice to print out the job id of the failed job. Fail a failed Pig script quicker Key: PIG-3254 URL: https://issues.apache.org/jira/browse/PIG-3254 Project: Pig Issue Type: Improvement Reporter: Daniel Dai Fix For: 0.12 Credit to [~asitecn]. Currently Pig can launch several mapreduce jobs simultaneously. When one mapreduce job fails, we need to wait for the simultaneous mapreduce jobs to finish. In addition, we could potentially launch additional jobs which are doomed to fail. However, this is unnecessary in some cases: * If stop.on.failure==true, we can kill parallel jobs and fail the whole script * If stop.on.failure==false, and no store could succeed, we can also kill parallel jobs and fail the whole script Considering that simultaneous jobs may take a long time to finish, this could significantly improve the turnaround in some cases.
[jira] [Commented] (PIG-3132) NPE when illustrating a relation with HCatLoader
[ https://issues.apache.org/jira/browse/PIG-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604929#comment-13604929 ] Dmitriy V. Ryaboy commented on PIG-3132: +1 NPE when illustrating a relation with HCatLoader - Key: PIG-3132 URL: https://issues.apache.org/jira/browse/PIG-3132 Project: Pig Issue Type: Bug Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.12 Attachments: PIG-3132-1.patch Get NPE exception when illustrate a relation with HCatLoader: {code} A = LOAD 'studenttab10k' USING org.apache.hcatalog.pig.HCatLoader(); illustrate A; {code} Exception: {code} java.lang.NullPointerException at org.apache.hcatalog.pig.PigHCatUtil.transformToTuple(PigHCatUtil.java:274) at org.apache.hcatalog.pig.PigHCatUtil.transformToTuple(PigHCatUtil.java:238) at org.apache.hcatalog.pig.HCatBaseLoader.getNext(HCatBaseLoader.java:61) at org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:210) at org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:190) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:129) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:194) at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257) at org.apache.pig.pen.ExampleGenerator.readBaseData(ExampleGenerator.java:222) at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:154) at org.apache.pig.PigServer.getExamples(PigServer.java:1245) at 
org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:698) at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:591) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:306) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:67) {code} HCatalog side is tracked with HCATALOG-163.
[jira] [Commented] (PIG-3208) [zebra] TFile should not set io.compression.codec.lzo.buffersize
[ https://issues.apache.org/jira/browse/PIG-3208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13605765#comment-13605765 ] Dmitriy V. Ryaboy commented on PIG-3208: [~daijy] why wouldn't we commit fixes provided by the community? [zebra] TFile should not set io.compression.codec.lzo.buffersize Key: PIG-3208 URL: https://issues.apache.org/jira/browse/PIG-3208 Project: Pig Issue Type: Bug Reporter: Eugene Koontz Assignee: Eugene Koontz Attachments: PIG-3208.patch In contrib/zebra/src/java/org/apache/hadoop/zebra/tfile/Compression.java, the following occurs: {code} conf.setInt("io.compression.codec.lzo.buffersize", 64 * 1024); {code} This can cause the LZO decompressor, if called within the context of reading TFiles, to return with an error code when trying to uncompress LZO-compressed data, if the data's compressed size is too large to fit in 64 * 1024 bytes. For example, the Hadoop-LZO code uses a different default value (256 * 1024): https://github.com/twitter/hadoop-lzo/blob/master/src/java/com/hadoop/compression/lzo/LzoCodec.java#L185 This can lead to a case where, if data is compressed on a cluster where the default {{io.compression.codec.lzo.buffersize}} = 256*1024 is used, and code then tries to read this data through Pig's zebra, the Mapper will exit with code 134 because the LZO decompressor returns -4 (which encodes the LZO C library error LZO_E_INPUT_OVERRUN) when trying to uncompress the data.
The stack trace of such a case is shown below: {code} 2013-02-17 14:47:50,709 INFO com.hadoop.compression.lzo.LzoCodec: Creating stream for compressor: com.hadoop.compression.lzo.LzoCompressor@6818c458 with bufferSize: 262144 2013-02-17 14:47:50,849 INFO org.apache.hadoop.io.compress.CodecPool: Paying back codec: com.hadoop.compression.lzo.LzoCompressor@6818c458 2013-02-17 14:47:50,849 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3 2013-02-17 14:47:50,857 INFO org.apache.hadoop.io.compress.CodecPool: Borrowing codec: com.hadoop.compression.lzo.LzoCompressor@6818c458 2013-02-17 14:47:50,866 INFO com.hadoop.compression.lzo.LzoCodec: Creating stream for compressor: com.hadoop.compression.lzo.LzoCompressor@6818c458 with bufferSize: 262144 2013-02-17 14:47:50,879 INFO org.apache.hadoop.io.compress.CodecPool: Paying back codec: com.hadoop.compression.lzo.LzoCompressor@6818c458 2013-02-17 14:47:50,879 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4 2013-02-17 14:47:50,887 INFO org.apache.hadoop.mapred.Merger: Merging 5 sorted segments 2013-02-17 14:47:50,890 INFO org.apache.hadoop.io.compress.CodecPool: Borrowing codec: com.hadoop.compression.lzo.LzoDecompressor@66a23610 2013-02-17 14:47:50,891 INFO com.hadoop.compression.lzo.LzoDecompressor: calling decompressBytesDirect with buffer with: position: 0 and limit: 262144 2013-02-17 14:47:50,891 INFO com.hadoop.compression.lzo.LzoDecompressor: read: 245688 bytes from decompressor. 
2013-02-17 14:47:50,891 INFO org.apache.hadoop.io.compress.CodecPool: Borrowing codec: com.hadoop.compression.lzo.LzoDecompressor@43684706 2013-02-17 14:47:50,892 INFO com.hadoop.compression.lzo.LzoDecompressor: calling decompressBytesDirect with buffer with: position: 0 and limit: 65536 2013-02-17 14:47:50,895 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2013-02-17 14:47:50,897 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.InternalError: lzo1x_decompress returned: -4 at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method) at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:307) at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82) at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75) at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:341) at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:371) at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:355) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:387) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:420) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381) at org.apache.hadoop.mapred.Merger.merge(Merger.java:77) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1548) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at
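The root cause is an invariant that both sides of the pipeline must share: a block produced against the writer's buffer size has to fit in the reader's buffer. A minimal, dependency-free sketch of that invariant (the specific sizes come from this report; the helper is illustrative, not zebra code):

```java
public class LzoBufferCheck {
    /** True when a block of the given size fits the reader-side buffer. */
    static boolean fits(int blockSize, int readerBufferBytes) {
        return blockSize <= readerBufferBytes;
    }

    public static void main(String[] args) {
        int writerBuffer = 256 * 1024; // hadoop-lzo default on the writing cluster
        int readerBuffer = 64 * 1024;  // value hard-coded by zebra's Compression.java
        int block = 245688;            // a block size observed in the log above

        System.out.println(fits(block, writerBuffer)); // fits the writer's buffer
        System.out.println(fits(block, readerBuffer)); // overruns the reader's buffer
    }
}
```

The fix direction suggested by the issue title follows directly: zebra should not override the buffer size at all, so the reader picks up the same cluster-wide value the writer used.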
[jira] [Commented] (PIG-2388) Make shim for Hadoop 0.20 and 0.23 support dynamic
[ https://issues.apache.org/jira/browse/PIG-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605849#comment-13605849 ] Dmitriy V. Ryaboy commented on PIG-2388: Hive does this, and back in the day there was a patch that did this for Pig and hadoop 18 vs hadoop 20. Should be doable, though it'll take work. Make shim for Hadoop 0.20 and 0.23 support dynamic -- Key: PIG-2388 URL: https://issues.apache.org/jira/browse/PIG-2388 Project: Pig Issue Type: Improvement Affects Versions: 0.9.2, 0.10.0 Reporter: Thomas Weise Fix For: 0.9.2, 0.10.0 Attachments: PIG-2388_branch-0.9.patch We need a single Pig installation that works with both Hadoop versions. The current shim implementation assumes different builds for each version. We can solve this statically through an internal build/installation system, or by making the shim dynamic so that pig.jar will work on both versions with runtime detection. The attached patch converts the static shims into a shim interface with two implementations, each of which is compiled against the respective Hadoop version and included in a single pig.jar (similar to what Hive does). The default build behavior remains unchanged: only the shim for ${hadoopversion} will be compiled. Both shims can be built via: ant -Dbuild-all-shims=true
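The interface-plus-runtime-detection pattern described above can be sketched as follows. All names here are hypothetical illustrations of the pattern, not the types in the attached patch; the counter group names are the usual 0.20/0.23 ones but are included only as example shim behavior:

```java
// One shim interface, one implementation per Hadoop line, selected at runtime.
interface HadoopShim {
    String counterGroupName();
}

class Hadoop20Shim implements HadoopShim {
    public String counterGroupName() { return "org.apache.hadoop.mapred.Task$Counter"; }
}

class Hadoop23Shim implements HadoopShim {
    public String counterGroupName() { return "org.apache.hadoop.mapreduce.TaskCounter"; }
}

public class ShimLoader {
    /**
     * Picks the shim for the Hadoop version detected at runtime. In a real
     * build, the version string would come from the Hadoop jars on the
     * classpath (e.g. VersionInfo.getVersion()).
     */
    public static HadoopShim loadShim(String hadoopVersion) {
        return hadoopVersion.startsWith("0.20")
                ? new Hadoop20Shim()
                : new Hadoop23Shim();
    }

    public static void main(String[] args) {
        System.out.println(loadShim("0.20.2").getClass().getSimpleName());
        System.out.println(loadShim("0.23.1").getClass().getSimpleName());
    }
}
```

Since both implementations ship in the same pig.jar, each must be compiled against its own Hadoop version at build time, which is exactly what the -Dbuild-all-shims=true target above provides.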
[jira] [Commented] (PIG-3194) Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
[ https://issues.apache.org/jira/browse/PIG-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603984#comment-13603984 ] Dmitriy V. Ryaboy commented on PIG-3194: +1 Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2 --- Key: PIG-3194 URL: https://issues.apache.org/jira/browse/PIG-3194 Project: Pig Issue Type: Bug Affects Versions: 0.11 Reporter: Kai Londenberg Assignee: Prashant Kommireddi Fix For: 0.11.1 Attachments: PIG-3194_2.patch, PIG-3194.patch The changes to ObjectSerializer.java in the following commit http://svn.apache.org/viewvc?view=revision&revision=1403934 break compatibility with Hadoop 0.20.2 clusters. The reason is that the code uses methods from Apache Commons Codec 1.4 which are not available in Apache Commons Codec 1.3, which ships with Hadoop 0.20.2. The offending methods are Base64.decodeBase64(String) and Base64.encodeBase64URLSafeString(byte[]). If I revert these changes, Pig 0.11.0 candidate 2 works well with our Hadoop 0.20.2 clusters.
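One codec-1.3-compatible direction is to call only the byte[] overloads that exist in both versions and perform the URL-safe alphabet translation by hand. The sketch below demonstrates that translation itself; it uses the JDK's java.util.Base64 purely for illustration (commons-codec is not assumed on the classpath), and the helper names are hypothetical:

```java
import java.util.Base64;

public class Base64Compat {
    /**
     * URL-safe encoding as codec 1.4's encodeBase64URLSafeString produces it:
     * the standard alphabet with '+' -> '-', '/' -> '_', and no trailing
     * padding. On codec 1.3 the same result is obtainable from the
     * encodeBase64(byte[]) overload plus this translation.
     */
    static String encodeUrlSafe(byte[] data) {
        String standard = Base64.getEncoder().encodeToString(data);
        return standard.replace('+', '-').replace('/', '_').replace("=", "");
    }

    /** Decoding that tolerates both alphabets, as decodeBase64(String) does in 1.4. */
    static byte[] decode(String encoded) {
        String standard = encoded.replace('-', '+').replace('_', '/');
        return Base64.getDecoder().decode(padded(standard));
    }

    /** Restores the '=' padding stripped by URL-safe encoding. */
    private static String padded(String s) {
        int rem = s.length() % 4;
        return rem == 0 ? s : s + "====".substring(rem);
    }

    public static void main(String[] args) {
        byte[] payload = new byte[] { (byte) 0xfb, (byte) 0xff, 0x3e };
        // These three bytes hit both special characters of the standard alphabet.
        System.out.println(encodeUrlSafe(payload)); // prints -__-
    }
}
```

The committed fix took the simpler route of bundling a newer commons-codec, but the round-trip above shows why the two 1.4 methods are convenience wrappers rather than new functionality.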
[jira] [Updated] (PIG-3245) Documentation about HBaseStorage
[ https://issues.apache.org/jira/browse/PIG-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3245: --- Status: Patch Available (was: Open) Documentation about HBaseStorage Key: PIG-3245 URL: https://issues.apache.org/jira/browse/PIG-3245 Project: Pig Issue Type: Improvement Components: documentation Affects Versions: 0.11 Reporter: Daisuke Kobayashi Attachments: PIG-3245.patch HBaseStorage always disables split combination. This should be documented explicitly.
[jira] [Commented] (PIG-3241) ConcurrentModificationException in POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600602#comment-13600602 ] Dmitriy V. Ryaboy commented on PIG-3241: I think I have a clean fix; Lohit and I are testing. ConcurrentModificationException in POPartialAgg --- Key: PIG-3241 URL: https://issues.apache.org/jira/browse/PIG-3241 Project: Pig Issue Type: Bug Affects Versions: 0.11 Reporter: Lohit Vijayarenu Priority: Blocker Fix For: 0.12, 0.11.1 While running a few Pig scripts against Hadoop 2.0, I consistently see a ConcurrentModificationException {noformat} at java.util.HashMap$HashIterator.remove(HashMap.java:811) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregate(POPartialAgg.java:365) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregateSecondLevel(POPartialAgg.java:379) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.getNext(POPartialAgg.java:203) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:729) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153) {noformat} It looks like rawInputMap is being modified while elements are being removed from it.
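HashMap iterators are fail-fast: any structural modification that happens behind the iterator's back, whether from another thread (as in POPartialAgg, where the SpillableMemoryManager thread touches the map) or from the same thread, makes the next iterator operation throw the exception in the trace. A minimal single-threaded repro of the mechanism:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class CmeRepro {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        Iterator<String> it = map.keySet().iterator();
        it.next();
        map.put("c", 3); // structural modification not done through the iterator

        try {
            it.next(); // fail-fast check detects the modification
        } catch (ConcurrentModificationException e) {
            System.out.println("CME, as in the trace above");
        }
    }
}
```

Note that the check is best-effort: across threads it may fire, corrupt the map silently, or loop, which is why the real fix (below in this thread) confines all map access to one thread rather than relying on catching the exception.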
[jira] [Assigned] (PIG-3241) ConcurrentModificationException in POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-3241: -- Assignee: Dmitriy V. Ryaboy ConcurrentModificationException in POPartialAgg --- Key: PIG-3241 URL: https://issues.apache.org/jira/browse/PIG-3241 Project: Pig Issue Type: Bug Affects Versions: 0.11 Reporter: Lohit Vijayarenu Assignee: Dmitriy V. Ryaboy Priority: Blocker Fix For: 0.12, 0.11.1 While running a few Pig scripts against Hadoop 2.0, I consistently see a ConcurrentModificationException {noformat} at java.util.HashMap$HashIterator.remove(HashMap.java:811) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregate(POPartialAgg.java:365) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregateSecondLevel(POPartialAgg.java:379) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.getNext(POPartialAgg.java:203) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:729) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153) {noformat} It looks like rawInputMap is being modified while elements are being removed from it.
[jira] [Updated] (PIG-3241) ConcurrentModificationException in POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-3241: --- Attachment: PIG-3241.patch Attaching patch. Rather than synchronize all memory access, I decided to simply avoid concurrent access altogether. spill(), called by the Spillable Memory Manager, used to set up the iterator used for spilling - that involved looking at the primary and secondary maps, applying the combiner to them, doing all kinds of things -- all in the SMM thread. Instead, we now only set the doSpill flag in spill(), and do the work in the main thread, which is now the only thread that can modify iterators and hashmaps. Most of this patch is just whitespace changes :). ConcurrentModificationException in POPartialAgg --- Key: PIG-3241 URL: https://issues.apache.org/jira/browse/PIG-3241 Project: Pig Issue Type: Bug Affects Versions: 0.11 Reporter: Lohit Vijayarenu Assignee: Dmitriy V. Ryaboy Priority: Blocker Fix For: 0.12, 0.11.1 Attachments: PIG-3241.patch While running a few Pig scripts against Hadoop 2.0, I consistently see a ConcurrentModificationException {noformat} at java.util.HashMap$HashIterator.remove(HashMap.java:811) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregate(POPartialAgg.java:365) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregateSecondLevel(POPartialAgg.java:379) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.getNext(POPartialAgg.java:203) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:729) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153) {noformat} It looks like rawInputMap is being modified while elements are being removed from it.
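The flag-handoff described in the patch comment above can be sketched as follows: the memory-manager thread only raises a request flag, and the operator's single processing thread does the actual spill work, so maps and iterators are only ever touched from one thread. Names are illustrative, not POPartialAgg's actual fields:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpillFlagSketch {
    // Written by the SMM thread, read-and-cleared by the processing thread.
    private final AtomicBoolean doSpill = new AtomicBoolean(false);
    private int spillCount = 0;

    /** Called from the SpillableMemoryManager thread: only request a spill. */
    public void spill() {
        doSpill.set(true);
    }

    /** Called from the main processing thread between records. */
    public void getNext() {
        if (doSpill.getAndSet(false)) {
            // Safe: only this thread mutates the aggregation maps and iterators.
            spillCount++;
        }
    }

    public int getSpillCount() {
        return spillCount;
    }

    public static void main(String[] args) {
        SpillFlagSketch op = new SpillFlagSketch();
        op.spill();   // memory manager asks for a spill
        op.getNext(); // processing thread notices the flag and performs it
        op.getNext(); // flag already cleared; no second spill
        System.out.println(op.getSpillCount()); // prints 1
    }
}
```

The trade-off of this design is latency: memory is not freed at the instant the SMM asks, only when the processing thread next checks the flag, in exchange for eliminating the shared-state race entirely.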
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600645#comment-13600645 ] Dmitriy V. Ryaboy commented on PIG-3015: Serious question: is there a reason to put this in Pig rather than keep it elsewhere, where you can iterate without being tied to Pig's release cycle? Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-11.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.