Fwd: Travel Assistance applications open. Please inform your communities

2018-02-14 Thread Alan Gates
-- Forwarded message --
From: Gavin McDonald 
Date: Wed, Feb 14, 2018 at 1:34 AM
Subject: Travel Assistance applications open. Please inform your communities
To: travel-assista...@apache.org


Hello PMCs.

Please could you forward on the below email to your dev and user lists.

Thanks

Gav…

—
The Travel Assistance Committee (TAC) are pleased to announce that travel
assistance applications for ApacheCon NA 2018 are now open!

We will be supporting ApacheCon NA Montreal, Canada on 24th - 29th
September 2018

TAC exists to help those that would like to attend ApacheCon events, but
are unable to do so for financial reasons.
For more info on this year's applications and qualifying criteria, please
visit the TAC website at < http://www.apache.org/travel/ >. Applications
are now open and will close 1st May.

*Important*: Applications close on May 1st, 2018. Applicants have until the
closing date above to submit their applications (which should contain as
much supporting material as required to efficiently and accurately process
their request); this will enable TAC to announce successful awards shortly
afterwards.

As usual, TAC expects to deal with a range of applications from a diverse
range of backgrounds. We therefore encourage (as always) anyone thinking
about sending in an application to do so ASAP.
We look forward to greeting many of you in Montreal.

Kind Regards,
Gavin - (On behalf of the Travel Assistance Committee)
—


CFP for Dataworks Summit Sydney

2017-05-03 Thread Alan Gates
The Australia/Pacific version of Dataworks Summit is in Sydney this year, 
September 20-21.   This is a great place to talk about work you are doing in 
Apache Pig or how you are using Pig.  Information on submitting an abstract is 
at https://dataworkssummit.com/sydney-2017/abstracts/submit-abstract/

Tracks:
Apache Hadoop
Apache Spark and Data Science
Cloud and Applications
Data Processing and Warehousing
Enterprise Adoption
IoT and Streaming
Operations, Governance and Security

Deadline: Friday, May 26th, 2017.

Alan.



Call for abstracts open for Dataworks & Hadoop Summit San Jose

2017-01-31 Thread Alan Gates
The Dataworks & Hadoop summit will be in San Jose June 13-15, 2017.  The call 
for abstracts closes February 10.  You can submit an abstract at 
http://tinyurl.com/dwsj17CFA

There are tracks for Hadoop, data processing and warehousing, governance and 
security, IoT and streaming, cloud and operations, and Spark and data science.  
As always the talks will be chosen by committees from the relevant communities.

Alan.

Re: Request for addition as contributor

2016-07-12 Thread Alan Gates
Done.  Welcome to the Pig project!

Alan.

> On Jul 12, 2016, at 06:56, Adam Szita  wrote:
> 
> Hi,
> 
> Can you add my userid (szita) as contributor to Pig please.
> 
> Thanks,
> Adam



Re: [VOTE] Release Pig 0.16.0 (candidate 0)

2016-06-03 Thread Alan Gates
+1.  Checked the signatures, did a build, ran a smoke test.  Looks good.

Alan.

> On Jun 1, 2016, at 23:39, Daniel Dai  wrote:
> 
> Hi,
> 
> I have created a candidate build for Pig 0.16.0.
> 
> Keys used to sign the release are available at
> http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup.
> 
> Please download, test, and try it out:
> http://people.apache.org/~daijy/pig-0.16.0-rc0/
> 
> Release notes and the rat report are available at the same location.
> 
> Should we release this? Vote closes on Monday EOD, June 6th 2016.
> 
> Thanks,
> Daniel



Re: please unsubscribe

2016-06-01 Thread Alan Gates
To unsubscribe send email to dev-unsubscr...@pig.apache.org

Alan.

> On Jun 1, 2016, at 07:40, asser dennis  wrote:
> 
> 



Re: Using Hive UDF in pig

2016-04-13 Thread Alan Gates
Pig does not have access to Hive’s metastore to locate default functions, so 
you will have to use the full class name.

Alan.

> On Apr 13, 2016, at 01:51, Siddhi Mehta  wrote:
> 
> Hey Guys,
> 
> I have created a custom Hive UDF and have registered it as a permanent
> function using
> 
> CREATE FUNCTION myfunc AS 'com.package.mycustomfunc' USING JAR
> 'applog-udf.jar', FILE 'distributedcachedir';
> 
> 
> I want to make use of the same hive udf in pig as per jira PIG-3294
> .
> 
> 
> I am able to successfully use the udf if I define it using the full class
> name
> 
> define myfunc HiveUDF('com.package.mycustomfunc');
> 
> 
> My assumption was that custom UDF's can also be defined using the
> functionName/alias rather than the classname.
> 
> 
> When I try to do the same I keep getting errors since it cannot resolve the
> udf name using builtins
> 
> 
> define myfunc HiveUDF('default.myfunc');
> 
> 
> Is this assumption correct or do custom hive udf's need to be referenced
> via their full class name
> 
> --Siddhi
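For readers following this thread, the distinction Alan describes can be sketched in Pig Latin (the jar, class, and field names below are hypothetical, taken from Siddhi's example rather than a tested script):

```pig
-- Register the jar that contains the compiled Hive UDF
register 'applog-udf.jar';

-- Works: HiveUDF is given the full Java class name
define myfunc HiveUDF('com.package.mycustomfunc');

-- Fails for custom UDFs: Pig cannot reach Hive's metastore,
-- so the 'default.myfunc' alias cannot be resolved
-- define myfunc HiveUDF('default.myfunc');

a = load 'input.txt' as (field1:chararray);
b = foreach a generate myfunc(field1);
```

Only functions Pig can resolve without the metastore (i.e. by class name, or Hive built-ins) work with HiveUDF; metastore-registered aliases do not.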



Re: Pig UDF Submission to Piggybank

2016-01-18 Thread Alan Gates
Step one is to open a JIRA ticket on Pig's JIRA: 
https://issues.apache.org/jira/browse/PIG  In this you should describe 
the UDF you've built and features it will add.


Step two is to attach the code as a patch.  This should be generated by 
the svn diff facility (I think there's an option in Eclipse to generate 
your patch, but I'm not an Eclipse user so I don't know for sure).  This 
patch should be attached to the JIRA ticket.  Then mark the ticket as 
"patch available".  This will tell Pig developers that it's ready for 
review.


Alan.


Sudeep Pandey 
January 16, 2016 at 0:14
Good Morning:

I have created JAVA UDF using Eclipse IDE and packaged the jar with the
same IDE. I imported the jar in single node Hortonworks cluster in Virtual
machine.
I ran the code and achieved successful result.

I would like to submit the UDF code to Piggybank. What steps should I
follow? I see lots of information in the 'How to Contribute' section of the
Apache Pig website. That information was related to 'ant', 'patch', etc. I
don't understand those. Do I need to do those?

Please suggest minimum steps to submit my UDF.

Thanks,
Sudeep Pandey



Re: Using Maven instead of Ant?

2015-11-05 Thread Alan Gates
I think we're all for making the switch, just no one's gotten around to 
doing it.


Alan.


Niels Basjes 
November 5, 2015 at 4:52
Hi,

For me, using the ant build system in Pig is extremely difficult.
Today I spent about 2 hours trying to simply compile and run a test in
piggybank (in case you wonder, it was this one:
https://issues.apache.org/jira/browse/PIG-4689 ).
I have not been able to get it to work. In the end I created the test in a
separate (maven) project and after I got that working I copied everything
into the pig source tree and pulled the patch.

Many other projects (like Avro, where I'm one of the committers) use Maven,
which makes it trivial to import the project (and sub-projects like
piggybank) into almost any IDE. I happen to use IntelliJ.

Has such a switch (from ant to maven, or anything else) been considered for
the Pig project before?
Do you guys also think it's a good idea to make such a switch?



[jira] [Commented] (PIG-4405) Adding 'map[]' support to mock/Storage

2015-07-28 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644574#comment-14644574
 ] 

Alan Gates commented on PIG-4405:
-

Based on the way it's used I'm surprised to see the HashMap wrapped in a Tuple. 
 That will work because Pig allows nesting of types, but it doesn't seem 
necessary for what you're trying to do.

 Adding 'map[]' support to mock/Storage
 --

 Key: PIG-4405
 URL: https://issues.apache.org/jira/browse/PIG-4405
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.14.0
Reporter: Niels Basjes
Assignee: Niels Basjes
 Fix For: 0.16.0

 Attachments: PIG-4405-20150723.patch


 The mock/Storage class contains convenience methods for creating a bag and a tuple 
 when doing unit tests. Pig, however, has 3 complex data types (see 
 http://pig.apache.org/docs/r0.14.0/basic.html#Simple+and+Complex ) and the 
 third one (the map) is not yet present in such a convenience method.
 Feature request: Add such a method to facilitate testing map[] output better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: PigMix extension

2015-07-15 Thread Alan Gates
The initial goal of PigMix was definitely to give the project a way to 
measure itself against MapReduce and between different versions of 
releases.  So that falls into your synthetic category.


That said, if adding a field enables extending the benchmark into new 
territory and makes it more useful, then that seems like a clear win.


Alan.


Keren Ouaknine <ker...@gmail.com>
July 14, 2015 at 12:44
Hi,

I am working on expanding the PigMix benchmark.
I am interested in adding queries matching more realistic use cases, such as
finding the highest revenue of a page or the burst of activity for a
specific page. Additionally, I would like to add OLTP-like queries such as
finding other users from the same neighborhood looking at a specific page.

The current PigMix table does not have an id for a page access (see details
on page_views here: https://cwiki.apache.org/confluence/display/PIG/PigMix).
Therefore I cannot run the above queries.

I am wondering why was this field omitted from the schema of page_views?
It seems a fundamental field for all aggregation queries on page_views.

I see two options: either there is another use case that this schema
targets (what is it?) or the benchmark's goal is not to target real use
cases and is merely oriented towards a synthetic performance and
measurement goal.

Any ideas?

Thank you,
Keren

PS: I sent this email to both the dev and user mailing lists, not to
spam us :) but because these queries are both a user and a development
concern.




Re: [VOTE] Release Pig 0.15.0 (candidate 1)

2015-06-01 Thread Alan Gates
+1.  Checked the keys and signature.  Looked for any binary files in the 
source.  Made sure there were no snapshot dependencies.  Ran test-commit 
and a quick smoke test.


Alan.


Daniel Dai <da...@hortonworks.com>
June 1, 2015 at 12:04
Hi,

I have created a candidate build for Pig 0.15.0.

Keys used to sign the release are available at
http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup.

Please download, test, and try it out:
http://people.apache.org/~daijy/pig-0.15.0-candidate-1/

Release notes and the rat report are available at the same location.

Should we release this? Vote closes on Thursday EOD, June 4th 2015.

Thanks,
Daniel



Re: [VOTE] Release Pig 0.15.0 (candidate 0)

2015-05-26 Thread Alan Gates
+1.  Downloaded, checked signature and hash, built, ran test-commit and 
simple local smoke test.


Alan.


Daniel Dai <da...@hortonworks.com>
May 25, 2015 at 20:36
Hi,

I have created a candidate build for Pig 0.15.0.

Keys used to sign the release are available at
http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup.

Please download, test, and try it out:
http://people.apache.org/~daijy/pig-0.15.0-candidate-0/

Release notes and the rat report are available at the same location.

Should we release this? Vote closes on Thursday EOD, May 28th 2015.

Thanks,
Daniel



Re: HowTo build pig

2015-05-15 Thread Alan Gates

Usually 'ant' or 'ant jar' will build the jars.

Alan.


Serega Sheypak <serega.shey...@gmail.com>
May 14, 2015 at 1:40
Hi, trying to contribute
https://issues.apache.org/jira/browse/PIG-4550

Hi, can you give me a guide for building Pig? Usually I use maven.
I see that ivy requires two properties: hadoopVersion and hbaseVersion.

1. Is there any list of properties required to build the project?
2. Do I have to contribute to trunk?
3. Which hadoop/hbase versions do I have to pick?
4. Do I have to test contribution against different combinations of
hadoop/hbase versions?

I've read this one:
https://cwiki.apache.org/confluence/display/PIG/HowToContribute

but it gives general rules, nothing specific.



[jira] [Updated] (PIG-4525) Clarify Scalar has more than one row in the output.

2015-04-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-4525:

   Resolution: Fixed
Fix Version/s: 0.15.0
   Status: Resolved  (was: Patch Available)

Patch committed.  Thanks Niels.

 Clarify Scalar has more than one row in the output.
 -

 Key: PIG-4525
 URL: https://issues.apache.org/jira/browse/PIG-4525
 Project: Pig
  Issue Type: Improvement
Reporter: Niels Basjes
Assignee: Niels Basjes
Priority: Trivial
 Fix For: 0.15.0

 Attachments: PIG-4525-2015-04-30-1115.patch


 The exception "Scalar has more than one row in the output." is correct, yet it is 
 a reason for many (starting) Pig developers to search the internet for a 
 solution.
 I propose (and I'll include a patch) to simply extend the exception message 
 with a hint towards the right solution.
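For context, here is a sketch of the scalar-projection pattern that typically triggers this message (file and alias names are made up for illustration):

```pig
a = load 'input.txt' as (x:int, y:int);
b = group a all;
c = foreach b generate MAX(a.x) as max_x;   -- c has exactly one row

-- Fine: c is used as a scalar, and it holds a single row
d = foreach a generate x, c.max_x;

g = load 'groups.txt' as (z:int);
-- Fails at runtime with "Scalar has more than one row in the output."
-- if g holds more than one row:
-- e = foreach a generate x, g.z;
```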



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-3294) Allow Pig use Hive UDFs

2015-04-07 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484012#comment-14484012
 ] 

Alan Gates commented on PIG-3294:
-

+1.

I agree it makes sense to make HCatLoader/Storer share the conversion code.  We 
can file a separate JIRA for that.

 Allow Pig use Hive UDFs
 ---

 Key: PIG-3294
 URL: https://issues.apache.org/jira/browse/PIG-3294
 Project: Pig
  Issue Type: New Feature
Reporter: Daniel Dai
Assignee: Daniel Dai
  Labels: gsoc2013, java
 Fix For: 0.15.0

 Attachments: PIG-3294-1.patch, PIG-3294-2.patch, PIG-3294-3.patch, 
 PIG-3294-4.patch, PIG-3294-5.patch, PIG-3294-before-refactory.patch


 It would be nice if Pig provided some interoperability with Hive. We can wrap 
 Hive UDFs in Pig so we can use Hive UDFs in Pig.
 This is a candidate project for Google summer of code 2013. More information 
 about the program can be found at 
 https://cwiki.apache.org/confluence/display/PIG/GSoc2013



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-3294) Allow Pig use Hive UDFs

2015-04-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393057#comment-14393057
 ] 

Alan Gates commented on PIG-3294:
-

The checking in of Hive code is ugly.  We need to make sure that gets removed 
before a release so we don't end up forking.

In POForEach you are visiting the physical plan at run time to determine if we 
need the last record.  Could this not be done at compile time to save time and 
runtime?

HiveUtils.java: much of this code to convert Hive types to Pig types must 
already be in HCat.  Is it not possible to re-use that?

 Allow Pig use Hive UDFs
 ---

 Key: PIG-3294
 URL: https://issues.apache.org/jira/browse/PIG-3294
 Project: Pig
  Issue Type: New Feature
Reporter: Daniel Dai
Assignee: Daniel Dai
  Labels: gsoc2013, java
 Fix For: 0.15.0

 Attachments: PIG-3294-1.patch, PIG-3294-2.patch, PIG-3294-3.patch, 
 PIG-3294-4.patch, PIG-3294-before-refactory.patch


 It would be nice if Pig provided some interoperability with Hive. We can wrap 
 Hive UDFs in Pig so we can use Hive UDFs in Pig.
 This is a candidate project for Google summer of code 2013. More information 
 about the program can be found at 
 https://cwiki.apache.org/confluence/display/PIG/GSoc2013



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4417) Pig's register command should support automatic fetching of jars from repo.

2015-03-24 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378198#comment-14378198
 ] 

Alan Gates commented on PIG-4417:
-

A couple of comments:
# Review board is great for reviewing the patch, but to be official it has to 
be attached here too.
# Why is the DownloadResolver all static?  Why not make it an object with a 
single method?  This is just a style gripe and not a blocker for checking in 
the code.
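The notation proposed in the ticket, next to the existing form, would look roughly like this in a Pig script (the coordinates and paths below are illustrative, not taken from the patch):

```pig
-- Existing: register a jar from the local file system
register '/home/user/udfs/myudfs.jar';

-- Proposed: Gradle-style group:module:version coordinates,
-- fetched automatically from a repository
register 'com.example:example-udfs:1.0';
```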

 Pig's register command should support automatic fetching of jars from repo.
 ---

 Key: PIG-4417
 URL: https://issues.apache.org/jira/browse/PIG-4417
 Project: Pig
  Issue Type: Improvement
Reporter: Akshay Rai
Assignee: Akshay Rai

 Currently Pig's register command takes a local path to a dependency jar. 
 This clutters the local file-system as users may forget to remove this jar 
 later.
 It would be nice if Pig supported a Gradle like notation to download the jar 
 from a repository.
 Ex: At the top of the Pig script a user could add
 register 'group:module:version'; 
 It should be backward compatible and should support a local file path if so 
 desired.
 RB: https://reviews.apache.org/r/31662/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: use hcatalog in eclipse/pig

2015-02-27 Thread Alan Gates

What error message are you getting?

Alan.


李运田 <cumt...@163.com>
February 26, 2015 at 18:58
I want to use hcatalog in eclipse to deal with tables in hive,
but I can't store a table into hive:
pigServer.registerQuery("tmp = load 'pig' using org.apache.hcatalog.pig.HCatLoader();");
pigServer.registerQuery("tmp = foreach tmp generate id;");
pigServer.registerQuery("store tmp into 'hive' using org.apache.hcatalog.pig.HCatStorer();");

I can store into a file:
pigServer.registerQuery("a = LOAD '/user/hadoop/pig.txt';");
pigServer.store("a", "/user/hadoop/pig1.txt");
pigServer.registerQuery("store a into '/user/hadoop/pig2.txt';");
Perhaps the hcatalog jars are wrong?


Re: Newbie

2015-02-04 Thread Alan Gates
Are you looking for ways to contribute to the project?  A great way is 
to find a JIRA someone has filed and fix it.  Another is to write a UDF 
that you always wished Pig had.  If you do that, be sure and file a JIRA 
for it.  Also, check out the Developer Documentation section in 
https://cwiki.apache.org/confluence/display/PIG/Index


Alan.


Dilip Ramesh <dilip...@gmail.com>
February 4, 2015 at 5:55
Hello All,

I'm new to the developers list. I have used Pig before. Can anyone guide me
to a good start here?

Thank You,
D


*Dilip Ramesh*


*President - Nirmaan Goa (http://www.nirmaan.org/chapters/goa)
B.E. (Hons.) Computer Science, III year
Birla Institute of Technology & Science, Pilani, K.K. Birla Goa Campus
+91 9561442426 | https://www.facebook.com/dilip.ramesh.19*



Re: [VOTE] Release Pig 0.14.0 (candidate 1)

2014-11-17 Thread Alan Gates
+1, checked the signatures, checked the LICENSE and NOTICE files, 
checked for stray binaries, built it and ran some basic smoke tests.


Alan.


Daniel Dai <da...@hortonworks.com>
November 16, 2014 at 19:17
Hi,

I have created a candidate build for Pig 0.14.0.

Keys used to sign the release are available at
http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup.

Please download, test, and try it out:
http://people.apache.org/~daijy/pig-0.14.0-candidate-1/

Release notes and the rat report are available at the same location.

Should we release this? Vote closes on next Wednesday EOD, Nov 19th 2014.

Thanks,
Daniel



--
Sent with Postbox http://www.getpostbox.com

--
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.


Re: Pig 0.14.0 release plan

2014-11-07 Thread Alan Gates
If you're wanting to get HIVE-8484 into Hive 0.14, you should talk to 
Gunther ASAP as he is planning on rolling a release candidate today I 
believe.


Alan.


Lorand Bendig <lben...@gmail.com>
November 7, 2014 at 0:23
Hi Daniel,

Currently Pig fetch mode throws an exception if a query is performed
through HCatalog.
At the Pig side I fixed PIG-4238, but there's a tiny patch at the Hive
side as well (HIVE-8484) which would be great to have in 0.14. As a Hive
committer, would you please review it?


Thanks,
Lorand







[jira] [Commented] (PIG-4253) Add a SequenceID UDF

2014-10-30 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191040#comment-14191040
 ] 

Alan Gates commented on PIG-4253:
-

+1

 Add a SequenceID UDF
 

 Key: PIG-4253
 URL: https://issues.apache.org/jira/browse/PIG-4253
 Project: Pig
  Issue Type: Improvement
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.14.0

 Attachments: PIG-4253-1.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Drop support for Hadoop 0.20 from Pig 0.14

2014-09-17 Thread Alan Gates

+1.

Alan.


Rohini Palaniswamy <rohini.adi...@gmail.com>
September 16, 2014 at 21:38
Hi,
Hadoop has matured far beyond Hadoop 0.20; it has had two major releases
since then, and there has been no development on branch-0.20 (
http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20/) for 3
years now. It is high time we drop support for Hadoop 0.20 and only support
the Hadoop 1.x and 2.x lines going forward. This will reduce the maintenance
effort and also enable us to write more efficient code and cut down on
reflection.

Vote closes on Tuesday, Sep 23 2014.

Thanks,
Rohini





Re: [VOTE] Drop support for JDK 6 from Pig 0.14

2014-09-17 Thread Alan Gates

+1.

Alan.


Rohini Palaniswamy <rohini.adi...@gmail.com>
September 16, 2014 at 21:47
Hi,
Hadoop is dropping support for JDK6 from hadoop-2.7 this year as
mentioned in the mail below. Pig should also move to JDK7 to be able to
compile against future hadoop 2.x releases and start making releases with
jars (binaries, maven repo) compiled in JDK 7. This would also open it up
for developers to code with JDK7 specific APIs.

Vote closes on Tuesday, Sep 23 2014.

Thanks,
Rohini




-- Forwarded message --
From: Arun C Murthy a...@hortonworks.com
Date: Tue, Aug 19, 2014 at 10:52 AM
Subject: Dropping support for JDK6 in Apache Hadoop
To: d...@hbase.apache.org d...@hbase.apache.org, d...@hive.apache.org,
dev@pig.apache.org, d...@oozie.apache.org
Cc: common-...@hadoop.apache.org common-...@hadoop.apache.org


[Apologies for the wide distribution.]

Dear HBase/Hive/Pig/Oozie communities,

We, over at Hadoop, are considering dropping support for JDK6 this year.

As you may be aware, we just released hadoop-2.5.0 and are now considering
making the next release, i.e. hadoop-2.6.0, the *last* release of Apache
Hadoop which supports JDK6. This means, from hadoop-2.7.0 onwards we will
not support JDK6 anymore and we *may* start relying on JDK7-specific apis.

Now, the above is a proposal and we do not want to pull the trigger
without talking to projects downstream - hence the request for your
feedback.


Please feel free to forward this to other communities you might deem to be
at risk from this too.

thanks,
Arun





Re: [DISCUSS] Re: Dropping support for JDK6 in Apache Hadoop

2014-08-26 Thread Alan Gates
I'm +1 on both of these.  But as a side note, Hive actually still 
supports Hadoop 0.20, so your statement below isn't quite true.


Alan.


Rohini Palaniswamy <rohini.adi...@gmail.com>
August 26, 2014 at 9:36
Pig has supported jdk7 since Pig 0.10. I think we should drop support for
JDK6 from Pig 0.14 and also publish maven binaries with jdk 1.7 from Pig
0.14.

Also it is high time to drop support for Hadoop 0.20. None of the other
hadoop projects officially support Hadoop 0.20 anymore. I would like to get
rid of the reflection in code w.r.t. UGI, and be able to add support for
fetching Credentials in UDFs, LoadFunc and StoreFunc, etc.

If there are no major objections, will start two separate voting threads
for that.

Regards,
Rohini







Re: [VOTE] Release Pig 0.13.0 (candidate 1)

2014-06-30 Thread Alan Gates
+1, checked the signatures, did a build and ran commit-test, ran some 
smoke tests, built piggy-bank and ran its unit tests.  I did see one 
unit test failure in piggybank (PigDBStorage), though Daniel couldn't 
reproduce it in his environment.


Alan.


Daniel Dai <da...@hortonworks.com>
June 29, 2014 at 2:50 AM
Hi,

I have created a candidate build for Pig 0.13.0.

Keys used to sign the release are available at
http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup.

Please download, test, and try it out:
http://people.apache.org/~daijy/pig-0.13.0-candidate-1/

Release notes and the rat report are available at the same location.

Should we release this? Vote closes on Wednesday, July 2nd 2014.

Thanks,
Daniel





[jira] [Commented] (PIG-2122) Parameter Substitution doesn't work in the Grunt shell

2014-06-25 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043760#comment-14043760
 ] 

Alan Gates commented on PIG-2122:
-

+1 for the patch.

[~olgan], I don't see the backwards compatibility issue.  By definition this is 
for interactive sessions, so users can't have existing scripts that change 
behavior.  I suppose someone somewhere might regularly use $x in his 
interactive session and expect it to come out as $x rather than complain that 
it can't make the substitution, but that seems 1) unlikely, and 2) easy to fix.

 Parameter Substitution doesn't work in the Grunt shell
 --

 Key: PIG-2122
 URL: https://issues.apache.org/jira/browse/PIG-2122
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.8.0, 0.8.1, 0.12.0
Reporter: Grant Ingersoll
Assignee: Daniel Dai
Priority: Minor
 Fix For: 0.14.0

 Attachments: PIG-2122-1.patch


 Simple param substitution and things like %declare (as copied out of the 
 docs) don't work in the grunt shell.
 Start Pig with: bin/pig -x local -p time=FOO
 {quote}
 foo = LOAD '/user/grant/foo.txt' AS (a:chararray, b:chararray, c:chararray);
 Y = foreach foo generate *, '$time';
 dump Y;
 {quote}
 Output:
 {quote}
 2011-06-13 20:22:24,197 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input 
 paths to process : 1
 (1 2 3,,,$time)
 (4 5 6,,,$time)
 {quote}
 Same script, stored in junk.pig, run as: bin/pig -x local -p time=FOO junk.pig
 {quote}
 2011-06-13 20:23:38,864 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input 
 paths to process : 1
 (1 2 3,,,FOO)
 (4 5 6,,,FOO)
 {quote}
 Also, things like don't work (nor does %declare):
 {quote}
 grunt> %default DATE '20090101';
 2011-06-13 20:18:19,943 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Encountered  PATH %default  at line 1, 
 column 1.
 Was expecting one of:
 EOF 
 cat ...
 fs ...
 sh ...
 cd ...
 cp ...
 copyFromLocal ...
 copyToLocal ...
 dump ...
 describe ...
 aliases ...
 explain ...
 help ...
 kill ...
 ls ...
 mv ...
 mkdir ...
 pwd ...
 quit ...
 register ...
 rm ...
 rmf ...
 set ...
 illustrate ...
 run ...
 exec ...
 scriptDone ...
  ...
 EOL ...
 ; ...
 
 Details at logfile: 
 /Users/grant.ingersoll/projects/apache/pig/release-0.8.1/pig_1308002917912.log
 {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (PIG-4019) Compilation broken after TEZ-1169

2014-06-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14036154#comment-14036154
 ] 

Alan Gates commented on PIG-4019:
-

+1

 Compilation broken after TEZ-1169
 -

 Key: PIG-4019
 URL: https://issues.apache.org/jira/browse/PIG-4019
 Project: Pig
  Issue Type: Bug
  Components: tez
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.14.0

 Attachments: PIG-4019-1.patch


 Error message:
 {code}
 [javac] 
 /Users/daijy/pig/src/org/apache/pig/backend/hadoop/executionengine/tez/PartitionerDefinedVertexManager.java:95:
  
 setVertexParallelism(int,org.apache.tez.dag.api.VertexLocationHint,java.util.Map<java.lang.String,org.apache.tez.dag.api.EdgeManagerDescriptor>,java.util.Map<java.lang.String,org.apache.tez.runtime.api.RootInputSpecUpdate>)
  in org.apache.tez.dag.api.VertexManagerPluginContext cannot be applied to 
 (int,nulltype,java.util.Map<java.lang.String,org.apache.tez.dag.api.EdgeManagerDescriptor>)
 [javac] context.setVertexParallelism(dynamicParallelism, 
 null, edgeManagers);
 [javac]^
 {code}





[jira] [Updated] (PIG-3373) XMLLoader returns non-matching nodes when a tag name spans through the block boundary

2014-05-02 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3373:


Status: Open  (was: Patch Available)

Sorry, but the patch no longer applies and I couldn't figure out how to apply it 
manually.

 XMLLoader returns non-matching nodes when a tag name spans through the block 
 boundary
 -

 Key: PIG-3373
 URL: https://issues.apache.org/jira/browse/PIG-3373
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: site
Reporter: Ahmed Eldawy
Assignee: Ahmed Eldawy
  Labels: patch
 Attachments: PIG3373.patch, PIG3373_1.patch, PIG3373_2.patch, 
 PIG3373_3.patch, bad-file.xml.bz2, test-file-2.xml.bz2


 When a node's start tag spans two blocks, the tag is returned even if it is not 
 of the requested type.
 Example: For the following input file
 <event id=3423>
 <ev
 --- BLOCK BOUNDARY ---
 entually id=dfasd>
 XMLLoader with tag type 'event' should return only the first one but it 
 actually returns both of them
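For context, a typical piggybank XMLLoader invocation that exercises this tag-matching path might look like the sketch below; the file name and schema are illustrative, not taken from the report.

{code}
REGISTER piggybank.jar;
-- each matched 'event' element should arrive as one chararray record
events = LOAD 'bad-file.xml'
         USING org.apache.pig.piggybank.storage.XMLLoader('event')
         AS (doc:chararray);
DUMP events;
{code}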





[jira] [Updated] (PIG-3735) UDF to data cleanse the dirty data with expected pattern

2014-04-29 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3735:


Status: Open  (was: Patch Available)

Canceling patch pending inclusion of a unit test.

 UDF to data cleanse the dirty data with expected pattern
 

 Key: PIG-3735
 URL: https://issues.apache.org/jira/browse/PIG-3735
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Affects Versions: 0.10.1
Reporter: Rekha Joshi
Assignee: Rekha Joshi
  Labels: piggybank
 Fix For: 0.10.1

 Attachments: PIG-3735.1.patch


 In data processing, the data is often not clean.
 This UDF works on large-scale data and cleanses the data to an expected pattern





[jira] [Commented] (PIG-3613) UDF for SimilarityMatching between strings with matching scores

2014-04-22 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977111#comment-13977111
 ] 

Alan Gates commented on PIG-3613:
-

[~rekhajoshm], thanks for the update.  You need to add a unit test so we can 
confirm this works as we make changes to Pig going forward.

 UDF for SimilarityMatching between strings with matching scores
 ---

 Key: PIG-3613
 URL: https://issues.apache.org/jira/browse/PIG-3613
 Project: Pig
  Issue Type: Task
  Components: piggybank
Affects Versions: 0.10.1
Reporter: Rekha Joshi
Assignee: Rekha Joshi
  Labels: piggybank
 Fix For: 0.10.1

 Attachments: PIG-3613.0.patch, PIG-3613.1.patch


 It would be great if we could do similarity matching between strings on big 
 data using a Pig UDF.
 The proposed UDF works on a tuple of strings and gives a matching score.





[jira] [Updated] (PIG-3613) UDF for SimilarityMatching between strings with matching scores

2014-04-22 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3613:


Status: Open  (was: Patch Available)

 UDF for SimilarityMatching between strings with matching scores
 ---

 Key: PIG-3613
 URL: https://issues.apache.org/jira/browse/PIG-3613
 Project: Pig
  Issue Type: Task
  Components: piggybank
Affects Versions: 0.10.1
Reporter: Rekha Joshi
Assignee: Rekha Joshi
  Labels: piggybank
 Fix For: 0.10.1

 Attachments: PIG-3613.0.patch, PIG-3613.1.patch


 It would be great if we could do similarity matching between strings on big 
 data using a Pig UDF.
 The proposed UDF works on a tuple of strings and gives a matching score.





[jira] [Commented] (PIG-3892) Pig distribution for hadoop 2

2014-04-15 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970001#comment-13970001
 ] 

Alan Gates commented on PIG-3892:
-

+1 for 1.  IIRC bin/hadoop has a -version option, so we don't even need to 
depend on magic jars being present, we can just ask hadoop.

 Pig distribution for hadoop 2
 -

 Key: PIG-3892
 URL: https://issues.apache.org/jira/browse/PIG-3892
 Project: Pig
  Issue Type: Bug
  Components: build
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.13.0


 Currently the Pig distribution only bundles pig.jar for Hadoop 1. Hadoop 2 
 users need to compile again using the -Dhadoopversion=23 flag. That is 
 quite a confusing process. We need to make Pig work with Hadoop 2 out of the 
 box. I am thinking of two approaches:
 1. Bundle both pig-h1.jar and pig-h2.jar in the distribution, and bin/pig will 
 choose the right pig.jar to run
 2. Make two Pig distributions, one for Hadoop 1 and one for Hadoop 2
 Any opinion?





Re: [VOTE] Release Pig 0.12.1 (Candidate 0)

2014-04-10 Thread Alan Gates
+1

Reviewed LICENSE, NOTICE, RELEASE_NOTES, and README.  Built, built piggybank 
and ran tests, ran a local smoke test.

Alan.

On Apr 7, 2014, at 1:22 PM, Prashant Kommireddi prkommire...@apache.org wrote:

 I have created a candidate build for Pig 0.12.1. This is a maintenance
 release to Pig 0.12.0 with a few critical bug fixes.
 
 Keys used to sign the release are available at
 http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup.
 
 Please download, test, and try it out:
 
 http://people.apache.org/~prkommireddi/pig-0.12.1-candidate-0/
 
 
 Release notes and the rat report are available from the same location.
 
 
 List of issues fixed in this release
 
 http://svn.apache.org/viewvc/pig/branches/branch-0.12/CHANGES.txt?view=markup
 
 Should we release this? Vote closes EOD this Thursday, April 10th.
 
 -Prashant


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.


Re: UDF for converting datatype bytearray to UUID

2014-02-24 Thread Alan Gates
Definitely.  File a JIRA with a patch and one of us can review it.

Alan.

On Feb 22, 2014, at 8:08 PM, deepak rosario tharigopla 
rozartharigo...@gmail.com wrote:

 HI Guys,
 
 I have written a UDF which, when used around a column like an aggregate
 function in a Pig script, will convert a bytearray to a UUID or hex chars
 read from a Cassandra table's UUID column. It could be a handy
 function for users when they use Pig scripts on a Cassandra database.
 
 Does this qualify as a UDF to be added to the master Pig piggybank?
 
 Please comment..
 
 -- 
 Thanks & Regards
 Deepak Rosario Pancras
 *Achiever/Responsibility/Arranger/Maximizer/Harmony*




[jira] [Commented] (PIG-3774) Piggybank Over UDF get wrong result

2014-02-20 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907765#comment-13907765
 ] 

Alan Gates commented on PIG-3774:
-

+1.  

 Piggybank Over UDF get wrong result
 ---

 Key: PIG-3774
 URL: https://issues.apache.org/jira/browse/PIG-3774
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.12.1, 0.13.0

 Attachments: PIG-3774-1.patch








Re: Dev here. How can I help?

2014-02-14 Thread Alan Gates
Definitely start on whatever you like.  Once you pick a JIRA to start on send 
mail to the list so we can assign it to you.  That way if people have questions 
or feedback on it they know someone’s working on it.

Alan.

On Feb 14, 2014, at 2:03 AM, Kris Peeters peetersk...@gmail.com wrote:

 I'm a dev with quite a few years of experience. I love pig and I want to
 help out. I browsed through the Jira tickets. Is there anything that has a
 higher priority and that a newbie in Pig can start on? Or can I just start
 on whatever I'd like?




[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860531#comment-13860531
 ] 

Alan Gates commented on PIG-3642:
-

I don't think this will result in the same local mode/mr mode problem that we 
had before.  The issue there was we tried (and failed) to have two modes where 
Pig provided all features.  This is much more limited to doing things locally 
that can easily be done locally.

 Direct HDFS access for small jobs (fetch) 
 --

 Key: PIG-3642
 URL: https://issues.apache.org/jira/browse/PIG-3642
 Project: Pig
  Issue Type: Improvement
Reporter: Lorand Bendig
Assignee: Lorand Bendig
 Fix For: 0.13.0

 Attachments: PIG-3642.patch


 With this patch I'd like to add the possibility to directly read data from 
 HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive 
 already has this feature (fetch). This patch shares some similarities with 
 the local mode of Pig 0.6. Here, fetching kicks off when the following holds 
 for a script:
 * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
 (nested) FOREACH with expression operators, custom UDFs, etc.
 * no scalar aliases
 * no SampleLoader
 * single leaf job
 * DUMP (no STORE)
 The feature is enabled by default and can be toggled with:
 * -N or -no_fetch 
 * set opt.fetch true/false; 
 There's no STORE support because I wanted to make it explicit that this 
 optimization is for launching small/simple scripts during development, 
 rather than querying and filtering large numbers of rows on the client 
 machine. However, a threshold could be given on the input size (an 
 estimation) to determine whether to prefer fetch over MR jobs, similar to 
 what Hive's '{{hive.fetch.task.conversion.threshold}}' does. (through Pig's 
 LoadMetadata#getStatistic ?)
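Under the patch as described, toggling the optimization from a script might look like this sketch (the property name opt.fetch is taken from the description above; the data and schema are illustrative):

{code}
set opt.fetch true;  -- on by default per the description; shown for clarity
A = LOAD 'input' USING PigStorage('\t') AS (f1:chararray, f2:int);
B = FILTER A BY f2 > 10;
C = LIMIT B 5;
DUMP C;  -- map-only plan ending in DUMP: eligible for direct HDFS fetch
{code}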





[jira] [Commented] (PIG-3622) Allow casting bytearray fileds to bytearray type

2013-12-13 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848068#comment-13848068
 ] 

Alan Gates commented on PIG-3622:
-

Have you tested that this works ok with the rest of the code?  Does something 
remove the (unnecessary) cast?  If not it seems like there will be issues, as 
there is no binary cast in Pig.  

 Allow casting bytearray fileds to bytearray type
 

 Key: PIG-3622
 URL: https://issues.apache.org/jira/browse/PIG-3622
 Project: Pig
  Issue Type: Improvement
 Environment: 0.12
Reporter: Redis Liu
Priority: Minor
 Attachments: 3622-v1.patch


 test.pig:
 AA = load '1.txt' USING PigStorage(' ') as (a:bytearray, b:chararray, 
 c:chararray);
 AA1 = filter AA by a == '1';
 AA2 = foreach AA1 generate *, ( a == '1' ? a : null ) as myd;
 dump AA2;
 the INPUT file 1.txt is as below:
 a b c
 1 2 3
 4 5 6
 2 3 4
 b a c
 c a b
 run the pig script in this way:
 # pig -x local test.pig
 It'll fail with this error message:
 Pig Stack Trace
 ---
 ERROR 1051: Cannot cast to bytearray
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias AA2
   at org.apache.pig.PigServer.openIterator(PigServer.java:882)
   at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:774)
   at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
   at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
   at org.apache.pig.Main.run(Main.java:607)
   at org.apache.pig.Main.main(Main.java:156)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:200)
 Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias AA2
   at org.apache.pig.PigServer.storeEx(PigServer.java:984)
   at org.apache.pig.PigServer.store(PigServer.java:944)
   at org.apache.pig.PigServer.openIterator(PigServer.java:857)
   ... 12 more
 Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
 ERROR 1059: 
 file test.pig, line 7, column 6 Problem while reconciling output schema of 
 ForEach
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.throwTypeCheckerException(TypeCheckingRelVisitor.java:142)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:182)
   at 
 org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:76)
   at 
 org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
   at org.apache.pig.PigServer$Graph.compile(PigServer.java:1733)
   at org.apache.pig.PigServer$Graph.compile(PigServer.java:1710)
   at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1411)
   at org.apache.pig.PigServer.storeEx(PigServer.java:979)
   ... 14 more
 Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
 ERROR 2216: 
 file test.pig, line 7, column 34 Problem getting fieldSchema for (Name: 
 Cast Type: bytearray Uid: 17)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingExpVisitor.visit(TypeCheckingExpVisitor.java:603)
   at 
 org.apache.pig.newplan.logical.expression.BinCondExpression.accept(BinCondExpression.java:84)
   at 
 org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:70)
   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visitExpressionPlan(TypeCheckingRelVisitor.java:191)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:157)
   at 
 org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:242)
   at 
 org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:174)
   ... 21 more
 Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
 ERROR 1051: Cannot cast to bytearray
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingExpVisitor.visit(TypeCheckingExpVisitor.java:494)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingExpVisitor.insertCast

[jira] [Updated] (PIG-3622) Allow casting bytearray fileds to bytearray type

2013-12-13 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3622:


Assignee: Redis Liu

 Allow casting bytearray fileds to bytearray type
 

 Key: PIG-3622
 URL: https://issues.apache.org/jira/browse/PIG-3622
 Project: Pig
  Issue Type: Improvement
 Environment: 0.12
Reporter: Redis Liu
Assignee: Redis Liu
Priority: Minor
 Attachments: 3622-v1.patch


 test.pig:
 AA = load '1.txt' USING PigStorage(' ') as (a:bytearray, b:chararray, 
 c:chararray);
 AA1 = filter AA by a == '1';
 AA2 = foreach AA1 generate *, ( a == '1' ? a : null ) as myd;
 dump AA2;
 the INPUT file 1.txt is as below:
 a b c
 1 2 3
 4 5 6
 2 3 4
 b a c
 c a b
 run the pig script in this way:
 # pig -x local test.pig
 It'll fail with this error message:
 Pig Stack Trace
 ---
 ERROR 1051: Cannot cast to bytearray
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias AA2
   at org.apache.pig.PigServer.openIterator(PigServer.java:882)
   at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:774)
   at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
   at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
   at org.apache.pig.Main.run(Main.java:607)
   at org.apache.pig.Main.main(Main.java:156)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:200)
 Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias AA2
   at org.apache.pig.PigServer.storeEx(PigServer.java:984)
   at org.apache.pig.PigServer.store(PigServer.java:944)
   at org.apache.pig.PigServer.openIterator(PigServer.java:857)
   ... 12 more
 Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
 ERROR 1059: 
 file test.pig, line 7, column 6 Problem while reconciling output schema of 
 ForEach
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.throwTypeCheckerException(TypeCheckingRelVisitor.java:142)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:182)
   at 
 org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:76)
   at 
 org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
   at org.apache.pig.PigServer$Graph.compile(PigServer.java:1733)
   at org.apache.pig.PigServer$Graph.compile(PigServer.java:1710)
   at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1411)
   at org.apache.pig.PigServer.storeEx(PigServer.java:979)
   ... 14 more
 Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
 ERROR 2216: 
 file test.pig, line 7, column 34 Problem getting fieldSchema for (Name: 
 Cast Type: bytearray Uid: 17)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingExpVisitor.visit(TypeCheckingExpVisitor.java:603)
   at 
 org.apache.pig.newplan.logical.expression.BinCondExpression.accept(BinCondExpression.java:84)
   at 
 org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:70)
   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visitExpressionPlan(TypeCheckingRelVisitor.java:191)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:157)
   at 
 org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:242)
   at 
 org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:174)
   ... 21 more
 Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: 
 ERROR 1051: Cannot cast to bytearray
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingExpVisitor.visit(TypeCheckingExpVisitor.java:494)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingExpVisitor.insertCast(TypeCheckingExpVisitor.java:472)
   at 
 org.apache.pig.newplan.logical.visitor.TypeCheckingExpVisitor.visit(TypeCheckingExpVisitor.java:599)
   ... 30 more

[jira] [Updated] (PIG-3619) Provide XPath function

2013-12-13 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3619:


Assignee: Saad Patel

 Provide XPath function
 --

 Key: PIG-3619
 URL: https://issues.apache.org/jira/browse/PIG-3619
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Saad Patel
Assignee: Saad Patel
 Attachments: xpath.patch


 Xml is often loaded using XMLLoader with a record boundary tag as one of the 
 parameters. A common use case is to then extract data from those records. 
 XPath would allow those extractions to be done very easily. I'm proposing a 
 patch that adds simple XPath support as a UDF.
 Example usage of the XPath UDF would be:
 {code}
 extractions = FOREACH xmlrecords GENERATE XPath(record, 'book/author'), 
 XPath(record, 'book/title');
 {code}
 The proposed UDF also caches the last XML document. This is helpful for 
 improving performance when performing multiple consecutive XPath extractions 
 on the same XML document, as in the example above. 





[jira] [Resolved] (PIG-3619) Provide XPath function

2013-12-13 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-3619.
-

Resolution: Fixed

Patch checked in.  Thanks Saad.

 Provide XPath function
 --

 Key: PIG-3619
 URL: https://issues.apache.org/jira/browse/PIG-3619
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Saad Patel
Assignee: Saad Patel
 Attachments: xpath.patch


 Xml is often loaded using XMLLoader with a record boundary tag as one of the 
 parameters. A common use case is to then extract data from those records. 
 XPath would allow those extractions to be done very easily. I'm  proposing a 
 patch that adds simple XPath support as a UDF.
 Example usage of this the XPath UDF would be:
 {code}
 extractions = FOREACH xmlrecords GENERATE XPath(record, 'book/author'), 
 XPath(record, 'book/title');
 {code}
 The proposed UDF also caches the last XML document. This is helpful for 
 improving performance when performing multiple consecutive XPath extractions 
 on the same XML document, as in the example above. 





[jira] [Commented] (PIG-3558) ORC support for Pig

2013-12-09 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843632#comment-13843632
 ] 

Alan Gates commented on PIG-3558:
-

+1.

 ORC support for Pig
 ---

 Key: PIG-3558
 URL: https://issues.apache.org/jira/browse/PIG-3558
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.13.0

 Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch


 Adding LoadFunc and StoreFunc for ORC.





[jira] [Commented] (PIG-3548) Allow pig to load multiple paths specified in a filenames.txt

2013-11-15 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824148#comment-13824148
 ] 

Alan Gates commented on PIG-3548:
-

Could you store the parameters in a file rather than specify them on the 
command line?  See http://pig.apache.org/docs/r0.12.0/cont.html#Parameter-Sub 
for details.

 Allow pig to load multiple paths specified in a filenames.txt
 -

 Key: PIG-3548
 URL: https://issues.apache.org/jira/browse/PIG-3548
 Project: Pig
  Issue Type: Improvement
Reporter: Madhavi Nadig

 I have a list of paths stored in a filenames.txt. I would like to load them 
 all using a single LOAD command. The paths don't conform to one or more 
 regexes, so they have to be specified individually.
 So far I've used the -param option with pig to specify them. But it results 
 in an extremely long command line and I'm afraid I won't be able to scale my 
 script.
 shell : pig -param read_paths=my-long-list-of-paths something.pig
 something.pig : requests = LOAD '$read_paths' USING PigStorage(',');
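The parameter-file approach Alan points to would look roughly like this; the file name and paths are illustrative only.

{code}
-- params.txt contains one assignment per line, e.g.:
--   read_paths=/logs/2013/10/01/part-0,/logs/2013/10/02/part-0
-- invoked as: pig -param_file params.txt something.pig
requests = LOAD '$read_paths' USING PigStorage(',');
{code}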





Re: How do we determine 'stable' pig version?

2013-10-22 Thread Alan Gates
I don't think we should change our use of stable.  Our usage is in line with 
the Hadoop usage of the term in their releases.  To the best of our knowledge 
as Apache developers it is stable.  It passes all of the tests we have.  We 
have no criteria for deciding stability beyond this.

Alan.

On Oct 22, 2013, at 4:00 PM, Daniel Dai wrote:

 Yes, we can revisit. The question is how to determine stability? 0.11.1
 has been released for a while and should be considered stable, but it actually
 contains problems raised just recently. After we release 0.12.1, how soon
 should we declare it a stable release?
 
 Thanks,
 Daniel
 
 
 On Tue, Oct 22, 2013 at 2:25 PM, Koji Noguchi knogu...@yahoo-inc.com wrote:
 
 Thanks Daniel, Olga!  Keeping 3 versions would be nice.
 
 As for 'stable', can we revisit the definition?
 If it's *always* pointing to the latest release, I don't see the need for
 having this link(dir).
 Is it adding any value?
 
 Koji
 
 
 
 
 On Oct 22, 2013, at 1:43 PM, Daniel Dai da...@hortonworks.com wrote:
 
 That totally makes sense. Let's keep both downloads and documentation for 3
 versions.
 
 Thanks,
 Daniel
 
 
 On Tue, Oct 22, 2013 at 10:20 AM, Olga Natkovich onatkov...@yahoo.com
 wrote:
 
 Couple of suggestions:
 
 (1) I think we are trying to go for a more frequent release model and in
 that case it would make sense to keep perhaps 3 releases. Based on our
 experience at Yahoo, Pig 10 is the really stable release. We recently
 found
 a couple of critical bugs in 11 for which we posted patches. Also the
 community knows that we delayed a couple of key bug fixes in 12 till 12.1
 (2) Our documentation needs to be consistent with the number of releases
 we advertise as supported. Our docs currently go all the way to Pig 9.
 
 Olga
 
 
 
 On Tuesday, October 22, 2013 10:13 AM, Daniel Dai 
 da...@hortonworks.com
 wrote:
 
 Hi, Koji,
 Here is the criteria I use:
 (i) How do we determine how many releases to show on the front download
 page?
 We usually keep two most recent releases on the front page according to
 https://cwiki.apache.org/confluence/display/PIG/HowToRelease.
 
 (ii) How do we determine which release is considered 'stable' ?
 Here stable means passing all tests, peer reviewed. It does not mean
 production stable. Actually there is no way for us to know production
 stability until users download it, use it and give feedback. That's why we
 will continue fixing bugs after major releases and make minor releases.
 
 Thanks,
 Daniel
 
 
 
 On Tue, Oct 22, 2013 at 9:45 AM, Koji Noguchi knogu...@yahoo-inc.com
 wrote:
 
 
 When I went to the pig release download page (through
 http://www.apache.org/dyn/closer.cgi/pig), I only saw 0.11.1 and 0.12
 available.
 I later learned that there is an 'archive' link(
 http://archive.apache.org/dist/pig/)  that list other versions (0.8 to
 0.10).
 
 Two questions.
 
 (i) How do we determine how many releases to show on the front download
 page?
 
 (ii) How do we determine which release is considered 'stable' ?
 
 I still consider the stable version to be 0.10.1 so I was surprised not
 to
 see that available on the front download page
 and even more surprised to see release 0.12 flagged as 'stable'.
 
 Koji
 
 
 
 
 
 
 
 
 

Re: [VOTE] Release Pig 0.12.0 (candidate 2)

2013-10-07 Thread Alan Gates
+1.  Downloaded, ran commit-test, piggybank unit tests, tutorial, and simple 
local mode smoke tests.  Looked over the CHANGES, README, RELEASE_NOTES files 
to make sure they looked reasonable.

Alan.

On Oct 7, 2013, at 12:28 PM, Daniel Dai wrote:

 Hi,
 
 I have created a candidate build for Pig 0.12.0.
 
 Keys used to sign the release are available at
 http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup
 
 Please download, test, and try it out:
 
 http://people.apache.org/~daijy/pig-0.12.0-candidate-2/
 
 Should we release this? Vote closes on EOD this Thursday, Oct 10th.
 
 Thanks,
 Daniel
 




Re: [Discussion] Any thoughts on PIG-3457?

2013-09-30 Thread Alan Gates
We should separate two concerns.  If I understand correctly we don't need any 
of these changes in 0.12.  So we should revert these patches from the 0.12 
branch so that we can get it released quickly in a backwards compatible way.  

We will then have plenty of time to discuss the separate question of how we 
proceed going forward (deprecated APIs or new APIs).

Alan.

On Sep 30, 2013, at 11:45 AM, Cheolsoo Park wrote:

 Hi Jeremy,
 
 What you're saying makes sense, and a patch is welcome. ;-) But the
 complexity comes from the fact that there are many classes associated with
 one another, and it seems necessary to bring back all of them together in order
 to provide full backward compatibility.
 
 After spending many hours on the weekend, I concluded that adding more
 workarounds (classes, methods, packages, etc) to the current code makes it
 only less maintainable and readable. So I prefer a simpler approach.
 
 For example, we can just publish two jars - pig.jar w/ the old API and pig-new.jar
 w/ new API - maybe not in 0.12 but in 0.13. Since we already have a
 tez-branch, we can use it to manage the new version of classes. Then, users
 can switch to pig-new.jar gradually in 0.13 and 0.14. When we finally merge
 tez-branch into trunk, we can publish a single jar again.
 
 Of course, this is not trivial either because we have to maintain two
 branches. But I feel that managing two branches independently is easier
 than maintaining all sorts of workarounds for backward compatibility in the
 source code. In addition, we will have more flexibility in terms of
 designing new API because we will be completely free from backward
 compatibility. No?
 
 Thanks,
 Cheolsoo
 
 On Mon, Sep 30, 2013 at 11:12 AM, Jeremy Karn jk...@mortardata.com wrote:
 
 What about the option of leaving all of the MR-specific logic in the
 original classes, but marking those methods as deprecated and telling people
 to switch to an MR-specific class that extends the original one?
 So for example:
 
 JobStats - Reverted to being as it was before PIG-3419 but with all MR
 specific logic deprecated.
 MRJobStats - Would just extend JobStats.
 
 If we did this, external software could switch their code from using
 JobStats to MRJobStats at their own pace and without breaking against any
 specific version of Pig.  After a few versions the MR specific logic could
 be removed from JobStats and pushed into MRJobStats and it shouldn't break
 anything for people that had made that change.
 
 I'm not familiar with all of the changes in PIG-3419 so this might not work
 everywhere.
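The deprecate-then-subclass migration path Jeremy describes can be sketched in plain Java. This is a hypothetical illustration, not Pig's actual JobStats code; the fields and method names are simplified stand-ins:

```java
// Step 1 (now): MR-specific logic stays in JobStats but is deprecated, so
// existing callers keep compiling against any Pig version.
abstract class JobStats {
    protected int numberMaps;

    // Engine-neutral API stays here permanently.
    public abstract String getJobId();

    /** @deprecated MR-specific; callers should move to MRJobStats. */
    @Deprecated
    public int getNumberMaps() {
        return numberMaps;
    }
}

// Step 2 (now): MRJobStats starts as a thin subclass that callers can switch
// to at their own pace. In a later release the MR logic moves down here and
// the deprecated JobStats method is removed; subclass callers are unaffected.
class MRJobStats extends JobStats {
    private final String jobId;

    MRJobStats(String jobId, int numberMaps) {
        this.jobId = jobId;
        this.numberMaps = numberMaps;
    }

    @Override
    public String getJobId() {
        return jobId;
    }
}
```

External code that switches from `JobStats` to `MRJobStats` early keeps working both before and after the MR logic is pushed down, which is the point of the proposal.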
 
 
 On Mon, Sep 30, 2013 at 1:43 PM, Cheolsoo Park piaozhe...@gmail.com
 wrote:
 
 To be specific, we will need to revert all the following commits in
 order:
 
 
 commit ad1b87d4ba073680ad0a7fc8c76baeb8b611c982
 Author: Cheolsoo Park cheol...@apache.org
 Date:   Fri Sep 20 22:47:29 2013 +
 
PIG-3471: Add a base abstract class for ExecutionEngine (cheolsoo)
 
git-svn-id:
 
 
 https://svn.apache.org/repos/asf/pig/trunk@1525165 13f79535-47bb-0310-9956-ffa450edef68
 
 commit 4305a6f4737d07396ae13fd95d7c1da7933b38a1
 Author: Jianyong Dai da...@apache.org
 Date:   Wed Sep 18 19:09:49 2013 +
 
PIG-3457: Provide backward compatibility for PigStatsUtil and
 JobStats
 
git-svn-id:
 
 
 https://svn.apache.org/repos/asf/pig/trunk@1524532 13f79535-47bb-0310-9956-ffa450edef68
 
 commit e85cf34c92713aa697a1cda7a9c2b3db139350f7
 Author: Cheolsoo Park cheol...@apache.org
 Date:   Wed Sep 18 15:37:58 2013 +
 
PIG-3464: Mark ExecType and ExecutionEngine interfaces as evolving
 (cheolsoo)
 
 commit fd8b7cdf9292b305f02386d560c25298ab492a0b
 Author: Cheolsoo Park cheol...@apache.org
 Date:   Fri Aug 30 20:04:29 2013 +
 
PIG-3419: Pluggable Execution Engine (achalsoni81 via cheolsoo)
 
git-svn-id:
 
 
 https://svn.apache.org/repos/asf/pig/trunk@1519062 13f79535-47bb-0310-9956-ffa450edef68
 
 
 
 
 On Mon, Sep 30, 2013 at 10:33 AM, Daniel Dai da...@hortonworks.com
 wrote:
 
 Thanks Cheolsoo! My opinion is that providing backward compatibility for
 PigStats is a must; otherwise it could cause havoc. I imagine PigStats is
 widely used by Pig users via the PigRunner and PPNL interfaces. People use
 PigStats to collect MR job details of the Pig job. Though PigStats is
 marked Evolving, this is mostly for extending PigStats, not limiting it as
 PIG-3419 did. Even if we really need to change it, we need to communicate
 very well with users over time; Pig 0.12 is not an option.
 
 PIG-3457 tries to provide backward compatibility for PigStats, but just
 like Cheolsoo said, it is far from ideal. I now tend to agree with Rohini's
 suggestion on PIG-3419: roll back PIG-3419 until we find a better way.
 PIG-3419 seems a little premature. Besides the above-mentioned PigStats
 issue, I've already found 2 additional issues:
 1. explain shows the unoptimized logical plan instead of the optimized one
 2. HangingJobKiller is removed
 
 What do others think?
 
 

[jira] [Commented] (PIG-3468) PIG-3123 breaks e2e test Jython_Diagnostics_2

2013-09-24 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776917#comment-13776917
 ] 

Alan Gates commented on PIG-3468:
-

+1

 PIG-3123 breaks e2e test Jython_Diagnostics_2
 -

 Key: PIG-3468
 URL: https://issues.apache.org/jira/browse/PIG-3468
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.12.0

 Attachments: PIG-3468-1.patch


 PIG-3123 optimized TypeCastInserter by adding a castInserted flag for LOLoad 
 which do not need a LOForEach just to do the pruning. However, this flag is 
 also used in illustrate to visualize the output from the loader 
 (DisplayExamples:110). That's why Jython_Diagnostics_2 is broken.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Which file translates the program into a map reduce plan

2013-09-19 Thread Alan Gates
Checkout 
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java

Alan.

On Sep 19, 2013, at 3:54 PM, Abdollahian Noghabi, Shadi wrote:

 Hi,
 
 I want to find which file in pig converts the physical plan into the map 
 reduce plan. Actually, I want to get some information out of the map reduce 
 plan, but I cannot find in which file it is located. I would be more than 
 happy if anyone could guide me where is the directory and the file.
 
 Thanks,
 Shadi




[jira] [Commented] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-09-17 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769786#comment-13769786
 ] 

Alan Gates commented on PIG-3255:
-

I gave my +1 above, so we're good from my viewpoint.

 Avoid extra byte array copy in streaming deserialize
 

 Key: PIG-3255
 URL: https://issues.apache.org/jira/browse/PIG-3255
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.12

 Attachments: PIG-3255-1.patch, PIG-3255-2.patch, PIG-3255-3.patch, 
 PIG-3255-4.patch, PIG-3255-5.patch


 PigStreaming.java:
  public Tuple deserialize(byte[] bytes) throws IOException {
 Text val = new Text(bytes);  
 return StorageUtil.textToTuple(val, fieldDel);
 }
 Should remove new Text(bytes) copy and construct the tuple directly from the 
 bytes



[jira] [Commented] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-09-12 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765732#comment-13765732
 ] 

Alan Gates commented on PIG-3255:
-

+1

 Avoid extra byte array copy in streaming deserialize
 

 Key: PIG-3255
 URL: https://issues.apache.org/jira/browse/PIG-3255
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.12

 Attachments: PIG-3255-1.patch, PIG-3255-2.patch, PIG-3255-3.patch


 PigStreaming.java:
  public Tuple deserialize(byte[] bytes) throws IOException {
 Text val = new Text(bytes);  
 return StorageUtil.textToTuple(val, fieldDel);
 }
 Should remove new Text(bytes) copy and construct the tuple directly from the 
 bytes



[jira] [Commented] (PIG-3333) Fix remaining Windows core unit test failures

2013-09-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13764878#comment-13764878
 ] 

Alan Gates commented on PIG-3333:
-

+1

 Fix remaining Windows core unit test failures
 -

 Key: PIG-3333
 URL: https://issues.apache.org/jira/browse/PIG-3333
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.12

 Attachments: PIG-3333-1.patch, PIG-3333-2.patch


 I combined a bunch of Windows unit test fixes into one patch to make things 
 cleaner. They all originated from obvious Windows/Unix inconsistencies, which 
 include:
 1. Path separator inconsistency: / vs \
 2. Path component separator inconsistency: : vs ;
 3. volume: is not acceptable as a URI
 4. Unix tools/commands (eg, bash, rm) do not exist on Windows
 5. .sh scripts need a .cmd companion on Windows
 6. \r\n vs \n as newline
 7. Environment variables use different names (USER vs USERNAME)
 8. File not closed: not an issue on Unix, but an issue on Windows (not able 
 to remove an open file)



[jira] [Commented] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-09-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13764980#comment-13764980
 ] 

Alan Gates commented on PIG-3255:
-

I don't know if anyone is using StreamToPig either, but marking an interface as 
stable and then changing it without deprecation or anything isn't cool.  So no, 
I don't think this change is ok.

We could add the proposed function public Tuple deserialize(byte[] bytes, int 
offset, int length) throws IOException; to the interface and change Pig to 
call it if it's present or use the old one if not.  
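The shape of that evolution can be sketched in self-contained Java. This is an illustration only: Pig's real interface is StreamToPig, its methods throw IOException, and it produces Tuples via Hadoop's Text class; here a plain String stands in for all of that. The point is that the new offset/length overload decodes straight from a slice of a shared buffer, while a default fallback keeps old implementations working:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Old single-argument method keeps working; the new offset/length overload
// avoids materializing a trimmed byte[] copy before deserializing.
interface StreamDeserializer {
    String deserialize(byte[] bytes);

    // Fallback for old implementations: copy the slice, then delegate.
    default String deserialize(byte[] bytes, int offset, int length) {
        return deserialize(Arrays.copyOfRange(bytes, offset, offset + length));
    }
}

class DirectDeserializer implements StreamDeserializer {
    @Override
    public String deserialize(byte[] bytes) {
        return deserialize(bytes, 0, bytes.length);
    }

    @Override
    public String deserialize(byte[] bytes, int offset, int length) {
        // Decode straight from the slice; no intermediate byte[] copy.
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }
}
```

Callers holding a large input buffer can then pass (buffer, recordStart, recordLength) and skip the per-record copy entirely.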

 Avoid extra byte array copy in streaming deserialize
 

 Key: PIG-3255
 URL: https://issues.apache.org/jira/browse/PIG-3255
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.12

 Attachments: PIG-3255-1.patch, PIG-3255-2.patch, PIG-3255-3.patch


 PigStreaming.java:
  public Tuple deserialize(byte[] bytes) throws IOException {
 Text val = new Text(bytes);  
 return StorageUtil.textToTuple(val, fieldDel);
 }
 Should remove new Text(bytes) copy and construct the tuple directly from the 
 bytes



[jira] [Commented] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-09-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765123#comment-13765123
 ] 

Alan Gates commented on PIG-3255:
-

At compile time, but not at runtime.  At runtime Pig would need to reflect the 
class implementing StreamToPig and see if it contained a deserialize method 
that matches your new signature.  You could then pick which method to call 
based on that.  As Jeremy suggests, you could instead do that with a new 
interface (PigToStreamV2) and then at compile time determine which interface is 
being implemented and act accordingly.  This is actually better than what I 
initially suggested as the determination can be made at compile time.  If you 
choose this route you should also change PigToStreamV2 to an abstract class so 
that in the future we can add methods without going through this dance.
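The runtime reflection check described in the first option can be sketched as follows. This is illustrative, not Pig's code: a stand-in interface replaces StreamToPig, and a caller probes the concrete class for the wider deserialize signature, dispatching to it when present and falling back to the old copy-based path otherwise:

```java
import java.lang.reflect.Method;
import java.util.Arrays;

// Stand-in for the old stable interface.
interface OldStream {
    String deserialize(byte[] bytes);
}

final class StreamCaller {
    static String call(OldStream impl, byte[] bytes, int offset, int length) {
        try {
            // Present only on "new-style" implementations.
            Method m = impl.getClass()
                    .getMethod("deserialize", byte[].class, int.class, int.class);
            return (String) m.invoke(impl, bytes, offset, length);
        } catch (NoSuchMethodException e) {
            // Old-style implementation: fall back to copying the slice.
            return impl.deserialize(Arrays.copyOfRange(bytes, offset, offset + length));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}

class NewImpl implements OldStream {
    public String deserialize(byte[] bytes) {
        return deserialize(bytes, 0, bytes.length);
    }
    public String deserialize(byte[] bytes, int offset, int length) {
        return "new:" + new String(bytes, offset, length);
    }
}

class OldImpl implements OldStream {
    public String deserialize(byte[] bytes) {
        return "old:" + new String(bytes);
    }
}
```

The compile-time alternative Alan prefers (a V2 interface, checked with `instanceof`) avoids the per-setup reflection probe entirely, which is why it is the better route.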

 Avoid extra byte array copy in streaming deserialize
 

 Key: PIG-3255
 URL: https://issues.apache.org/jira/browse/PIG-3255
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.12

 Attachments: PIG-3255-1.patch, PIG-3255-2.patch, PIG-3255-3.patch


 PigStreaming.java:
  public Tuple deserialize(byte[] bytes) throws IOException {
 Text val = new Text(bytes);  
 return StorageUtil.textToTuple(val, fieldDel);
 }
 Should remove new Text(bytes) copy and construct the tuple directly from the 
 bytes



Re: Propose UDF

2013-09-04 Thread Alan Gates
A few questions:

1) Why did you try to use RANK?  I don't see how rank is part of this.
2) The semantics here aren't clear to me.  record_id appears to be crossed with 
name and id but name and id appear to be chosen in order.  If this is join 
semantics I'd have expected two more entries in B, one with (1, Alan, 8) and 
one with (1, Sarai, 7).  If you were just taking each element in order I'd have 
expected the last row to be (null, Sarai, 8) instead.
3) I'm not familiar with the name NLET.  Does that refer to a particular 
function or algorithm?

Alan.

On Aug 31, 2013, at 6:20 PM, Alan del Rio Mendez wrote:

 Hi Dev Team,
 
 I developed a UDF to handle the following situation on Pig 0.10 and want to
 see if I could contribute with it to the project.
 
 Let us consider a BAG A with the following data:
 
 A:{record_id:(1),names:{(ALAN),(SARAI)},ids:{(7),(8)}}
 
 and an expected bag B
 
 B:{{record_id:(1),name:(ALAN),
 id:(7)},{record_id:(1),name:(SARAI), id:(8)}}
 
 Basically I propose a UDF NLET that takes N data bags containing the same
 M elements each of them and creates M tuples with N fields and that is used
 this way:
 
 B = FOREACH A GENERATE record_id, FLATTEN(NLET(names,ids));
 
 I tried to handle the situation described above using JOIN and RANK to
 join the databags; even though that approach is not optimal, it didn't
 work either - using RANK for the join generated runtime errors:
 
 B1 = FOREACH A GENERATE record_id, FLATTEN(names);
 B11 = RANK B1;
 B2 = FOREACH A GENERATE FLATTEN(ids);
 B22 = RANK B2;
 C = JOIN B11 BY rank_B1 LEFT OUTER, B22 BY rank_B2;   -- runtime error
 
 I spent some time reading the reference manual:
 http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html
 http://pig.apache.org/docs/r0.11.0/basic.html
 and didn't identify a workaround for what I'm describing. I also read the
 UDF manual http://wiki.apache.org/pig/UDFManual to develop the NLET UDF.
 
 So far the UDF does generate the expected result/tuples but doesn't add
 the schema information. If nobody has implemented this and it is worth
 approving, I can spend time on adding the schema information and proper
 documentation.
 
 PS. I'm starting to get involved in the community and I will try to send
 emails before future development starts to avoid duplicated efforts.
 
 Best regards
 Alan del Rio
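The proposed NLET semantics (zip N bags of M elements each into M tuples of N fields) can be sketched in plain Java. This is a stand-in, not the actual UDF: a real version would be a Pig EvalFunc operating on DataBags and Tuples, whereas here Lists stand in for both:

```java
import java.util.ArrayList;
import java.util.List;

// Zip N "bags" that each hold the same number M of elements into M tuples
// of N fields, in element order.
final class Nlet {
    static List<List<Object>> zip(List<?>... bags) {
        int m = bags.length == 0 ? 0 : bags[0].size();
        for (List<?> bag : bags) {
            if (bag.size() != m) {
                throw new IllegalArgumentException("bags must have the same size");
            }
        }
        List<List<Object>> tuples = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            List<Object> tuple = new ArrayList<>();
            for (List<?> bag : bags) {
                tuple.add(bag.get(i));   // i-th element of each bag
            }
            tuples.add(tuple);
        }
        return tuples;
    }
}
```

Here zip({ALAN, SARAI}, {7, 8}) yields [(ALAN, 7), (SARAI, 8)], matching bag B in the proposal once record_id is prepended by the FOREACH.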




Re: Slow Group By operator

2013-08-22 Thread Alan Gates
When data comes out of a map task, Hadoop serializes it so that it can know its 
exact size as it writes it into the output buffer.  To run it through the 
combiner it needs to deserialize it again, and then re-serialize it when it 
comes out.  So each pass through the combiner costs a serialize/deserialization 
pass, which is expensive and not worth it unless the data reduction is 
significant.  

In other words, the combiner can be slow because Java lacks a sizeof operator.

Alan.

On Aug 22, 2013, at 4:01 AM, Benjamin Jakobus wrote:

 Hi Cheolsoo,
 
 Thanks - I will try this now and get back to you.
 
 Out of interest; could you explain (or point me towards resources that
 would) why the combiner would be a problem?
 
 Also, could the fact that Pig builds an intermediary data structure (?)
 whilst Hive just performs a sort then the arithmetic operation explain the
 slowdown?
 
 (Apologies, I'm quite new to Pig/Hive - just my guesses).
 
 Regards,
 Benjamin
 
 
 On 22 August 2013 01:07, Cheolsoo Park piaozhe...@gmail.com wrote:
 
 Hi Benjamin,
 
 Thank you very much for sharing detailed information!
 
 1) From the runtime numbers that you provided, the mappers are very slow.
 
 CPU time spent (ms): 5,081,610 / 168,740 / 5,250,350
 CPU time spent (ms): 5,052,700 / 178,220 / 5,230,920
 CPU time spent (ms): 5,084,430 / 193,480 / 5,277,910
 
 2) In your GROUP BY query, you have an algebraic UDF COUNT.
 
 I am wondering whether disabling combiner will help here. I have seen a lot
 of cases where combiner actually hurt performance significantly if it
 doesn't combine mapper outputs significantly. Briefly looking at
 generate_data.pl in PIG-200, it looks like a lot of random keys are
 generated. So I guess you will end up with a large number of small bags
 rather than a small number of large bags. If that's the case, combiner will
 only add overhead to mappers.
 
 Can you try to include "set pig.exec.nocombiner true;" and see whether
 it helps?
 
 Thanks,
 Cheolsoo
 
 
 
 
 
 
 On Wed, Aug 21, 2013 at 3:52 AM, Benjamin Jakobus jakobusbe...@gmail.com
 wrote:
 
 Hi Cheolsoo,
 
 What's your query like? Can you share it? Do you call any algebraic UDF
 after group by? I am wondering whether combiner matters in your test.
 I have been running 3 different types of queries.
 
 The first was performed on datasets of 6 different sizes:
 
 
   - Dataset size 1: 30,000 records (772KB)
   - Dataset size 2: 300,000 records (6.4MB)
   - Dataset size 3: 3,000,000 records (63MB)
   - Dataset size 4: 30 million records (628MB)
   - Dataset size 5: 300 million records (6.2GB)
   - Dataset size 6: 3 billion records (62GB)
 
 The datasets scale linearly, whereby the size equates to 3000 * 10^n.
 A seventh dataset consisting of 1,000 records (23KB) was produced to
 perform join
 operations on. Its schema is as follows:
 name - string
 marks - integer
 gpa - float
 The data was generated using the generate_data.pl Perl script available
 for download from https://issues.apache.org/jira/browse/PIG-200 to produce
 the datasets. The results are as follows:
 
 
              Set 1    Set 2    Set 3    Set 4     Set 5     Set 6
 Arithmetic   32.82    36.21    49.49    83.25     423.63    3900.78
 Filter 10%   32.94    34.32    44.56    66.68     295.59    2640.52
 Filter 90%   33.93    32.55    37.86    53.22     197.36    1657.37
 Group        49.43    53.34    69.84    105.12    497.61    4394.21
 Join         49.89    50.08    78.55    150.39    1045.34   10258.19
 
 Averaged performance of arithmetic, join, group, order, distinct select
 and filter operations on six datasets using Pig. Scripts were configured
 to use 8 reduce and 11 map tasks.
 
 
 
              Set 1    Set 2    Set 3    Set 4     Set 5     Set 6
 Arithmetic   32.84    37.33    72.55    300.08    2633.72   27821.19
 Filter 10%   32.36    53.28    59.22    209.5     1672.3    18222.19
 Filter 90%   31.23    32.68    36.8     69.55     331.88    3320.59
 Group        48.27    47.68    46.87    53.66     141.36    1233.4
 Join         48.54    56.86    104.6    517.5     4388.34   -
 Distinct     48.73    53.28    72.54    109.77    -         -
 
 Averaged performance of arithmetic, join, group, distinct select and
 filter operations on six datasets using Hive. Scripts were configured
 to use 8 reduce and 11 map tasks.
 
 (If you want to see the standard deviation, let me know).
 
 So, to summarize the results: Pig outperforms Hive, with the exception of
 using *Group By*.
 
 The Pig scripts used for this benchmark are as follows:
 *Arithmetic*
 -- 

Re: JsonLoader fails the pig job in case of malformed json input

2013-08-08 Thread Alan Gates
Definitely, please provide a patch.

Alan.

On Aug 8, 2013, at 4:58 AM, Demeter Sztanko wrote:

 Hi all,
 
 Suppose I have a text file that contains only one line:
 {"a", "bad"}
 
 This is obviously not valid JSON.
 
 This input fails this simple script:
 b = load 'bad.input' using JsonLoader('a0: chararray');
 dump b;
 
 
 Same script works fine for this line:
 {"a": "good"}
 
 I was expecting it to just skip the line and continue.
 
 I could not find any bug report for this. Is anyone working on it?
 If not, would you mind if I submit a patch for it?
 Simple handling of the exception seems to solve the problem.
 
 Thanks,
 
 Dimi.
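The skip-bad-records pattern being proposed can be sketched in self-contained Java. A real fix would catch the JSON parser's exception inside JsonLoader and advance to the next input line; since the real parser isn't reproduced here, a toy validity check stands in for it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class SkipBadRecords {
    // Toy stand-in for the JSON parser: a record is "valid" only if it
    // looks like {"key": "value"}.
    static Optional<String> tryParse(String line) {
        try {
            if (!line.matches("\\{\\s*\"[^\"]+\"\\s*:\\s*\"[^\"]+\"\\s*\\}")) {
                throw new IllegalArgumentException("malformed: " + line);
            }
            return Optional.of(line);
        } catch (IllegalArgumentException e) {
            // Swallow the error and signal "skip this record".
            return Optional.empty();
        }
    }

    static List<String> loadAll(List<String> lines) {
        List<String> good = new ArrayList<>();
        for (String line : lines) {
            tryParse(line).ifPresent(good::add);   // bad lines are skipped
        }
        return good;
    }
}
```

One design note: silently dropping records can mask data problems, so a production version would typically also bump a "bad record" counter or log a warning per skipped line.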



Re: Pig and Storm

2013-07-24 Thread Alan Gates
This sounds exciting.  The next question is how do you plan to do it?  Would a 
physical plan be translated to a Storm job (or jobs)?  Would it need a 
different physical plan?  Or would you just have the connection at the language 
layer and all the planning separate?  Do you envision needing 
extensions/changes to the language to support Storm?  Feel free to add a page 
to Pig's wiki with your thoughts on an approach.

Alan.

On Jul 23, 2013, at 9:52 AM, Pradeep Gollakota wrote:

 Hi Pig Developers,
 
 I wanted to reach out to you all and ask for your opinion on something.
 
 As a Pig user, I have come to love Pig as a framework. Pig provides a great
 set of abstractions that make working with large datasets easy. Currently
 Pig is only backed by Hadoop. However, with the rise of Twitter Storm
 as a distributed real-time processing engine, Pig users are missing out on
 a great opportunity to be able to work with Pig in Storm. As a user of Pig,
 Hadoop and Storm, and keeping with the Pig philosophy of "Pigs live
 anywhere", I'd like to get your thoughts on starting the implementation of
 a Pig backend for Storm.
 
 Thanks
 Pradeep



[jira] [Updated] (PIG-2248) Pig parser does not detect when a macro name masks a UDF name

2013-07-24 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-2248:


Status: Open  (was: Patch Available)

Canceling patch as discussion is still on-going as to best approach

 Pig parser does not detect when a macro name masks a UDF name
 -

 Key: PIG-2248
 URL: https://issues.apache.org/jira/browse/PIG-2248
 Project: Pig
  Issue Type: Bug
  Components: parser
Affects Versions: 0.9.0
Reporter: Alan Gates
Assignee: Johnny Zhang
Priority: Minor
 Attachments: PIG-2248.patch.txt, PIG-2248.patch.txt, 
 PIG-2248.patch.txt, PIG-2248.patch.txt


 Pig accepts a macro like:
 {code}
 define COUNT(in_relation, min_gpa) returns c {
 b = filter $in_relation by gpa >= $min_gpa;
$c = foreach b generate age, name;
}
 {code}
 This should produce a warning that it is masking a UDF.



[jira] [Commented] (PIG-3389) Set job.name does not work with dump command

2013-07-24 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718904#comment-13718904
 ] 

Alan Gates commented on PIG-3389:
-

+1

 Set job.name does not work with dump command
 --

 Key: PIG-3389
 URL: https://issues.apache.org/jira/browse/PIG-3389
 Project: Pig
  Issue Type: Bug
  Components: grunt
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Minor
 Fix For: 0.12

 Attachments: PIG-3389.patch


 The job.name property can be used to overwrite the default job name in Pig, 
 but the dump command does not honor it.
 To reproduce the issue, run the following commands in Grunt shell in MR mode:
 {code}
 SET job.name 'FOO';
 a = LOAD '/foo';
 DUMP a;
 {code}
 You will see the job name is not 'FOO' in the JT UI. However, using store 
 instead of dump sets the job name correctly.



[jira] [Updated] (PIG-3247) Piggybank functions to mimic OVER clause in SQL

2013-07-19 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3247:


  Resolution: Fixed
Release Note: Added OVER clause like functionality in Piggybank.
  Status: Resolved  (was: Patch Available)

Patch committed.  Thanks Cheolsoo for the review.

 Piggybank functions to mimic OVER clause in SQL
 ---

 Key: PIG-3247
 URL: https://issues.apache.org/jira/browse/PIG-3247
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.12

 Attachments: Over.2.patch, Over.patch


 In order to test Hive I have written some UDFs to mimic the behavior of SQL's 
 OVER clause.  I thought they would be useful to share.



[jira] [Resolved] (PIG-3372) test

2013-07-16 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-3372.
-

Resolution: Invalid

 test
 

 Key: PIG-3372
 URL: https://issues.apache.org/jira/browse/PIG-3372
 Project: Pig
  Issue Type: Test
  Components: impl
Reporter: Manuel
Priority: Trivial

 test



Fwd: DesignLounge @ HadoopSummit

2013-06-24 Thread Alan Gates


Begin forwarded message:

 From: Eric Baldeschwieler eri...@hortonworks.com
 Date: June 23, 2013 9:32:12 PM PDT
 To: common-...@hadoop.apache.org common-...@hadoop.apache.org, 
 mapreduce-...@hadoop.apache.org mapreduce-...@hadoop.apache.org, 
 hdfs-...@hadoop.apache.org hdfs-...@hadoop.apache.org
 Subject: DesignLounge @ HadoopSummit
 Reply-To: common-...@hadoop.apache.org
 
 Hi Folks,
 
 I've integrated the feedback I've gotten on the design lounge.  A couple of 
 clarifications:
 
 1) The space will be open both days of the formal summit.  Apache Committers 
 / contributors are invited to stop by any time and use the space to meet / 
 network any time during the show.
 
 2) Below I've listed the times that various project members have suggested 
 they will be present to talk with others contributors about their project.  
 If we get a big showing for any of these slots we'll encourage folks to do 
 the unconference thing: Select a set of topics they want to talk about and 
 break up into groups to do so.
 
 3) This is an experiment.  Our goal is to make the summit as useful as 
 possible to the folks who build the Apache projects in the Apache Hadoop 
 stack.  Please let me know how it works for you and ideas for making this 
 even more effective.
 
 Committed times so far, with topic champion (Note - I've adjusted suggested 
 times to fit with the program a bit more smoothly):
 
 Wednesday
 11-1 - Hive - Ashutosh - The stinger initiative and other Hive activities
 2 - 4 - Security breakout - Kevin Minder - HSSO, Knox, Rhino
 3 - 4 - Frameworks to run services like HBase on Yarn - Weave, Hoya … - 
 Devaraj Das
 4 - 5 - Accumulo - Billie Rinaldi
 
 
 Thursday
 11-1 - Finishing Yarn - Arun Murthy - Near term improvements needed
 2 - 4 - HDFS - Suresh & Sanjay
 4 - 5 - Getting involved in Apache - Billie Rinaldi
 
 
 See you all soon!
 
 E14
 
 PS Please forward to other Apache -dev lists and CC me.  Thanks!
 
 On Jun 11, 2013, at 10:42 AM, Eric Baldeschwieler eri...@hortonworks.com 
 wrote:
 
 Hi Folks,
 
 We thought we'd try something new at Hadoop Summit this year to build upon 
 two pieces of feedback I've heard a lot this year:
 
  • Apache project developers would like to take advantage of the Hadoop 
 summit to meet with their peers to on work on specific technical details of 
 their projects
  • That they want to do this during the summit, not before it starts or 
 at night. I've been told BoFs and other such traditional formats have not 
 historically worked for them, because they end up being about educating 
 users about their projects, not actually working with their peers on how to 
 make their projects better.
 So we are creating a space in the summit - marked in the event guide as 
 DesignLounge - concurrent with the presentation tracks where Apache Project 
 contributors can meet with their peers to plan the future of their project 
 or work through various technical issues near and dear to their hearts.
 
 We're going to provide white boards and message boards and let folks take it 
 from there in an unconference style.  We think there will be room for about 
 4 groups to meet at once.  Interested? Let me know what you think.  Send me 
 any ideas for how we can make this work best for you.
 
 The room will be 231A and B at the Hadoop Summit and will run from 10:30am 
 to 5:00pm on Day 1 (26th June), and we can also run from 10:30am to 5:00pm 
on Day 2 (27th June) if we have a lot of topics that folks want to cover.
 
 Some of the early topics some folks told me they hope can be covered:
 
  • Hadoop Core security proposals.  There are a couple of detailed 
 proposals circulating.  Let's get together and hash out the differences.
  • Accumulo 1.6 features
  • The Hive vectorization project.  Discussion of the design and how to 
 phase it in incrementally with minimum complexity.
  • Finishing Yarn - what things need to get done NOW to make Yarn more 
 effective
 If you are a project lead for one of the Apache projects, look at the 
 schedule below and suggest a few slots when you think it would be best for 
 your project to meet.  I'll try to work out a schedule where no more than 2 
 projects are using the lounge at once.  
 
 Day 1, 26th June: 10:30am - 12:30pm, 1:45pm - 3:30pm, 3:45pm - 5:00pm
 
 Day 2, 27th June: 10:30am - 12:30pm, 1:45pm - 3:30pm, 3:45pm - 5:00pm
 
 It will be up to you, the hadoop contributors, from there.
 
 Look forward to seeing you all at the summit,
 
 E14
 
 PS Please forward to the other -dev lists.  This event is for folks on the 
 -dev lists.
 
 



Fwd: DesignLounge @ HadoopSummit

2013-06-13 Thread Alan Gates


Begin forwarded message:

 From: Eric Baldeschwieler eri...@hortonworks.com
 Date: June 11, 2013 10:46:25 AM PDT
 To: common-...@hadoop.apache.org common-...@hadoop.apache.org
 Subject: DesignLounge @ HadoopSummit
 Reply-To: common-...@hadoop.apache.org
 
 Hi Folks,
 
 We thought we'd try something new at Hadoop Summit this year to build upon 
 two pieces of feedback I've heard a lot this year:
 
Apache project developers would like to take advantage of the Hadoop summit 
to meet with their peers to work on specific technical details of their 
projects
 That they want to do this during the summit, not before it starts or at 
 night. I've been told BoFs and other such traditional formats have not 
 historically worked for them, because they end up being about educating users 
 about their projects, not actually working with their peers on how to make 
 their projects better.
 So we are creating a space in the summit - marked in the event guide as 
 DesignLounge - concurrent with the presentation tracks where Apache Project 
 contributors can meet with their peers to plan the future of their project or 
 work through various technical issues near and dear to their hearts.
 
 We're going to provide white boards and message boards and let folks take it 
 from there in an unconference style.  We think there will be room for about 4 
 groups to meet at once.  Interested? Let me know what you think.  Send me any 
 ideas for how we can make this work best for you.
 
 The room will be 231A and B at the Hadoop Summit and will run from 10:30am to 
 5:00pm on Day 1 (26th June), and we can also run from 10:30am to 5:00pm on 
Day 2 (27th June) if we have a lot of topics that folks want to cover.
 
 Some of the early topics some folks told me they hope can be covered:
 
 Hadoop Core security proposals.  There are a couple of detailed proposals 
 circulating.  Let's get together and hash out the differences.
 Accumulo 1.6 features
 The Hive vectorization project.  Discussion of the design and how to phase it 
 in incrementally with minimum complexity.
 Finishing Yarn - what things need to get done NOW to make Yarn more effective
 If you are a project lead for one of the Apache projects, look at the 
 schedule below and suggest a few slots when you think it would be best for 
 your project to meet.  I'll try to work out a schedule where no more than 2 
 projects are using the lounge at once.  
 
 Day 1, 26th June: 10:30am - 12:30pm, 1:45pm - 3:30pm, 3:45pm - 5:00pm
 
 Day 2, 27th June: 10:30am - 12:30pm, 1:45pm - 3:30pm, 3:45pm - 5:00pm
 
 It will be up to you, the hadoop contributors, from there.
 
 Look forward to seeing you all at the summit,
 
 E14
 
 PS Please forward to the other -dev lists.  This event is for folks on the 
 -dev lists.
 



Re: Uploading patches for review

2013-06-06 Thread Alan Gates
I think it's fine for a reviewer to ask for a particular patch to be put in 
review board.  I think it would also be fine to put in our HowToContribute doc 
that for larger patches putting it in review board may help get it reviewed 
more quickly.  I'm not in favor of requiring it, as some reviewers don't use 
review board.

Alan.

On Jun 6, 2013, at 2:21 AM, Rohini Palaniswamy wrote:

 Hi,
Reviewing uploaded patches is easy for a few lines of change. But when
 the change is larger it is hard to read, review is more time-consuming, and at
 times you have to switch between the patch and Eclipse to get more context.
 Without the surrounding code it is also easy to miss things in review. Can
 we make it a practice to put patches up on Review Board for review when they
 are somewhat bigger? Commenting on the patch is also a
 breeze in Review Board.
 
 Thoughts ???
 
 Regards,
 Rohini



[jira] [Updated] (PIG-2956) Invalid cache specification for some streaming statement

2013-05-29 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-2956:


Status: Patch Available  (was: Open)

 Invalid cache specification for some streaming statement
 

 Key: PIG-2956
 URL: https://issues.apache.org/jira/browse/PIG-2956
 Project: Pig
  Issue Type: Sub-task
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.12

 Attachments: PIG-2956-1_0.10.patch, PIG-2956-1.patch, PIG-2956-2.patch


 Another category of failure in e2e tests, such as ComputeSpec_1, 
 ComputeSpec_2, ComputeSpec_3, RaceConditions_1, RaceConditions_3, 
 RaceConditions_4, RaceConditions_7, RaceConditions_8.
 Here is stack:
 ERROR 6003: Invalid cache specification. File doesn't exist: C:/Program Files 
 (x86)/GnuWin32/bin/head.exe
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException:
  ERROR 2017: Internal error creating job configuration.
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:723)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:258)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:151)
 at org.apache.pig.PigServer.launchPlan(PigServer.java:1318)
 at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1303)
 at org.apache.pig.PigServer.execute(PigServer.java:1293)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:364)
 at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:133)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
 at org.apache.pig.Main.run(Main.java:561)
 at org.apache.pig.Main.main(Main.java:111)
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6003: 
 Invalid cache specification. File doesn't exist: C:/Program Files 
 (x86)/GnuWin32/bin/head.exe
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.setupDistributedCache(JobControlCompiler.java:1151)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.setupDistributedCache(JobControlCompiler.java:1129)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:447)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2956) Invalid cache specification for some streaming statement

2013-05-29 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669566#comment-13669566
 ] 

Alan Gates commented on PIG-2956:
-

+1

 Invalid cache specification for some streaming statement
 

 Key: PIG-2956
 URL: https://issues.apache.org/jira/browse/PIG-2956
 Project: Pig
  Issue Type: Sub-task
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.12

 Attachments: PIG-2956-1_0.10.patch, PIG-2956-1.patch, PIG-2956-2.patch


 Another category of failure in e2e tests, such as ComputeSpec_1, 
 ComputeSpec_2, ComputeSpec_3, RaceConditions_1, RaceConditions_3, 
 RaceConditions_4, RaceConditions_7, RaceConditions_8.
 Here is stack:
 ERROR 6003: Invalid cache specification. File doesn't exist: C:/Program Files 
 (x86)/GnuWin32/bin/head.exe
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException:
  ERROR 2017: Internal error creating job configuration.
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:723)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:258)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:151)
 at org.apache.pig.PigServer.launchPlan(PigServer.java:1318)
 at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1303)
 at org.apache.pig.PigServer.execute(PigServer.java:1293)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:364)
 at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:133)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
 at org.apache.pig.Main.run(Main.java:561)
 at org.apache.pig.Main.main(Main.java:111)
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6003: 
 Invalid cache specification. File doesn't exist: C:/Program Files 
 (x86)/GnuWin32/bin/head.exe
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.setupDistributedCache(JobControlCompiler.java:1151)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.setupDistributedCache(JobControlCompiler.java:1129)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:447)



[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-29 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669593#comment-13669593
 ] 

Alan Gates commented on PIG-3257:
-

Would it make you happy if we added to the javadoc comments on this function 
not to use it as a key in the same job it's generated in?

 Add unique identifier UDF
 -

 Key: PIG-3257
 URL: https://issues.apache.org/jira/browse/PIG-3257
 Project: Pig
  Issue Type: Improvement
  Components: internal-udfs
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.12

 Attachments: PIG-3257.patch


 It would be good to have a Pig function to generate unique identifiers.



[jira] [Commented] (PIG-3333) Fix remaining Windows core unit test failures

2013-05-29 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669771#comment-13669771
 ] 

Alan Gates commented on PIG-3333:
-

StreamingCommand.addPathToCache - This appears to always convert the path from 
/ to \.  Don't we only want to do this in the Windows case?  Alternatively we 
could always convert / and \ to System.getProperty("file.separator").

JavaCompilerHelp.addClassToPath - Rather than an if on Windows/Unix, why not just 
change it to 
{code}
this.classPath = this.classPath + System.getProperty("path.separator") + path;
{code}

It looks like a bunch of \r's slipped into TestSample.java
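The suggestion above can be sketched with the JDK's built-in separator constants, which avoid per-platform branches entirely (the method names here are illustrative, not Pig's actual code):

```java
import java.io.File;

public class PortablePaths {
    // Build a classpath string portably: File.pathSeparator is ";" on Windows
    // and ":" on Unix, equivalent to System.getProperty("path.separator").
    static String addToClassPath(String classPath, String entry) {
        return classPath + File.pathSeparator + entry;
    }

    // Normalize a path to the platform's component separator instead of
    // hard-coding a "/" to "\" conversion.
    static String normalize(String path) {
        return path.replace('/', File.separatorChar)
                   .replace('\\', File.separatorChar);
    }

    public static void main(String[] args) {
        System.out.println(addToClassPath("a.jar", "b.jar"));
        System.out.println(normalize("foo/bar"));
    }
}
```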



 Fix remaining Windows core unit test failures
 -

 Key: PIG-3333
 URL: https://issues.apache.org/jira/browse/PIG-3333
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.12

 Attachments: PIG-3333-1.patch


 I combine a bunch of Windows unit test fixes into one patch to make things 
 cleaner. They all originated from obvious Windows/Unix inconsistencies, which 
 include:
 1. Path separator inconsistency: / vs \
 2. Path component separator inconsistency: : vs ;
 3. volume: is not acceptable as a URI
 4. Unix tools/commands (eg, bash, rm) do not exist in Windows
 5. .sh scripts need a .cmd companion in Windows
 6. \r\n vs \n as newline
 7. Environment variables use different names (USER vs USERNAME)
 8. File not closed: not an issue in Unix, but an issue in Windows (not able 
 to remove an open file)
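Most of the categories in this list map to standard portable Java idioms; a brief stdlib sketch of those idioms (illustrative only, not Pig's actual fix):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PortabilityDemo {
    public static void main(String[] args) throws IOException {
        // 1 & 2: take separators from the platform instead of hard-coding them.
        String fileSep = java.io.File.separator;      // "\" on Windows, "/" on Unix
        String pathSep = java.io.File.pathSeparator;  // ";" on Windows, ":" on Unix

        // 6: the platform newline instead of a literal "\n".
        String newline = System.lineSeparator();

        // 7: the user.name system property works on both platforms, unlike
        // the USER / USERNAME environment variables.
        String user = System.getProperty("user.name");

        // 8: try-with-resources guarantees the file is closed; Windows
        // cannot delete a file that is still open.
        Path tmp = Files.createTempFile("demo", ".txt");
        try (BufferedReader r = Files.newBufferedReader(tmp)) {
            r.readLine();
        }
        Files.delete(tmp);  // would fail on Windows if r were still open

        System.out.println(user + pathSep + fileSep + newline);
    }
}
```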



[jira] [Commented] (PIG-3334) Fix Windows piggybank unit test failures

2013-05-29 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669774#comment-13669774
 ] 

Alan Gates commented on PIG-3334:
-

+1

 Fix Windows piggybank unit test failures
 

 Key: PIG-3334
 URL: https://issues.apache.org/jira/browse/PIG-3334
 Project: Pig
  Issue Type: Sub-task
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.12

 Attachments: PIG-3334-1.patch






[jira] [Commented] (PIG-3337) Fix remaining Window e2e tests

2013-05-29 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669776#comment-13669776
 ] 

Alan Gates commented on PIG-3337:
-

+1

 Fix remaining Window e2e tests
 --

 Key: PIG-3337
 URL: https://issues.apache.org/jira/browse/PIG-3337
 Project: Pig
  Issue Type: Sub-task
  Components: e2e harness
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.12

 Attachments: PIG-3337-1.patch






[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-28 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668691#comment-13668691
 ] 

Alan Gates commented on PIG-3257:
-

No it would not, but it would be very weird to use this as a key anyway, since 
it would produce a different random key for each record.  I can't see how it 
would matter whether it produced random key X1 vs random key X2 for any given 
record.

 Add unique identifier UDF
 -

 Key: PIG-3257
 URL: https://issues.apache.org/jira/browse/PIG-3257
 Project: Pig
  Issue Type: Improvement
  Components: internal-udfs
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.12

 Attachments: PIG-3257.patch


 It would be good to have a Pig function to generate unique identifiers.



[jira] [Comment Edited] (PIG-3257) Add unique identifier UDF

2013-05-28 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668748#comment-13668748
 ] 

Alan Gates edited comment on PIG-3257 at 5/28/13 10:32 PM:
---

I don't see how records can be missing or redundant.  Take the following query:

{code}
A = load ...
B = group A by UUID();
C = foreach B...
{code}

This won't reduce at all.  For every record it is totally irrelevant what 
particular value its key is, because it's guaranteed to be unique for each 
record.  So 1) this is a totally meaningless thing to do; 2) if a particular 
map does get rerun or is used in speculative execution it doesn't matter 
because which particular key is generated by UUID is irrelevant.  The way this 
is intended to be used is something like this:

{code}
A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate *, UUID();
C = group B by s;
D = foreach C generate flatten(B), SUM(B.i) as sum_b;
E = group B by si;
F = foreach E generate flatten(B), SUM(B.f) as sum_f;
G = join D by uuid, F by uuid;
H = foreach G generate D::B::s, sum_b, sum_f;
store H into 'output';
{code}


  was (Author: alangates):
I don't see how records can be missing or redundant.  Take the following 
query:

{code}
A = load ...
B = group A by UUID();
C = foreach B...
{code}

This won't reduce at all.  For every record it is totally irrelevant what 
particular value its key is, because it's guaranteed to be unique for each 
record.  So 1) this is a totally meaningless thing to do; 2) if a particular 
map does get rerun or is used in speculative execution it doesn't matter 
because which particular key is generated by UUID is irrelevant.  The way this 
is intended to be used is something like this:

{code}
A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate *, UUID();
C = group B by s;
D = foreach C generate flatten(B), SUM(B.i) as sum_b;
E = group B by si;
F = foreach E generate flatten(B), SUM(B.f) as sum_f;
G = join D by uuid, F by uuid;
H = foreach G generate D::B::s, sum_b, sum_f;
store H into 'output';
{code}

  
 Add unique identifier UDF
 -

 Key: PIG-3257
 URL: https://issues.apache.org/jira/browse/PIG-3257
 Project: Pig
  Issue Type: Improvement
  Components: internal-udfs
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.12

 Attachments: PIG-3257.patch


 It would be good to have a Pig function to generate unique identifiers.



[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-28 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668748#comment-13668748
 ] 

Alan Gates commented on PIG-3257:
-

I don't see how records can be missing or redundant.  Take the following query:

{code}
A = load ...
B = group A by UUID();
C = foreach B...
{code}

This won't reduce at all.  For every record it is totally irrelevant what 
particular value its key is, because it's guaranteed to be unique for each 
record.  So 1) this is a totally meaningless thing to do; 2) if a particular 
map does get rerun or is used in speculative execution it doesn't matter 
because which particular key is generated by UUID is irrelevant.  The way this 
is intended to be used is something like this:

{code}
A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate *, UUID();
C = group B by s;
D = foreach C generate flatten(B), SUM(B.i) as sum_b;
E = group B by si;
F = foreach E generate flatten(B), SUM(B.f) as sum_f;
G = join D by uuid, F by uuid;
H = foreach G generate D::B::s, sum_b, sum_f;
store H into 'output';
{code}
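The proposed UDF is essentially a wrapper around the JDK's UUID generator; a minimal stand-alone sketch of the behavior being discussed (each call yields a fresh value, which is why grouping on it never collapses records):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class UuidDemo {
    // Each call returns a fresh identifier, so grouping records by UUID()
    // produces one group per record and the job "won't reduce at all".
    static String nextId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            seen.add(nextId());
        }
        System.out.println(seen.size());  // 1000: no duplicates
    }
}
```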


 Add unique identifier UDF
 -

 Key: PIG-3257
 URL: https://issues.apache.org/jira/browse/PIG-3257
 Project: Pig
  Issue Type: Improvement
  Components: internal-udfs
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.12

 Attachments: PIG-3257.patch


 It would be good to have a Pig function to generate unique identifiers.



Re: CHANGES.txt in trunk

2013-05-06 Thread Alan Gates
Cool, just wanted to make sure.  I agree this is a good idea.

Alan.

On May 5, 2013, at 7:06 PM, Rohini Palaniswamy wrote:

 Alan,
  I meant relocating only - Moving jiras from 0.12 to 0.11.x releases
 section :).
 
 Regards,
 Rohini
 
 
 On Fri, May 3, 2013 at 3:08 PM, Alan Gates ga...@hortonworks.com wrote:
 
What do you mean by remove?  They should still be in the file.  They may need
 to be relocated under the 0.11 section.  But the trunk CHANGES file should
 include all changes that are on trunk.
 
 Alan.
 
 On May 3, 2013, at 1:34 PM, Rohini Palaniswamy wrote:
 
 Hi,
  I see lot of patches that went into 0.11 are under trunk in the
 CHANGES.txt. Should we sync the file with the CHANGES.txt in branch-0.11
 and remove those jiras from trunk that went into 0.11? What is the usual
 process of updating CHANGES.txt when a jira is checked both into a branch
 and also trunk?
 
 Regards,
 Rohini
 
 



Re: CHANGES.txt in trunk

2013-05-03 Thread Alan Gates
What do you mean by remove?  They should still be in the file.  They may need to be 
relocated under the 0.11 section.  But the trunk CHANGES file should include 
all changes that are on trunk.

Alan.

On May 3, 2013, at 1:34 PM, Rohini Palaniswamy wrote:

 Hi,
   I see lot of patches that went into 0.11 are under trunk in the
 CHANGES.txt. Should we sync the file with the CHANGES.txt in branch-0.11
 and remove those jiras from trunk that went into 0.11? What is the usual
 process of updating CHANGES.txt when a jira is checked both into a branch
 and also trunk?
 
 Regards,
 Rohini



Re: A major addition to Pig. Working with spatial data

2013-05-02 Thread Alan Gates
I know this is frustrating, but the different licenses do have different 
requirements that make it so that Apache can't ship GPL code.  A legal 
explanation is at http://www.apache.org/licenses/GPL-compatibility.html  For 
additional info on the LGPL specific questions see 
http://www.apache.org/legal/3party.html

As far as pulling it in via ivy, the issue isn't so much where the code lives 
as much as what code we are requiring to make Pig work.  If something that is 
[L]GPL is required for Pig it violates Apache rules as outlined above.  It also 
would be a show stopper for a lot of companies that redistribute Pig and that 
are allergic to GPL software.

So, as I said before, if you wanted to continue with that library and they are 
not willing to relicense it then it would have to be bolted on after Apache Pig 
is built.  Nothing stops you from doing this by downloading Apache Pig, adding 
this library and your code, and redistributing, though it wouldn't then be open 
to all Pig users.

Alan.

On May 1, 2013, at 6:08 PM, Ahmed Eldawy wrote:

 Thanks for your response. I was never good at differentiating all those
 open source licenses. I mean, what is the point of making open source licenses
 if they block me from using a library in an open source project? Anyway,
 I'm not going to get into a debate here. Just one question: if we use JTS as a
 library (jar file) without adding the code in Pig, is it still a violation?
 We'll use ivy, for example, to download the jar file when compiling.
 On May 1, 2013 7:50 PM, Alan Gates ga...@hortonworks.com wrote:
 
 Passing on the technical details for a moment, I see a licensing issue.
 JTS is licensed under LGPL.  Apache projects cannot contain or ship
 [L]GPL.  Apache does not meet the requirements of GPL and thus we cannot
 repackage their code. If you wanted to go forward using that class this
 would have to be packaged as an add on that was downloaded separately and
 not from Apache.  Another option is to work with the JTS community and see
 if they are willing to dual license their code under BSD or Apache license
 so that Pig could include it.  If neither of those are an option you would
 need to come up with a new class to contain your spatial data.
 
 Alan.
 
 On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:
 
 Hi all,
 First, sorry for the long email. I wanted to put all my thoughts here
 and
 get your feedback.
 I'm proposing a major addition to Pig that will greatly increase its
 functionality and user base. It is simply to add spatial support to the
 language and the framework. I've already started working on that but I
 don't want it to be just another branch. I want it, eventually, to be
 merged with the trunk of Apache Pig. So, I'm sending this email mainly to
reach out to the main contributors of Pig to see the feasibility of this.
 This addition is a part of a big project we have been working on in
 University of Minnesota; the project is called Spatial Hadoop.
 http://spatialhadoop.cs.umn.edu. It's about building a MapReduce
 framework
 (Hadoop) that is capable of maintaining and analyzing spatial data
 efficiently. I'm the main guy behind that project and since we released
 its
 first version, we received very encouraging responses from different
 groups
 in the research and industrial community. I'm sure the addition we want
 to
 make to Pig Latin will be widely accepted by the people in the spatial
 community.
 I'm proposing a plan here while we're still in the early phases of this
 task to be able to discuss it with the main contributors and see its
 feasibility. First of all, I think that we need to change the core of Pig
 to be able to support spatial data. Providing a set of UDFs only is not
 enough. The main reason is that Pig Latin does not provide a way to
 create
 a new data type which is needed for spatial data. Once we have the
 spatial
 data types we need, the functionality can be expanded using more UDFs.
 
 Here's the plan as I see it.
 1- Introduce a new primitive data type Geometry which represents all
 spatial data types. In the underlying system, this will map to
 com.vividsolutions.jts.geom.Geometry. This is a class from Java Topology
 Suite (JTS) [http://www.vividsolutions.com/jts/JTSHome.htm], a stable
 and
 efficient open source Java library for spatial data types and algorithms.
 It is very popular in the spatial community and a C++ port of it is used
 in
 PostGIS [http://postgis.net/] (a spatial library for Postgres). JTS also
 conforms with Open Geospatial Consortium (OGC) [
 http://www.opengeospatial.org/] which is an open standard for the
 spatial
 data types. The Geometry data type is read from and written to text files
 using the Well Known Text (WKT) format. There is also a way to convert it
 to/from binary so that it can work with binary files and streams.
 2- Add functions that manipulate spatial data types. These will be added
 as
 UDFs and we will not need to mess with the internals of Pig. Most
 probably

Re: A major addition to Pig. Working with spatial data

2013-05-01 Thread Alan Gates
Passing on the technical details for a moment, I see a licensing issue.  JTS is 
licensed under LGPL.  Apache projects cannot contain or ship [L]GPL.  Apache 
does not meet the requirements of GPL and thus we cannot repackage their code. 
If you wanted to go forward using that class this would have to be packaged as 
an add on that was downloaded separately and not from Apache.  Another option 
is to work with the JTS community and see if they are willing to dual license 
their code under BSD or Apache license so that Pig could include it.  If 
neither of those are an option you would need to come up with a new class to 
contain your spatial data.

Alan.

On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:

 Hi all,
  First, sorry for the long email. I wanted to put all my thoughts here and
 get your feedback.
  I'm proposing a major addition to Pig that will greatly increase its
 functionality and user base. It is simply to add spatial support to the
 language and the framework. I've already started working on that but I
 don't want it to be just another branch. I want it, eventually, to be
 merged with the trunk of Apache Pig. So, I'm sending this email mainly to
reach out to the main contributors of Pig to see the feasibility of this.
 This addition is a part of a big project we have been working on in
 University of Minnesota; the project is called Spatial Hadoop.
 http://spatialhadoop.cs.umn.edu. It's about building a MapReduce framework
 (Hadoop) that is capable of maintaining and analyzing spatial data
 efficiently. I'm the main guy behind that project and since we released its
 first version, we received very encouraging responses from different groups
 in the research and industrial community. I'm sure the addition we want to
 make to Pig Latin will be widely accepted by the people in the spatial
 community.
 I'm proposing a plan here while we're still in the early phases of this
 task to be able to discuss it with the main contributors and see its
 feasibility. First of all, I think that we need to change the core of Pig
 to be able to support spatial data. Providing a set of UDFs only is not
 enough. The main reason is that Pig Latin does not provide a way to create
 a new data type which is needed for spatial data. Once we have the spatial
 data types we need, the functionality can be expanded using more UDFs.
 
 Here's the plan as I see it.
 1- Introduce a new primitive data type Geometry which represents all
 spatial data types. In the underlying system, this will map to
 com.vividsolutions.jts.geom.Geometry. This is a class from Java Topology
 Suite (JTS) [http://www.vividsolutions.com/jts/JTSHome.htm], a stable and
 efficient open source Java library for spatial data types and algorithms.
 It is very popular in the spatial community and a C++ port of it is used in
 PostGIS [http://postgis.net/] (a spatial library for Postgres). JTS also
 conforms with Open Geospatial Consortium (OGC) [
 http://www.opengeospatial.org/] which is an open standard for the spatial
 data types. The Geometry data type is read from and written to text files
 using the Well Known Text (WKT) format. There is also a way to convert it
 to/from binary so that it can work with binary files and streams.
 2- Add functions that manipulate spatial data types. These will be added as
 UDFs and we will not need to mess with the internals of Pig. Most probably,
 there will be one new class for each operation (e.g., union or
 intersection). I think it would be good to put these new operations inside
 the core of Pig so that users can use them without having to write the fully
 qualified class name. Also, since there is no way to implicitly cast a
 spatial data type to a non-spatial data type, there will not be any
 conflicts between existing operations and new ones. All new operations, and
 only the new operations, will work on spatial data types. Here is an
 initial list of operations that can be added. All of these operations are
 already implemented in JTS, and the UDFs added to Pig will be just wrappers
 around them.
 **Predicates (used for spatial filtering)
 Equals
 Disjoint
 Intersects
 Touches
 Crosses
 Within
 Contains
 Overlaps
 
 **Operations
 Envelope
 Area
 Length
 Buffer
 ConvexHull
 Intersection
 Union
 Difference
 SymDifference
 
 **Aggregate functions
 Accum
 ConvexHull
 Union
 
 3- The third step is to implement spatial indexes (e.g., Grid or R-tree).
 Pig loader and storer classes will be created for those indexes. Note
 that we currently have SpatialOutputFormat and SpatialInputFormat for those
 indexes inside the Spatial Hadoop project, but we need to tweak them to
 work with Pig.
 
 4- (Advanced) Implement more sophisticated algorithms for spatial
 operations that utilize the indexes. For example, we can have a specific
 algorithm for spatial range query or spatial join. Again, we already have
 algorithms built for different operations implemented in Spatial Hadoop as
 MapReduce programs, but they will need 

[jira] [Updated] (PIG-3010) Allow UDFs to flatten themselves

2013-04-25 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3010:


Status: Open  (was: Patch Available)

Patch no longer applies.  This causes review board to not show the diffs 
either.  Sorry for waiting so long on this.

 Allow UDFs to flatten themselves
 -

 Key: PIG-3010
 URL: https://issues.apache.org/jira/browse/PIG-3010
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
Assignee: Jonathan Coveney
 Fix For: 0.12

 Attachments: PIG-3010-0.patch, PIG-3010-1.patch, 
 PIG-3010-2_nowhitespace.patch, PIG-3010-2.patch, PIG-3010-3_nows.patch, 
 PIG-3010-3.patch, PIG-3010-4_nows.patch, PIG-3010-4.patch, 
 PIG-3010-5_nows.patch, PIG-3010-5.patch


 This is something I thought would be cool for a while, so I sat down and did 
 it because I think there are some useful debugging tools it'd help with.
 The idea is that if you attach an annotation to a UDF, the Tuple or DataBag 
 you output will be flattened. This is quite powerful. A very common pattern 
 is:
 a = foreach data generate Flatten(MyUdf(thing)) as (a,b,c);
 This would let you just do:
 a = foreach data generate MyUdf(thing);
 With the exact same result!
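The mechanism can be sketched as a runtime-retained marker annotation that the planner checks by reflection. The annotation and class names below are hypothetical; the actual patch may use different ones.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical marker annotation; the patch's real annotation name may differ.
@Retention(RetentionPolicy.RUNTIME)
@interface OutputsFlattened {}

// A UDF class carrying the marker; Pig would flatten its Tuple/DataBag output.
@OutputsFlattened
class MyUdf {}

public class FlattenCheck {
    // How the planner could detect, at script-compile time, that a UDF's
    // output should be flattened.
    static boolean wantsFlattening(Class<?> udfClass) {
        return udfClass.isAnnotationPresent(OutputsFlattened.class);
    }

    public static void main(String[] args) {
        System.out.println(wantsFlattening(MyUdf.class)); // true
    }
}
```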

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Reopened] (PIG-3164) Pig current releases lack a UDF endsWith. This UDF tests if a given string ends with the specified suffix.

2013-04-25 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reopened PIG-3164:
-


Backed these changes out; I should never have checked them in.  I missed that 
this was only in test and not in main, so I ended up compiling the wrong thing 
to make sure this worked.

UDFs should not be added under piggybank/java/src/test.  That's for unit tests 
for the UDF.  The UDFs should be under piggybank/java/src/main.  

Thanks Niels for catching my mistake.

 Pig current releases lack a UDF endsWith. This UDF tests if a given string 
 ends with the specified suffix.
 -

 Key: PIG-3164
 URL: https://issues.apache.org/jira/browse/PIG-3164
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Affects Versions: 0.10.0
Reporter: Anuroopa George
Assignee: Anuroopa George
 Fix For: 0.12

 Attachments: ENDSWITH.java.patch, ENDSWITH_updated.java


 Pig's current releases lack a UDF endsWith. This UDF tests whether a given 
 string ends with the specified suffix. It returns true if the character 
 sequence represented by the suffix argument is a suffix of the character 
 sequence represented by the given string, and false otherwise. True is also 
 returned if the given suffix is an empty string or is equal to the given 
 string.
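The described semantics map directly onto Java's String.endsWith, which already handles the empty-suffix and equal-string cases. A minimal sketch of the core logic, stripped of Pig's EvalFunc scaffolding (class and method names here are illustrative):

```java
public class EndsWithUdf {
    // Core logic of the proposed ENDSWITH UDF. Null inputs yield null,
    // following the usual Pig UDF convention for missing data.
    public static Boolean endsWith(String str, String suffix) {
        if (str == null || suffix == null) {
            return null;
        }
        // String.endsWith returns true for an empty suffix and for a
        // suffix equal to the whole string, matching the description.
        return str.endsWith(suffix);
    }
}
```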

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3027) pigTest unit test needs a newline filter for comparisons of golden multi-line

2013-04-23 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3027:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks John.

 pigTest unit test needs a newline filter for comparisons of golden multi-line
 -

 Key: PIG-3027
 URL: https://issues.apache.org/jira/browse/PIG-3027
 Project: Pig
  Issue Type: Sub-task
  Components: build
Affects Versions: 0.10.0
Reporter: John Gordon
Assignee: John Gordon
 Fix For: 0.12

 Attachments: PIG-3027.trunk.1.patch


 pigTest leverages assertOutput throughout for text file comparisons to golden 
 checked-in baselines.  This method doesn't take into account line ending 
 differences across platforms.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3198) Let users use any function from PigType - PigType as if it were builtin

2013-04-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635744#comment-13635744
 ] 

Alan Gates commented on PIG-3198:
-

I looked through this.  Other than stray tabs (rather than spaces) in some of 
the files it looks good.  +1.  I think this is exciting functionality.  I'm 
glad to see it added.

 Let users use any function from PigType - PigType as if it were builtin
 -

 Key: PIG-3198
 URL: https://issues.apache.org/jira/browse/PIG-3198
 Project: Pig
  Issue Type: Bug
Reporter: Jonathan Coveney
Assignee: Jonathan Coveney
 Fix For: 0.12

 Attachments: PIG-3198-0.patch


 This idea is an extension of PIG-2643. Ideally, someone should be able to 
 call any function currently registered in Pig as if it were builtin.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3173) Partition filter push down does not happen when partition key conditions include an AND and OR construct

2013-04-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3173:


Status: Open  (was: Patch Available)

Canceling patch until feedback from Dmitriy is addressed.

 Partition filter push down does not happen when partition key conditions 
 include an AND and OR construct
 --

 Key: PIG-3173
 URL: https://issues.apache.org/jira/browse/PIG-3173
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.10.1
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.12

 Attachments: PIG-3173-1.patch


 A = load 'db.table' using org.apache.hcatalog.pig.HCatLoader();
 B = filter A by (region=='usa' AND dt=='201302051800') OR (region=='uk' AND 
 dt=='201302051800');
 C = foreach B generate name, age;
 DUMP C;
 gives the below warning and scans the whole table.
 2013-02-06 22:22:16,233 [main] WARN  
 org.apache.pig.newplan.PColFilterExtractor  - No partition filter push down: 
 You have an partition column (region ) in a construction like: (pcond  and 
 ...) or (pcond and ...) where pcond is a condition on a partition column.
 2013-02-06 22:22:16,233 [main] WARN  
 org.apache.pig.newplan.PColFilterExtractor  - No partition filter push down: 
 You have an partition column (datestamp ) in a construction like: (pcond  and 
 ...) or (pcond and ...) where pcond is a condition on a partition column.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (PIG-3164) Pig current releases lack a UDF endsWith. This UDF tests if a given string ends with the specified suffix.

2013-04-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-3164:
---

Assignee: Anuroopa George

 Pig current releases lack a UDF endsWith. This UDF tests if a given string 
 ends with the specified suffix.
 -

 Key: PIG-3164
 URL: https://issues.apache.org/jira/browse/PIG-3164
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Affects Versions: 0.10.0
Reporter: Anuroopa George
Assignee: Anuroopa George
 Fix For: 0.12

 Attachments: ENDSWITH.java.patch, ENDSWITH_updated.java


 Pig's current releases lack a UDF endsWith. This UDF tests whether a given 
 string ends with the specified suffix. It returns true if the character 
 sequence represented by the suffix argument is a suffix of the character 
 sequence represented by the given string, and false otherwise. True is also 
 returned if the given suffix is an empty string or is equal to the given 
 string.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3164) Pig current releases lack a UDF endsWith. This UDF tests if a given string ends with the specified suffix.

2013-04-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3164:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Anuroopa.

 Pig current releases lack a UDF endsWith. This UDF tests if a given string 
 ends with the specified suffix.
 -

 Key: PIG-3164
 URL: https://issues.apache.org/jira/browse/PIG-3164
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Affects Versions: 0.10.0
Reporter: Anuroopa George
Assignee: Anuroopa George
 Fix For: 0.12

 Attachments: ENDSWITH.java.patch, ENDSWITH_updated.java


 Pig's current releases lack a UDF endsWith. This UDF tests whether a given 
 string ends with the specified suffix. It returns true if the character 
 sequence represented by the suffix argument is a suffix of the character 
 sequence represented by the given string, and false otherwise. True is also 
 returned if the given suffix is an empty string or is equal to the given 
 string.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3114) Duplicated macro name error when using pigunit

2013-04-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3114:


Status: Open  (was: Patch Available)

Canceling patch pending agreement on how to address the issue.

 Duplicated macro name error when using pigunit
 --

 Key: PIG-3114
 URL: https://issues.apache.org/jira/browse/PIG-3114
 Project: Pig
  Issue Type: Bug
  Components: parser
Affects Versions: 0.11
Reporter: Chetan Nadgire
Assignee: Chetan Nadgire
 Fix For: 0.12

 Attachments: PIG-3114.patch, PIG-3114.patch


 I'm using PigUnit to test a Pig script within which a macro is defined.
 Pig runs fine on the cluster, but I get a parsing error with PigUnit.
 So I tried a very basic Pig script with a macro and got a similar error.
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
 parsing. line 9 null. Reason: Duplicated macro name 'my_macro_1'
   at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607)
   at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546)
   at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
   at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:988)
   at 
 org.apache.pig.pigunit.pig.GruntParser.processPig(GruntParser.java:61)
   at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
   at 
 org.apache.pig.pigunit.pig.PigServer.registerScript(PigServer.java:56)
   at org.apache.pig.pigunit.PigTest.registerScript(PigTest.java:160)
   at org.apache.pig.pigunit.PigTest.assertOutput(PigTest.java:231)
   at org.apache.pig.pigunit.PigTest.assertOutput(PigTest.java:261)
   at FirstPigTest.MyPigTest.testTop2Queries(MyPigTest.java:32)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at junit.framework.TestCase.runTest(TestCase.java:176)
   at junit.framework.TestCase.runBare(TestCase.java:141)
   at junit.framework.TestResult$1.protect(TestResult.java:122)
   at junit.framework.TestResult.runProtected(TestResult.java:142)
   at junit.framework.TestResult.run(TestResult.java:125)
   at junit.framework.TestCase.run(TestCase.java:129)
   at junit.framework.TestSuite.runTest(TestSuite.java:255)
   at junit.framework.TestSuite.run(TestSuite.java:250)
   at 
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)
   at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
   at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: Failed to parse: line 9 null. Reason: Duplicated macro name 
 'my_macro_1'
   at 
 org.apache.pig.parser.QueryParserDriver.makeMacroDef(QueryParserDriver.java:406)
   at 
 org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java:277)
   at 
 org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:178)
   at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1599)
   ... 30 more
  
 Pig script which is failing :
 {code:title=test.pig|borderStyle=solid}
 DEFINE my_macro_1 (QUERY, A) RETURNS C {
 $C = ORDER $QUERY BY total DESC, $A;
 } ;
 data =  LOAD 'input' AS (query:CHARARRAY);
 queries_group = GROUP data BY query;
 queries_count = FOREACH queries_group GENERATE group AS query, COUNT(data) AS 
 total;
 queries_ordered = my_macro_1(queries_count, query);
 queries_limit = LIMIT queries_ordered 2;
 STORE queries_limit INTO 'output';
 {code}
 If I remove the macro, PigUnit works fine. Even just defining a macro without 
 using it results in a parsing error.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3237) Pig current releases lack a UDF MakeSet(). This UDF returns a set value (a string containing substrings separated by , characters) consisting of the strings that have the

2013-04-16 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3237:


Fix Version/s: (was: 0.10.0)
   Status: Open  (was: Patch Available)

Thanks for the patch.  Some belated feedback.

# Please add some documentation (preferably in the form of javadocs on the 
class) explaining what this does.  Looking over the code it's not clear to me 
what you're trying to accomplish or even how this is related to creating a set.
# It needs unit tests
# You're hard-wiring the number of allowed tokens in a couple of places; bits[] 
and strings[] both have hard-coded sizes.  This will result in 
IndexOutOfBoundsExceptions with no error message indicating why.  These should 
be extensible, or at least check the bounds and tell users they have exceeded 
them.
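For reference, the described semantics can be sketched with bounds checking instead of hard-coded sizes. This assumes the function is modeled on MySQL's MAKE_SET (bit i of the first argument selects the i-th string); the class and method names are illustrative, not the patch's.

```java
import java.util.ArrayList;
import java.util.List;

public class MakeSetSketch {
    // Bit i of `bits` selects strings[i]; selected strings are joined
    // with commas. The loop bound comes from the actual argument count,
    // so no index can run past the arrays.
    public static String makeSet(long bits, String... strings) {
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < strings.length && i < 64; i++) {
            if ((bits & (1L << i)) != 0 && strings[i] != null) {
                selected.add(strings[i]);
            }
        }
        return String.join(",", selected);
    }
}
```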

 Pig current releases lack a UDF MakeSet(). This UDF returns a set value (a 
 string containing substrings separated by , characters) consisting of the 
 strings that have the corresponding bit in the first argument
 

 Key: PIG-3237
 URL: https://issues.apache.org/jira/browse/PIG-3237
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.10.0
Reporter: Seethal Vincent
 Attachments: MakeSet.java.patch


 Pig current releases lack a UDF MakeSet(). This UDF returns a set value (a 
 string containing substrings separated by , characters) consisting of the 
 strings that have the corresponding bit in the first argument

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3238) Pig current releases lack a UDF Stuff(). This UDF deletes a specified length of characters and inserts another set of characters at a specified starting point.

2013-04-16 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3238:


Fix Version/s: (was: 0.10.0)
   Status: Open  (was: Patch Available)

 Pig current releases lack a UDF Stuff(). This UDF deletes a specified length 
 of characters and inserts another set of characters at a specified starting 
 point.
 ---

 Key: PIG-3238
 URL: https://issues.apache.org/jira/browse/PIG-3238
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.10.0
Reporter: Sonu Prathap
 Attachments: Stuff.java.patch


 Pig current releases lack a UDF Stuff(). This UDF deletes a specified length 
 of characters and inserts another set of characters at a specified starting 
 point.
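A sketch of the described semantics, assuming they mirror SQL Server's STUFF (1-based start position); the class and method names are illustrative:

```java
public class StuffSketch {
    // Delete `length` characters of `str` starting at 1-based position
    // `start`, then insert `replacement` at that position. Returns null
    // for out-of-range arguments rather than throwing.
    public static String stuff(String str, int start, int length, String replacement) {
        if (str == null || replacement == null
                || start < 1 || start > str.length() || length < 0) {
            return null;
        }
        int from = start - 1;                            // convert to 0-based
        int to = Math.min(from + length, str.length());  // clamp deletion end
        return str.substring(0, from) + replacement + str.substring(to);
    }
}
```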

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3215) [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files

2013-04-16 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3215:


Status: Open  (was: Patch Available)

 [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
 

 Key: PIG-3215
 URL: https://issues.apache.org/jira/browse/PIG-3215
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Reporter: MIYAKAWA Taku
Assignee: MIYAKAWA Taku
  Labels: piggybank
 Attachments: LTSVLoader-6.html, LTSVLoader.html, PIG-3215-6.patch, 
 PIG-3215.patch


 LTSV, or Labeled Tab-separated Values format is now getting popular in Japan 
 for log files, especially of web servers. The goal of this jira is to add 
 LTSVLoader in PiggyBank to load LTSV files.
 LTSV is based on TSV, so columns are separated by tab characters.
 Additionally, each column consists of a label and a value, separated by a :
 character.
 Read about LTSV on http://ltsv.org/.
 h4. Example LTSV file (access.log)
 Columns are separated by tab characters.
 {noformat}
 host:host1.example.org	req:GET /index.html	ua:Opera/9.80
 host:host1.example.org	req:GET /favicon.ico	ua:Opera/9.80
 host:pc.example.com	req:GET /news.html	ua:Mozilla/5.0
 {noformat}
 h4. Usage 1: Extract fields from each line
 Users can specify an input schema and get columns as Pig fields.
 This example loads the LTSV file shown in the previous section.
 {code}
 -- Parses the access log and counts the number of lines
 -- for each pair of the host column and the ua column.
 access = LOAD 'access.log' USING 
 org.apache.pig.piggybank.storage.LTSVLoader('host:chararray, ua:chararray');
 grouped_access = GROUP access BY (host, ua);
 count_for_host_ua = FOREACH grouped_access GENERATE group.host, group.ua, 
 COUNT(access);
 DUMP count_for_host_ua;
 {code}
 The below text will be printed out.
 {noformat}
 (host1.example.org,Opera/9.80,2)
 (pc.example.com,Firefox/5.0,1)
 {noformat}
 h4. Usage 2: Extract a map from each line
 Users can get a map for each LTSV line. The key of a map is a label of the 
 LTSV column. The value of a map comes from characters after : in the LTSV 
 column.
 {code}
 -- Parses the access log and projects the user agent field.
 access = LOAD 'access.log' USING 
 org.apache.pig.piggybank.storage.LTSVLoader() AS (m:map[]);
 user_agent = FOREACH access GENERATE m#'ua' AS ua;
 DUMP user_agent;
 {code}
 The below text will be printed out.
 {noformat}
 (Opera/9.80)
 (Opera/9.80)
 (Firefox/5.0)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3190) Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization

2013-04-16 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3190:


Status: Open  (was: Patch Available)

Canceling patch until issues around location and build failures are resolved.

 Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
 ---

 Key: PIG-3190
 URL: https://issues.apache.org/jira/browse/PIG-3190
 Project: Pig
  Issue Type: Bug
  Components: internal-udfs
Affects Versions: 0.11
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.12

 Attachments: PIG-3190-2.patch, PIG-3190-3.patch, PIG-3190.patch


 TOKENIZE is literally useless. The Lucene Standard/Snowball tokenizers, as 
 used by varaha, are much more useful for actual tasks: 
 https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3193) Fix ant docs warnings

2013-04-16 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633081#comment-13633081
 ] 

Alan Gates commented on PIG-3193:
-

+1.  For the two you didn't fix, why don't you open a separate JIRA so that you 
can resolve this one with the issues you addressed.

 Fix ant docs warnings
 ---

 Key: PIG-3193
 URL: https://issues.apache.org/jira/browse/PIG-3193
 Project: Pig
  Issue Type: Bug
  Components: build, documentation
Affects Versions: 0.11
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
  Labels: newbie
 Fix For: 0.12

 Attachments: PIG-3193.patch


 I see many warnings every time I run ant clean docs. They don't break the 
 build, but it would be nice if we could clean them up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2767) Pig creates wrong schema after dereferencing nested tuple fields

2013-04-16 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13633111#comment-13633111
 ] 

Alan Gates commented on PIG-2767:
-

+1.

 Pig creates wrong schema after dereferencing nested tuple fields
 

 Key: PIG-2767
 URL: https://issues.apache.org/jira/browse/PIG-2767
 Project: Pig
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10.0
 Environment: Amazon EMR, patched to use Pig 0.10.0
Reporter: Jonathan Packer
Assignee: Daniel Dai
 Fix For: 0.12

 Attachments: PIG-2767-1.patch, test_data.txt


 The following script fails:
 data = LOAD 'test_data.txt' USING PigStorage() AS (f1: int, f2: int, f3:
 int, f4: int);
 nested = FOREACH data GENERATE f1, (f2, f3, f4) AS nested_tuple;
 dereferenced = FOREACH nested GENERATE f1, nested_tuple.(f2, f3);
 DESCRIBE dereferenced;
 uses_dereferenced = FOREACH dereferenced GENERATE nested_tuple.f3;
 DESCRIBE uses_dereferenced;
 The schema of dereferenced should be {f1: int, nested_tuple: (f2: int,
 f3: int)}. DESCRIBE thinks it is {f1: int, f2: int} instead. When dump is
 used, however, the data actually follows the correct schema, e.g.
 (1,(2,3))
 (5,(6,7))
 ...
 This is not just a problem with DESCRIBE. Because the schema is incorrect,
 the reference to nested_tuple in the uses_dereferenced statement is
 considered to be invalid, and the script fails to run. The error is:
 Invalid field projection. Projected field [nested_tuple] does not exist in
 schema: f1:int,f2:int.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (PIG-3186) tar/deb/pkg ant targets should depend on piggybank

2013-04-16 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-3186:
---

Assignee: Lorand Bendig

 tar/deb/pkg ant targets should depend on piggybank
 --

 Key: PIG-3186
 URL: https://issues.apache.org/jira/browse/PIG-3186
 Project: Pig
  Issue Type: Bug
Reporter: Bill Graham
Assignee: Lorand Bendig
  Labels: low-hanging-fruit, simple
 Fix For: 0.12

 Attachments: piggy.patch


 The tar, deb and rpm artifacts should contain piggybank but they don't when 
 built via ant unless piggybank is built separately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3186) tar/deb/pkg ant targets should depend on piggybank

2013-04-16 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3186:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Lorand.

 tar/deb/pkg ant targets should depend on piggybank
 --

 Key: PIG-3186
 URL: https://issues.apache.org/jira/browse/PIG-3186
 Project: Pig
  Issue Type: Bug
Reporter: Bill Graham
Assignee: Lorand Bendig
  Labels: low-hanging-fruit, simple
 Fix For: 0.12

 Attachments: piggy.patch


 The tar, deb and rpm artifacts should contain piggybank but they don't when 
 built via ant unless piggybank is built separately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-200) Pig Performance Benchmarks

2013-04-15 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13632338#comment-13632338
 ] 

Alan Gates commented on PIG-200:


+1.  Latest patch changes look good.  I think it would be good to get this 
checked in and maintained going forward.

 Pig Performance Benchmarks
 --

 Key: PIG-200
 URL: https://issues.apache.org/jira/browse/PIG-200
 Project: Pig
  Issue Type: Task
Reporter: Amir Youssefi
Assignee: Alan Gates
 Fix For: 0.2.0

 Attachments: generate_data.pl, perf-0.6.patch, perf.hadoop.patch, 
 perf.patch, pig-0.8.1-vs-0.9.0.png, PIG-200-0.12.patch, pigmix2.patch, 
 pigmix_pig0.11.patch


 To benchmark Pig performance, we need to have a TPC-H like Large Data Set 
 plus Script Collection. This is used in comparison of different Pig releases, 
 Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop Only).
 Here is Wiki for small tests: http://wiki.apache.org/pig/PigPerformance
 I am currently running long-running Pig scripts over data-sets in the order 
 of tens of TBs. Next step is hundreds of TBs.
 We need to have an open large-data set (open source scripts which generate 
 data-set) and detailed scripts for important operations such as ORDER, 
 AGGREGATION etc.
 We can call those the Pig Workouts: Cardio (short processing), Marathon (long 
 running scripts) and Triathlon (Mix). 
 I will update this JIRA with more details of current activities soon.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3186) tar/deb/pkg ant targets should depend on piggybank

2013-03-29 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617680#comment-13617680
 ] 

Alan Gates commented on PIG-3186:
-

Is this ready for review?  If so please click Submit Patch so we know to 
review it.  Thanks for the patch.

 tar/deb/pkg ant targets should depend on piggybank
 --

 Key: PIG-3186
 URL: https://issues.apache.org/jira/browse/PIG-3186
 Project: Pig
  Issue Type: Bug
Reporter: Bill Graham
  Labels: low-hanging-fruit, simple
 Fix For: 0.12

 Attachments: piggy.patch


 The tar, deb and rpm artifacts should contain piggybank but they don't when 
 built via ant unless piggybank is built separately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3247) Piggybank functions to mimic OVER clause in SQL

2013-03-26 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3247:


Attachment: Over.2.patch

A new version of the patch that fixes an error in the percent_rank calculation 
and adds the ability to specify the return type of the Over function.

 Piggybank functions to mimic OVER clause in SQL
 ---

 Key: PIG-3247
 URL: https://issues.apache.org/jira/browse/PIG-3247
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.12

 Attachments: Over.2.patch, Over.patch


 In order to test Hive I have written some UDFs to mimic the behavior of SQL's 
 OVER clause.  I thought they would be useful to share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3257) Add unique identifier UDF

2013-03-22 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3257:


Attachment: PIG-3257.patch

 Add unique identifier UDF
 -

 Key: PIG-3257
 URL: https://issues.apache.org/jira/browse/PIG-3257
 Project: Pig
  Issue Type: Improvement
  Components: internal-udfs
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.12

 Attachments: PIG-3257.patch


 It would be good to have a Pig function to generate unique identifiers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3257) Add unique identifier UDF

2013-03-22 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3257:


Status: Patch Available  (was: Open)

A simple UDF that calls Java's UUID.randomUUID() method.  I believe this 
could be done with a combination of the piggybank ToString function and using 
StringInvoker for UUID.randomUUID, but this seems like a useful and simple 
enough thing to just build in.
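The core of such a UDF is a one-line call to the JDK, sketched here without Pig's EvalFunc scaffolding (names are illustrative):

```java
import java.util.UUID;

public class UniqueIdUdf {
    // UUID.randomUUID() yields a random (version 4) UUID; the UDF would
    // return its 36-character string form as a chararray.
    public static String nextId() {
        return UUID.randomUUID().toString();
    }
}
```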

 Add unique identifier UDF
 -

 Key: PIG-3257
 URL: https://issues.apache.org/jira/browse/PIG-3257
 Project: Pig
  Issue Type: Improvement
  Components: internal-udfs
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.12

 Attachments: PIG-3257.patch


 It would be good to have a Pig function to generate unique identifiers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

