[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir

2013-03-11 Thread Jarek Jarcec Cecho (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598563#comment-13598563
 ] 

Jarek Jarcec Cecho commented on PIG-2591:
-

Hi [~cheolsoo] and [~prkommireddi],
thank you very much for your feedback. I agree with the proposed steps and I'll be 
more than happy to execute them myself. I just have a small comment: direct usage 
of {{FileLocalizer}} is not simple, as it requires a lot of initialization (for 
example, that class needs a valid {{PigContext}}). Thus I would propose to move 
the actual logic of getting the temporary directory into a standalone, easily 
reusable class {{TmpUtil}} (which is part of {{bugPIG-2591.patch}}).
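
To make the idea concrete, here is a minimal sketch of what such a standalone helper could look like. It is not the class from the patch: the name TmpDirSketch, the pig.temp.dir key, the /tmp default, and the "temp" prefix are all assumptions made for illustration.

```java
import java.util.Properties;
import java.util.Random;

/** Hypothetical stand-in for a TmpUtil-style helper; not the class from the patch. */
final class TmpDirSketch {
    // Assumed property name and default value, used here only for illustration.
    static final String TEMP_DIR_KEY = "pig.temp.dir";
    static final String DEFAULT_TEMP_DIR = "/tmp";

    private TmpDirSketch() {}

    /** Builds a randomized temporary directory path under the configured base. */
    static String getTemporaryDirectory(Properties props, Random random) {
        String base = props.getProperty(TEMP_DIR_KEY, DEFAULT_TEMP_DIR);
        // Mask the sign bit so the numeric suffix is never negative.
        int suffix = random.nextInt() & Integer.MAX_VALUE;
        String separator = base.endsWith("/") ? "" : "/";
        return base + separator + "temp" + suffix;
    }
}
```

Because the helper only needs a Properties object and a Random, it can be called from a test without building any other context object first.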

Jarcec

 Unit tests should not write to /tmp but respect java.io.tmpdir
 --

 Key: PIG-2591
 URL: https://issues.apache.org/jira/browse/PIG-2591
 Project: Pig
  Issue Type: Bug
  Components: tools
Reporter: Thomas Weise
Assignee: Jarek Jarcec Cecho
 Fix For: 0.12

 Attachments: bugPIG-2591.patch, PIG-2495.patch


 Several tests use /tmp but should derive temporary file location from 
 java.io.tmpdir to avoid side effects (java.io.tmpdir is already set to a test 
 run specific location in build.xml)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir

2013-03-11 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598564#comment-13598564
 ] 

Cheolsoo Park commented on PIG-2591:


I agree with Jarcec. Sorry if my comment wasn't clear enough.



[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir

2013-03-11 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598580#comment-13598580
 ] 

Prashant Kommireddi commented on PIG-2591:
--

Thanks guys. Here are a couple of cases I am thinking about:

1. On a Windows machine, java.io.tmpdir could default to something like 
C:\DOCUME~1\username\LOCALS~1\Temp\. This is fine if we are running unit 
tests on a local machine, but could it have unexpected results when Pig attempts 
to create that directory on HDFS for intermediate files? 

2. Clarity: would it be better to separate these properties as 
pig.test.temp.dir and pig.temp.dir? It might be good (and expected) if we 
default pig.temp.dir to /tmp, as that has always been the case.

Jarcec, I agree with what you propose too. It makes sense to separate the helper 
methods into TmpUtil. The patch does not currently use java.io.tmpdir; how do 
you see that being used?





[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir

2013-03-11 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598600#comment-13598600
 ] 

Cheolsoo Park commented on PIG-2591:


[~prkommireddi],
# If we set pig.temp.dir to ./build, it would work for both Linux and 
Windows.
# I am not proposing to change the default value of pig.temp.dir. I am 
proposing to set it for unit tests only in build.xml. I don't think we need to 
introduce an additional property.

Do you agree?



[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir

2013-03-11 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598607#comment-13598607
 ] 

Prashant Kommireddi commented on PIG-2591:
--

1. Agreed (assuming we no longer are considering java.io.tmpdir and instead 
using ./build?)
2. I am a bit unclear on the reasons behind reusing pig.temp.dir. It's a bit 
confusing to me that pig.temp.dir is set differently for tests vs intermediate 
store at different places. 



[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir

2013-03-11 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598620#comment-13598620
 ] 

Cheolsoo Park commented on PIG-2591:


PIG-2995 explains the reasoning. I think that the advantage of having a 
property such as pig.temp.dir is that we can use it on a case-by-case basis 
without having to change code. So I don't understand why setting the property to 
different values for different purposes is confusing.

As far as I understand, both PIG-2995 and PIG-2591 are trying to solve the same 
problem. When you have automated builds, you want to control where builds 
generate temporary files. In fact, there are two kinds of files that tests 
generate, as you pointed out: 1) intermediate files and 2) output files (e.g. 
STORE foo INTO '/tmp'). To me, controlling them with a single property sounds 
better than having two properties.

Does this make sense?



[jira] [Commented] (PIG-3241) ConcurrentModificationException in POPartialAgg

2013-03-11 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598965#comment-13598965
 ] 

Rohini Palaniswamy commented on PIG-3241:
-

[~dvryaboy],
  No. We are currently running on Pig 0.10 only. Pig 0.11 testing is blocked 
due to lack of QE resources to certify new features. 

{code}
private Map<Object, List<Tuple>> rawInputMap = Maps.newHashMap();
private Map<Object, List<Tuple>> processedInputMap = Maps.newHashMap();
{code}

Don't think it has to do with Hadoop 2.0. A ConcurrentHashMap should have been 
used in the code. Exception not visible in unit tests as spill does not happen 
there.

 ConcurrentModificationException in POPartialAgg
 ---

 Key: PIG-3241
 URL: https://issues.apache.org/jira/browse/PIG-3241
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
Reporter: Lohit Vijayarenu
Priority: Blocker
 Fix For: 0.12, 0.11.1


 While running a few Pig scripts against Hadoop 2.0, I consistently see 
 ConcurrentModificationException:
 {noformat}
   at java.util.HashMap$HashIterator.remove(HashMap.java:811)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregate(POPartialAgg.java:365)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregateSecondLevel(POPartialAgg.java:379)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.getNext(POPartialAgg.java:203)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:729)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153)
 {noformat}
 It looks like rawInputMap is being modified while elements are being 
 removed from it. 



[jira] [Commented] (PIG-3241) ConcurrentModificationException in POPartialAgg

2013-03-11 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599008#comment-13599008
 ] 

Dmitriy V. Ryaboy commented on PIG-3241:


Thing is, we are running this in production on Hadoop 1.0 and don't observe 
this error.




[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir

2013-03-11 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599030#comment-13599030
 ] 

Cheolsoo Park commented on PIG-2591:


[~jarcec] and [~prkommireddi], I thought about this more last night. I withdraw 
my agreement on Jarcec's patch.
* I don't think we want to have the following method:
{code}
public static String getTemporaryDirectory() {
    return getTemporaryDirectory(System.getProperties(), random);
}
{code}
Let's say that we set pig.temp.dir to /x in system properties (e.g. via 
sysproperty in build.xml). FileLocalizer.getTemporaryPath() is not 
affected by this because pig.temp.dir is not set in PigContext, so Pig will 
still generate intermediate data under file:///tmp. Now if we do store foo 
into TmpUtil.getTemporaryDirectory(), unit tests will generate temporary files 
in two directories: output files in /x and intermediate files in /tmp.
{quote}
The direct usage of FileLocalizer is not simple as it requires a lot of 
initialization (for example that class needs valid PigContext).
{quote}
In fact, this is good because that's how we guarantee that all the temporary 
files go into a single directory controlled by PigContext. So we should use 
FileLocalizer.getTemporaryPath().
* I was wrong about setting pig.temp.dir in system properties in build.xml 
for unit tests. In fact, it won't be propagated to PigContext. So we should set 
pig.temp.dir in PigContext.
* All of these are only applicable to local mode (i.e. file://). If unit tests 
run on a mini cluster, we shouldn't set pig.temp.dir. Currently, the mini cluster 
already generates files under ./build, so we don't need to worry about them.
* We should use File.createTempFile() for local temporary files that are not 
generated by Pig (e.g. some tests generate input files locally and copy them 
to the mini cluster). Then java.io.tmpdir will be honored automatically.

I think I covered all the cases. Please let me know what you think.
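
For the last point, a minimal sketch of a test helper that relies on File.createTempFile() could look like this. The class and method names are hypothetical; the only behavior relied on is that the two-argument createTempFile() places the file in the JVM default temporary directory, which is taken from java.io.tmpdir.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical test helper; the names are for illustration only. */
final class LocalTempFileSketch {
    private LocalTempFileSketch() {}

    /** Creates an empty local input file in the default temporary directory. */
    static File createTestInput(String prefix) {
        try {
            // No directory argument: the JVM default temp directory
            // (java.io.tmpdir) is used. The prefix needs at least 3 characters.
            File file = File.createTempFile(prefix, ".txt");
            // Ask the JVM to remove the file when the test JVM exits.
            file.deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With java.io.tmpdir pointed at a build-specific directory, a test that calls createTestInput("pigtest") no longer needs a hard-coded /tmp path.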



[jira] [Commented] (PIG-3241) ConcurrentModificationException in POPartialAgg

2013-03-11 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599097#comment-13599097
 ] 

Rohini Palaniswamy commented on PIG-3241:
-

I was wrong about iterator.remove() of HashMap throwing 
ConcurrentModificationException. It happens only when the map is modified 
directly while iterating over entrySet():

{code}
for (Entry<String, String> entry : map.entrySet()) {
    map.remove(entry.getKey());
}
{code}
throws ConcurrentModificationException

{code}
Iterator<Entry<String, String>> spillingIterator = map.entrySet().iterator();
Entry<String, String> entry = spillingIterator.next();
spillingIterator.remove();
{code}
does not throw ConcurrentModificationException
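
The two cases can be reproduced with a small self-contained program. This is only a single-threaded sketch with a three-entry sample map (class and method names are made up); it does not model what POPartialAgg does.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Single-threaded demonstration of the two removal styles on a HashMap. */
final class HashMapRemovalDemo {
    private HashMapRemovalDemo() {}

    static Map<String, String> sampleMap() {
        Map<String, String> map = new HashMap<>();
        map.put("a", "1");
        map.put("b", "2");
        map.put("c", "3");
        return map;
    }

    /** Removes through the map inside a for-each loop; reports whether CME was thrown. */
    static boolean removeViaMapThrows() {
        Map<String, String> map = sampleMap();
        try {
            for (Map.Entry<String, String> entry : map.entrySet()) {
                // Structural change behind the iterator's back.
                map.remove(entry.getKey());
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Removes through the iterator itself; returns how many entries are left. */
    static int removeViaIterator() {
        Map<String, String> map = sampleMap();
        Iterator<Map.Entry<String, String>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            it.next();
            it.remove(); // keeps the iterator's bookkeeping in sync
        }
        return map.size();
    }
}
```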

  So theoretically it should be fine if the map is accessed within a single 
thread. We need to investigate whether more than one thread is accessing 
POPartialAgg. It is possible that this is a 2.0 issue. A simple fix would be to 
change the map to a ConcurrentHashMap, but we need to find the underlying cause 
to assess further impact. I am blocked by a few deadlines before going on 
vacation and don't have time to spare for the investigation. I am fine if 
anyone else can take a look at this. Cheolsoo?

[~lohit],
   Can you change it to ConcurrentHashMap, compile, and see if that fixes the 
problem for you?
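
As a rough illustration of why the swap would silence the exception: ConcurrentHashMap iterators are weakly consistent and are documented not to throw ConcurrentModificationException. The sketch below (hypothetical names, single-threaded) shows only that property; it says nothing about whether the underlying cause in POPartialAgg would be addressed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Shows that a ConcurrentHashMap tolerates removal during iteration. */
final class ConcurrentMapRemovalDemo {
    private ConcurrentMapRemovalDemo() {}

    /** Removes every entry through the map while iterating; returns the final size. */
    static int removeWhileIterating() {
        Map<String, String> map = new ConcurrentHashMap<>();
        map.put("a", "1");
        map.put("b", "2");
        map.put("c", "3");
        for (Map.Entry<String, String> entry : map.entrySet()) {
            // With a HashMap this direct removal would trip the fail-fast check.
            map.remove(entry.getKey());
        }
        return map.size();
    }
}
```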
 



[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir

2013-03-11 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599098#comment-13599098
 ] 

Prashant Kommireddi commented on PIG-2591:
--

[~cheolsoo] I agree with your assessment. 

{quote}
I was wrong about setting pig.temp.dir in system properties in build.xml for 
unit tests. In fact, it won't be propagated to PigContext. So we should set 
pig.temp.dir in PigContext. {quote}
That was partly a cause for my confusion :). Setting a system property would 
not reflect in PigContext.
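
The effect can be sketched without Pig at all. Below, a plain Properties object stands in for the properties held by PigContext (an assumption for illustration; the real class is more involved), and the lookup falls back to /tmp when the key is missing.

```java
import java.util.Properties;

/** Models a context whose properties are separate from JVM system properties. */
final class PropertyPropagationSketch {
    static final String KEY = "pig.temp.dir";

    private PropertyPropagationSketch() {}

    /** Hypothetical lookup: reads the key from the context, defaulting to /tmp. */
    static String tempDirFrom(Properties contextProps) {
        return contextProps.getProperty(KEY, "/tmp");
    }

    /** Sets only the system property; the separate context never sees it. */
    static String lookupWithoutCopy() {
        String previous = System.getProperty(KEY);
        System.setProperty(KEY, "/x");
        try {
            Properties contextProps = new Properties(); // not seeded from system properties
            return tempDirFrom(contextProps);
        } finally {
            // Restore the JVM state so the demo has no side effects.
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }

    /** Sets the key on the context itself, which is what the lookup reads. */
    static String lookupWithExplicitSet() {
        Properties contextProps = new Properties();
        contextProps.setProperty(KEY, "/x");
        return tempDirFrom(contextProps);
    }
}
```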

[~jarcec] apologies for the back-and-forth and thanks for your patience. But 
it's important we all agree on the best approach as this patch affects the way 
we write and run tests :)





[jira] [Commented] (PIG-3194) Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2

2013-03-11 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599120#comment-13599120
 ] 

Prashant Kommireddi commented on PIG-3194:
--

Dmitriy, let me know what you think?

 Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
 ---

 Key: PIG-3194
 URL: https://issues.apache.org/jira/browse/PIG-3194
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
Reporter: Kai Londenberg
Assignee: Prashant Kommireddi
 Fix For: 0.11.1

 Attachments: PIG-3194.patch


 The changes to ObjectSerializer.java in the following commit
 http://svn.apache.org/viewvc?view=revision&revision=1403934 break 
 compatibility with Hadoop 0.20.2 clusters.
 The reason is that the code uses methods from Apache Commons Codec 1.4 
 that are not available in Apache Commons Codec 1.3, which ships with 
 Hadoop 0.20.2.
 The offending methods are Base64.decodeBase64(String) and 
 Base64.encodeBase64URLSafeString(byte[]).
 If I revert these changes, Pig 0.11.0 candidate 2 works well with our Hadoop 
 0.20.2 clusters.
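
For readers unfamiliar with the URL-safe variant, the sketch below shows the same kind of encoding using java.util.Base64 from the JDK. This is a swapped-in technique for illustration only: it requires Java 8 or later, it is not what the commit or the attached patch does, and the class name is made up.

```java
import java.util.Base64;

/** URL-safe Base64 with the JDK codec; not the Commons Codec calls named above. */
final class UrlSafeBase64Sketch {
    private UrlSafeBase64Sketch() {}

    /** Encodes with '-' and '_' instead of '+' and '/', without '=' padding. */
    static String encode(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    /** Decodes URL-safe Base64 text; trailing padding is optional. */
    static byte[] decode(String text) {
        return Base64.getUrlDecoder().decode(text);
    }
}
```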



[jira] [Commented] (PIG-3194) Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2

2013-03-11 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599169#comment-13599169
 ] 

Dmitriy V. Ryaboy commented on PIG-3194:


We can skip the test if we detect that we are on Hadoop 2.0 by using 
org.junit.Assume 

Let's do that so we don't have to come back and fix this later.
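
A minimal sketch of the version check that such a guard could use is shown below. The helper name and the "starts with 2." rule are assumptions; the comment shows where org.junit.Assume would come in, with the version string taken from org.apache.hadoop.util.VersionInfo.

```java
/** Hypothetical helper for deciding whether a test should be skipped. */
final class HadoopVersionSketch {
    private HadoopVersionSketch() {}

    /** Returns true when the version string looks like a Hadoop 2.x release. */
    static boolean isHadoop2(String version) {
        return version != null && version.startsWith("2.");
    }

    // In a JUnit 4 test method the check could gate the test body, e.g.:
    //   Assume.assumeTrue(!HadoopVersionSketch.isHadoop2(VersionInfo.getVersion()));
    // A failed assumption marks the test as skipped rather than failed.
}
```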



[jira] [Commented] (PIG-3215) [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files

2013-03-11 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599245#comment-13599245
 ] 

Jonathan Coveney commented on PIG-3215:
---

Sorry for the delay, was out of town. Will try to review in the next couple of 
days.

 [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
 

 Key: PIG-3215
 URL: https://issues.apache.org/jira/browse/PIG-3215
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Reporter: MIYAKAWA Taku
Assignee: MIYAKAWA Taku
  Labels: piggybank
 Attachments: LTSVLoader-6.html, LTSVLoader.html, PIG-3215-6.patch, 
 PIG-3215.patch


 LTSV, or Labeled Tab-separated Values, is a format now getting popular in 
 Japan for log files, especially those of web servers. The goal of this jira 
 is to add LTSVLoader to PiggyBank to load LTSV files.
 LTSV is based on TSV, so columns are separated by tab characters. 
 Additionally, each column consists of a label and a value, separated by a 
 ':' character.
 Read about LTSV on http://ltsv.org/.
 h4. Example LTSV file (access.log)
 Columns are separated by tab characters.
 {noformat}
 host:host1.example.org	req:GET /index.html	ua:Opera/9.80
 host:host1.example.org	req:GET /favicon.ico	ua:Opera/9.80
 host:pc.example.com	req:GET /news.html	ua:Mozilla/5.0
 {noformat}
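
The line format described above can be parsed in a few lines of Java. This is only a sketch of the format, not the LTSVLoader implementation; the class name is made up, and skipping columns that lack a ':' is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Splits one LTSV line into label/value pairs; not the LTSVLoader code. */
final class LtsvLineSketch {
    private LtsvLineSketch() {}

    static Map<String, String> parseLine(String line) {
        // LinkedHashMap keeps the columns in the order they appear on the line.
        Map<String, String> fields = new LinkedHashMap<>();
        for (String column : line.split("\t")) {
            int colon = column.indexOf(':');
            if (colon < 0) {
                continue; // assumption: ignore columns that have no label
            }
            // Only the first ':' separates label and value; later ones stay in the value.
            fields.put(column.substring(0, colon), column.substring(colon + 1));
        }
        return fields;
    }
}
```

Splitting at the first ':' only matters for values that contain colons themselves, such as timestamps or URLs.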
 h4. Usage 1: Extract fields from each line
 Users can specify an input schema and get columns as Pig fields.
 This example loads the LTSV file shown in the previous section.
 {code}
 -- Parses the access log and counts the number of lines
 -- for each pair of the host column and the ua column.
 access = LOAD 'access.log' USING 
 org.apache.pig.piggybank.storage.LTSVLoader('host:chararray, ua:chararray');
 grouped_access = GROUP access BY (host, ua);
 count_for_host_ua = FOREACH grouped_access GENERATE group.host, group.ua, 
 COUNT(access);
 DUMP count_for_host_ua;
 {code}
 The below text will be printed out.
 {noformat}
 (host1.example.org,Opera/9.80,2)
 (pc.example.com,Mozilla/5.0,1)
 {noformat}
 h4. Usage 2: Extract a map from each line
 Users can get a map for each LTSV line. The key of a map is a label of the 
 LTSV column. The value of a map comes from characters after : in the LTSV 
 column.
 {code}
 -- Parses the access log and projects the user agent field.
 access = LOAD 'access.log' USING 
 org.apache.pig.piggybank.storage.LTSVLoader() AS (m:map[]);
 user_agent = FOREACH access GENERATE m#'ua' AS ua;
 DUMP user_agent;
 {code}
 The below text will be printed out.
 {noformat}
 (Opera/9.80)
 (Opera/9.80)
 (Mozilla/5.0)
 {noformat}



[jira] [Updated] (PIG-3223) AvroStorage does not handle comma separated input paths

2013-03-11 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang updated PIG-3223:
--

Attachment: PIG-3223.patch.txt

 AvroStorage does not handle comma separated input paths
 ---

 Key: PIG-3223
 URL: https://issues.apache.org/jira/browse/PIG-3223
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.10.0, 0.11
Reporter: Michael Kramer
Assignee: Johnny Zhang
 Attachments: AvroStorage.patch, AvroStorage.patch-2, 
 AvroStorageUtils.patch, AvroStorageUtils.patch-2, PIG-3223.patch.txt


 In Pig 0.11, a patch was issued to AvroStorage to support globs and comma 
 separated input paths (PIG-2492).  While this function works fine for 
 glob-formatted input paths, it fails when given a standard comma separated 
 list of paths.  fs.globStatus does not seem to be able to parse such a 
 list, and a java.net.URISyntaxException is thrown when toURI is called on the 
 path.  
 I have a working fix for this, but it's extremely ugly (basically checking if 
 the string of input paths is globbed, and otherwise splitting on ',').  I'm 
 sure there's a more elegant solution.  I'd be happy to post the relevant 
 methods and fixes if necessary.
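
The workaround described above could be sketched roughly as follows. This is not the reporter's code or the attached patch: the class and method names are hypothetical, and treating any of the characters * ? [ ] { } as a sign of a glob is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: glob strings stay whole, plain lists are split on ','. */
final class InputPathSketch {
    private static final String GLOB_CHARS = "*?[]{}";

    private InputPathSketch() {}

    /** Returns true if the string contains any character assumed to mark a glob. */
    static boolean looksGlobbed(String paths) {
        for (char c : paths.toCharArray()) {
            if (GLOB_CHARS.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    /** Leaves a globbed string untouched; otherwise splits it on commas. */
    static List<String> splitInputPaths(String paths) {
        if (looksGlobbed(paths)) {
            // A glob such as /data/{a,b}/*.avro may contain commas of its own.
            return Collections.singletonList(paths);
        }
        List<String> result = new ArrayList<>();
        for (String part : paths.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }
}
```

One known gap in this approach is a list that mixes globs and plain paths; the sketch passes such a string through unsplit.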



[jira] Subscription: PIG patch available

2013-03-11 Thread jira
Issue Subscription
Filter: PIG patch available (35 issues)

Subscriber: pigdaily

Key         Summary
PIG-3244    Make PIG_HOME configurable
            https://issues.apache.org/jira/browse/PIG-3244
PIG-3238    Pig current releases lack a UDF Stuff(). This UDF deletes a specified length of characters and inserts another set of characters at a specified starting point.
            https://issues.apache.org/jira/browse/PIG-3238
PIG-3237    Pig current releases lack a UDF MakeSet(). This UDF returns a set value (a string containing substrings separated by , characters) consisting of the strings that have the corresponding bit in the first argument
            https://issues.apache.org/jira/browse/PIG-3237
PIG-3235    Enable DEBUG log messages in unit tests by default
            https://issues.apache.org/jira/browse/PIG-3235
PIG-3233    Deploy a Piggybank Jar
            https://issues.apache.org/jira/browse/PIG-3233
PIG-3215    [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
            https://issues.apache.org/jira/browse/PIG-3215
PIG-3210    Pig fails to start when it cannot write log to log files
            https://issues.apache.org/jira/browse/PIG-3210
PIG-3208    [zebra] TFile should not set io.compression.codec.lzo.buffersize
            https://issues.apache.org/jira/browse/PIG-3208
PIG-3205    Passing arguments to python script does not work with -f option
            https://issues.apache.org/jira/browse/PIG-3205
PIG-3198    Let users use any function from PigType - PigType as if it were builtlin
            https://issues.apache.org/jira/browse/PIG-3198
PIG-3194    Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
            https://issues.apache.org/jira/browse/PIG-3194
PIG-3190    Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
            https://issues.apache.org/jira/browse/PIG-3190
PIG-3183    rm or rmf commands should respect globbing/regex of path
            https://issues.apache.org/jira/browse/PIG-3183
PIG-3172    Partition filter push down does not happen when there is a non partition key map column filter
            https://issues.apache.org/jira/browse/PIG-3172
PIG-3166    Update eclipse .classpath according to ivy library.properties
            https://issues.apache.org/jira/browse/PIG-3166
PIG-3164    Pig current releases lack a UDF endsWith. This UDF tests if a given string ends with the specified suffix.
            https://issues.apache.org/jira/browse/PIG-3164
PIG-3141    Giving CSVExcelStorage an option to handle header rows
            https://issues.apache.org/jira/browse/PIG-3141
PIG-3123    Simplify Logical Plans By Removing Unneccessary Identity Projections
            https://issues.apache.org/jira/browse/PIG-3123
PIG-3122    Operators should not implicitly become reserved keywords
            https://issues.apache.org/jira/browse/PIG-3122
PIG-3114    Duplicated macro name error when using pigunit
            https://issues.apache.org/jira/browse/PIG-3114
PIG-3105    Fix TestJobSubmission unit test failure.
            https://issues.apache.org/jira/browse/PIG-3105
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
            https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
            https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
            https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2643    Use bytecode generation to make a performance replacement for InvokeForLong, InvokeForString, etc
            https://issues.apache.org/jira/browse/PIG-2643
PIG-2641    Create toJSON function for all complex types: tuples, bags and maps
            https://issues.apache.org/jira/browse/PIG-2641
PIG-2591    Unit tests should not write to /tmp but respect java.io.tmpdir
            https://issues.apache.org/jira/browse/PIG-2591
PIG-1914    Support load/store JSON data in Pig
            https://issues.apache.org/jira/browse/PIG-1914

You may edit 

[jira] [Commented] (PIG-3239) Unable to return multiple values from a macro using SPLIT

2013-03-11 Thread Johnny Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13599553#comment-13599553
 ] 

Johnny Zhang commented on PIG-3239:
---

[~luibelgo], please correct me if I am wrong, but I think 'SPLIT' doesn't work 
with 'OTHERWISE'; 'SPLIT' only works with 'IF': 
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT

Actually, the script below works fine for me with your example:
{noformat}
DEFINE my_macro(seq) RETURNS valid, invalid { 
added = FOREACH $seq GENERATE $0 * 2, $1; 
SPLIT added INTO $valid IF $1 == true, $invalid IF $1 ==false;
}
data = LOAD 'case.csv' USING PigStorage(',') AS (value: int, valid: boolean);
P, Q = my_macro(data);
DUMP P;
DUMP Q;
{noformat}
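
One thing to keep in mind with the two-condition form: according to the Pig 
Latin reference for SPLIT, each condition is checked on its own, so a tuple may 
end up in more than one relation or in none. A row whose second field is null 
would match neither '$1 == true' nor '$1 == false'. The Java sketch below only 
models that routing rule; it is not Pig code, and the names SplitSketch and 
split are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class SplitSketch {
    // Models the SPLIT routing rule: every condition is evaluated
    // independently for every row, so a row can land in several outputs
    // or in none of them.
    public static <T> Map<String, List<T>> split(List<T> rows,
                                                 Map<String, Predicate<T>> conditions) {
        Map<String, List<T>> out = new LinkedHashMap<>();
        for (String name : conditions.keySet()) {
            out.put(name, new ArrayList<>());
        }
        for (T row : rows) {
            for (Map.Entry<String, Predicate<T>> e : conditions.entrySet()) {
                if (e.getValue().test(row)) {
                    out.get(e.getKey()).add(row);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A nullable Boolean stands in for a Pig boolean field that may be null.
        List<Boolean> flags = Arrays.asList(true, false, null);
        Map<String, Predicate<Boolean>> conds = new LinkedHashMap<>();
        conds.put("valid", Boolean.TRUE::equals);    // like: IF $1 == true
        conds.put("invalid", Boolean.FALSE::equals); // like: IF $1 == false
        // The null row matches neither condition and is dropped.
        System.out.println(split(flags, conds));
    }
}
```

If the input can contain nulls in that field, an explicit 'IS NULL' branch 
would be needed to keep those rows.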

 Unable to return multiple values from a macro using SPLIT
 -

 Key: PIG-3239
 URL: https://issues.apache.org/jira/browse/PIG-3239
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.10.0
 Environment: Apache Pig version 0.10.0-cdh4.2.0 (rexported) 
 compiled Feb 15 2013, 12:19:17
 Linux 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 x86_64 
 x86_64 x86_64 GNU/Linux
Reporter: Luis Belloch
Priority: Minor

 Hi, I'm unable to return multiple values from a macro when values come from a 
 SPLIT. Here is a small example:
 {code}
 DEFINE my_macro(seq) RETURNS valid, invalid {
 added = FOREACH $seq GENERATE $0 * 2, $1;
 SPLIT added INTO $valid IF $1 == true, $invalid OTHERWISE;
 }
 data = LOAD 'case.csv' USING PigStorage(',') AS (value: int, valid: boolean);
 P, Q = my_macro(data);
 DUMP P;
 DUMP Q;
 {code}
 Pig is unable to recognize the {{OTHERWISE}} side. Error is: {{ERROR 
 org.apache.pig.tools.grunt.Grunt - ERROR 1200: at case.pig, line 3 Invalid 
 macro definition: . Reason: Macro 'my_macro' missing return alias: invalid}}
 Simple workaround is to force {{$invalid}} to be returned as {{FOREACH}} 
 result:
 {code}
 SPLIT added INTO $valid IF $1 == true, tmp_invalid OTHERWISE;
 $invalid = FOREACH tmp_invalid GENERATE *;
 {code}
 Samples and logs attached to the issue.



[jira] [Assigned] (PIG-3239) Unable to return multiple values from a macro using SPLIT

2013-03-11 Thread Johnny Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Zhang reassigned PIG-3239:
-

Assignee: Johnny Zhang



[jira] [Commented] (PIG-3239) Unable to return multiple values from a macro using SPLIT

2013-03-11 Thread Johnny Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13599554#comment-13599554
 ] 

Johnny Zhang commented on PIG-3239:
---

Sorry for the typo, I meant:
{noformat}
DEFINE my_macro(seq) RETURNS valid, invalid {
added = FOREACH $seq GENERATE $0 * 2, $1;
SPLIT added INTO $valid IF $1 == true, $invalid IF $1 == false;
}
data = LOAD 'case.csv' USING PigStorage(',') AS (value: int, valid: boolean);
P, Q = my_macro(data);
DUMP P;
DUMP Q;
{noformat}



Re: Contribute to PIG-3225

2013-03-11 Thread Dmitriy Ryaboy
+ Gianmarco


On Mon, Mar 11, 2013 at 11:20 AM, Sadari Jayawardena 
sjayawardena...@gmail.com wrote:

 I am a final-year undergraduate in Computer Science & Engineering. I have
 good experience in Java programming and am interested in mathematics and
 statistics. I would like to contribute to this project through GSoC 2013
 (https://issues.apache.org/jira/browse/PIG-3225).

 I went through the Wikipedia link provided. Could I be provided with
 additional references and study materials?


 Thanks in advance
 --
 Sadari Jayawardena

 Undergraduate
 Department of Computer Science & Engineering
 University of Moratuwa