[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir
[ https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598563#comment-13598563 ]

Jarek Jarcec Cecho commented on PIG-2591:

Hi [~cheolsoo] and [~prkommireddi],
thank you very much for your feedback. I agree with the proposed steps and I'll be more than happy to execute them myself. I just have a small comment: direct usage of {{FileLocalizer}} is not simple, as it requires a lot of initialization (for example, that class needs a valid {{PigContext}}). Thus I would propose moving the actual logic for getting the temporary directory into a standalone, easily reusable class {{TmpUtil}} (which is part of {{bugPIG-2591.patch}}).

Jarcec

Unit tests should not write to /tmp but respect java.io.tmpdir

Key: PIG-2591
URL: https://issues.apache.org/jira/browse/PIG-2591
Project: Pig
Issue Type: Bug
Components: tools
Reporter: Thomas Weise
Assignee: Jarek Jarcec Cecho
Fix For: 0.12
Attachments: bugPIG-2591.patch, PIG-2495.patch

Several tests use /tmp but should derive the temporary file location from java.io.tmpdir to avoid side effects (java.io.tmpdir is already set to a test-run-specific location in build.xml).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
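The standalone helper proposed in this comment could look roughly like the Java sketch below. It is not the code from bugPIG-2591.patch: the class name TmpDirSketch, the "temp-" prefix and the trailing-slash handling are assumptions made for illustration. Only the property name pig.temp.dir and the /tmp default are taken from this thread.

```java
import java.util.Properties;
import java.util.Random;

// Hypothetical stand-in for a TmpUtil-style helper; not the class from the patch.
final class TmpDirSketch {
    static final String PIG_TEMP_DIR = "pig.temp.dir"; // property discussed in this thread
    static final String DEFAULT_TEMP_DIR = "/tmp";     // historical default per the thread

    // Resolves a fresh temporary directory name below the configured base directory.
    static String getTemporaryDirectory(Properties props, Random random) {
        String base = props.getProperty(PIG_TEMP_DIR, DEFAULT_TEMP_DIR);
        // Assumption: normalise a trailing slash so the result has a single separator.
        if (base.endsWith("/")) {
            base = base.substring(0, base.length() - 1);
        }
        // nextInt(bound) is always non-negative, so the name never contains a minus sign.
        return base + "/temp-" + random.nextInt(Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(PIG_TEMP_DIR, "./build/tmp");
        System.out.println(getTemporaryDirectory(props, new Random()));
    }
}
```

Because the helper takes the Properties object as a parameter, the caller decides which configuration is consulted; nothing here reads PigContext.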
[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir
[ https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598564#comment-13598564 ]

Cheolsoo Park commented on PIG-2591:

I agree with Jarcec. Sorry if my comment wasn't clear enough.
[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir
[ https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598580#comment-13598580 ]

Prashant Kommireddi commented on PIG-2591:

Thanks guys. Here are a couple of cases I am thinking about:

1. On a Windows machine, java.io.tmpdir could default to something like C:\DOCUME~1\username\LOCALS~1\Temp\. This is fine if we are running unit tests on a local machine, but could it have unexpected results when Pig attempts to create that directory on HDFS for intermediate files?
2. Clarity: would it be better to separate these properties into pig.test.temp.dir and pig.temp.dir? It might be good (and expected) if we default pig.temp.dir to /tmp, as that has always been the case.

Jarcec, I agree with what you propose too. It makes sense to separate the helper methods into TmpUtil. The patch does not currently use java.io.tmpdir; how do you see that being used?
[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir
[ https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598600#comment-13598600 ]

Cheolsoo Park commented on PIG-2591:

[~prkommireddi],
# If we set pig.temp.dir to ./build, it would work for both Linux and Windows.
# I am not proposing to change the default value of pig.temp.dir. I am proposing to set it for unit tests only, in build.xml. I don't think we need to introduce an additional property.

Do you agree?
[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir
[ https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598607#comment-13598607 ]

Prashant Kommireddi commented on PIG-2591:

1. Agreed (assuming we are no longer considering java.io.tmpdir and are using ./build instead?).
2. I am a bit unclear on the reasons behind reusing pig.temp.dir. It's a bit confusing to me that pig.temp.dir is set differently in different places for tests vs. the intermediate store.
[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir
[ https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598620#comment-13598620 ]

Cheolsoo Park commented on PIG-2591:

PIG-2995 explains the reasoning. I think that the advantage of having a property such as pig.temp.dir is that it can be used on a case-by-case basis without having to change code. So I don't understand why setting the property to different values for different purposes is confusing.

As far as I understand, both PIG-2995 and PIG-2591 are trying to solve the same problem. When you have automated builds, you want to control where builds generate temporary files. In fact, there are two kinds of files that tests generate. As you pointed out: 1) intermediate files and 2) output files (e.g. STORE foo INTO '/tmp'). To me, controlling them with a single property sounds better than having two properties. Does this make sense?
[jira] [Commented] (PIG-3241) ConcurrentModificationException in POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598965#comment-13598965 ]

Rohini Palaniswamy commented on PIG-3241:

[~dvryaboy],
No. We are currently running on Pig 0.10 only. Pig 0.11 testing is blocked due to lack of QE resources to certify new features.
{code}
private Map<Object, List<Tuple>> rawInputMap = Maps.newHashMap();
private Map<Object, List<Tuple>> processedInputMap = Maps.newHashMap();
{code}
I don't think it has to do with Hadoop 2.0. A ConcurrentHashMap should have been used in the code. The exception is not visible in unit tests because spill does not happen there.

ConcurrentModificationException in POPartialAgg

Key: PIG-3241
URL: https://issues.apache.org/jira/browse/PIG-3241
Project: Pig
Issue Type: Bug
Affects Versions: 0.11
Reporter: Lohit Vijayarenu
Priority: Blocker
Fix For: 0.12, 0.11.1

While running a few Pig scripts against Hadoop 2.0, I consistently see ConcurrentModificationException:
{noformat}
at java.util.HashMap$HashIterator.remove(HashMap.java:811)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregate(POPartialAgg.java:365)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.aggregateSecondLevel(POPartialAgg.java:379)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.getNext(POPartialAgg.java:203)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:729)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153)
{noformat}
It looks like rawInputMap is being modified while elements are removed from it.
[jira] [Commented] (PIG-3241) ConcurrentModificationException in POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599008#comment-13599008 ]

Dmitriy V. Ryaboy commented on PIG-3241:

Thing is, we are running this in production on 1.0 and don't observe this error.
[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir
[ https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599030#comment-13599030 ]

Cheolsoo Park commented on PIG-2591:

[~jarcec] and [~prkommireddi], I thought about this more last night. I withdraw my agreement on Jarcec's patch.

* I don't think we want to have the following method:
{code}
public static String getTemporaryDirectory() {
    return getTemporaryDirectory(System.getProperties(), random);
}
{code}
Let's say that we set pig.temp.dir to /x in system properties (e.g. sysproperty in build.xml). FileLocalizer.getTemporaryPath() is not affected by this, because pig.temp.dir is not set in PigContext, so Pig will still generate intermediate data under file:///tmp. Now if we do store foo into TmpUtil.getTemporaryDirectory(), unit tests will generate temporary files in two directories: output files in /x and intermediate files in /tmp.
{quote}
The direct usage of FileLocalizer is not simple as it requires a lot of initialization (for example that class needs valid PigContext).
{quote}
In fact, this is good, because that's how we guarantee that all temporary files go into a single directory controlled by PigContext. So we should use FileLocalizer.getTemporaryPath().
* I was wrong about setting pig.temp.dir in system properties in build.xml for unit tests. In fact, it won't be propagated to PigContext. So we should set pig.temp.dir in PigContext.
* All of these are only applicable to local mode (i.e. file://). If unit tests run on a mini cluster, we shouldn't set pig.temp.dir. Currently, the mini cluster already generates files under ./build, so we don't need to worry about them.
* We should use File.createTempFile() for local temporary files that are not generated by Pig (e.g. some tests generate input files locally and copy them to the mini cluster). Then java.io.tmpdir will be honored automatically.

I think I covered all the cases. Please let me know what you think.
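The point about File.createTempFile() in the comment above can be illustrated with a short Java sketch. The two-argument File.createTempFile(prefix, suffix) creates the file in the JVM's default temporary directory, which is taken from the java.io.tmpdir system property, so a test needs no hard-coded /tmp. The class and method names below are hypothetical and not part of Pig.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical test helper; the names are illustrative only.
final class TempFileDemo {
    // Creates an empty local file in the JVM's default temporary directory.
    static File createLocalInput(String prefix) {
        try {
            File file = File.createTempFile(prefix, ".txt");
            file.deleteOnExit(); // clean up when the test JVM exits
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Checks that the file's parent is the directory named by java.io.tmpdir.
    static boolean isUnderJavaIoTmpdir(File file) {
        try {
            File tmpDir = new File(System.getProperty("java.io.tmpdir")).getCanonicalFile();
            return tmpDir.equals(file.getCanonicalFile().getParentFile());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File input = createLocalInput("pig-test-input");
        System.out.println(input + " under java.io.tmpdir: " + isUnderJavaIoTmpdir(input));
    }
}
```

Since java.io.tmpdir is typically supplied on the JVM command line by the build, the location changes without touching the test code.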
[jira] [Commented] (PIG-3241) ConcurrentModificationException in POPartialAgg
[ https://issues.apache.org/jira/browse/PIG-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599097#comment-13599097 ]

Rohini Palaniswamy commented on PIG-3241:

I was wrong about iterator.remove() of HashMap throwing ConcurrentModificationException. It happens only when entries are removed through the map itself while iterating over entrySet():
{code}
for (Entry<String, String> entry : map.entrySet()) {
    map.remove(entry.getKey());
}
{code}
throws ConcurrentModificationException, whereas
{code}
Iterator<Entry<String, String>> spillingIterator = map.entrySet().iterator();
Entry<String, String> entry = spillingIterator.next();
spillingIterator.remove();
{code}
does not throw ConcurrentModificationException.

So theoretically it should be fine if the map is accessed within a single thread. We need to investigate whether more than one thread is accessing POPartialAgg. It is possible that this is a 2.0 issue. A simple fix would be to change the map to ConcurrentHashMap, but we need to find the underlying cause to assess further impact. I am tied up with a few deadlines before going on vacation and don't have time to spare for the investigation. I am fine if anyone else can take a look at this. Cheolsoo?

[~lohit],
Can you change it to ConcurrentHashMap, compile, and see if that fixes the problem for you?
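The two removal patterns contrasted in this comment can be reproduced in a single thread with plain java.util.HashMap. The sketch below is a standalone demonstration; it does not use POPartialAgg, and the class and method names are made up for illustration.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Standalone demonstration; unrelated to the actual POPartialAgg code.
final class HashMapRemovalDemo {
    static Map<String, String> sampleMap() {
        Map<String, String> map = new HashMap<>();
        map.put("a", "1");
        map.put("b", "2");
        map.put("c", "3");
        return map;
    }

    // Removes through the map while a for-each loop iterates over entrySet().
    // Returns true if the fail-fast iterator reports the modification.
    static boolean removeViaMapThrows() {
        Map<String, String> map = sampleMap();
        try {
            for (Map.Entry<String, String> entry : map.entrySet()) {
                map.remove(entry.getKey());
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Removes through the iterator itself and returns the remaining size.
    static int removeViaIterator() {
        Map<String, String> map = sampleMap();
        Iterator<Map.Entry<String, String>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            it.next();
            it.remove();
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println("map.remove() during iteration throws: " + removeViaMapThrows());
        System.out.println("entries left after iterator.remove(): " + removeViaIterator());
    }
}
```

In the reported stack trace the exception comes from HashIterator.remove() itself, which is consistent with the suspicion that something else modified the map between next() and remove().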
[jira] [Commented] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir
[ https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599098#comment-13599098 ]

Prashant Kommireddi commented on PIG-2591:

[~cheolsoo] I agree with your assessment.
{quote}
I was wrong about setting pig.temp.dir in system properties in build.xml for unit tests. In fact, it won't be propagated to PigContext. So we should set pig.temp.dir in PigContext.
{quote}
That was partly a cause for my confusion :). Setting a system property would not be reflected in PigContext.

[~jarcec] apologies for the back-and-forth and thanks for your patience. But it's important we all agree on the best approach, as this patch affects the way we write and run tests :)
[jira] [Commented] (PIG-3194) Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
[ https://issues.apache.org/jira/browse/PIG-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599120#comment-13599120 ]

Prashant Kommireddi commented on PIG-3194:

Dmitriy, let me know what you think?

Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2

Key: PIG-3194
URL: https://issues.apache.org/jira/browse/PIG-3194
Project: Pig
Issue Type: Bug
Affects Versions: 0.11
Reporter: Kai Londenberg
Assignee: Prashant Kommireddi
Fix For: 0.11.1
Attachments: PIG-3194.patch

The changes to ObjectSerializer.java in the following commit
http://svn.apache.org/viewvc?view=revision&revision=1403934
break compatibility with Hadoop 0.20.2 clusters.
The reason is that the code uses methods from Apache Commons Codec 1.4 which are not available in Apache Commons Codec 1.3, the version that ships with Hadoop 0.20.2.
The offending methods are Base64.decodeBase64(String) and Base64.encodeBase64URLSafeString(byte[]).
If I revert these changes, Pig 0.11.0 candidate 2 works well with our Hadoop 0.20.2 clusters.
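One way to check which of the two Base64 methods named in the description is present on a given classpath is a small reflection probe. The Java sketch below is a diagnostic aid only, not part of PIG-3194.patch; the class name ApiProbe and its method are hypothetical. It uses only the standard library and reports false when the Commons Codec class is missing altogether.

```java
// Hypothetical diagnostic helper; not part of the attached patch.
final class ApiProbe {
    // Returns true if the named class is loadable and exposes the given public method.
    static boolean hasPublicMethod(String className, String methodName, Class<?>... parameterTypes) {
        try {
            Class.forName(className).getMethod(methodName, parameterTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String base64 = "org.apache.commons.codec.binary.Base64";
        // Per the issue description, both methods are absent from Commons Codec 1.3.
        System.out.println("decodeBase64(String): "
                + hasPublicMethod(base64, "decodeBase64", String.class));
        System.out.println("encodeBase64URLSafeString(byte[]): "
                + hasPublicMethod(base64, "encodeBase64URLSafeString", byte[].class));
    }
}
```

Running the probe with the cluster's own classpath shows whether code compiled against the newer library would hit a missing method at runtime.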
[jira] [Commented] (PIG-3194) Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
[ https://issues.apache.org/jira/browse/PIG-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599169#comment-13599169 ]

Dmitriy V. Ryaboy commented on PIG-3194:

We can skip the test if we detect that we are on Hadoop 2.0 by using org.junit.Assume.
Let's do that so we don't have to come back and fix this later.
[jira] [Commented] (PIG-3215) [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
[ https://issues.apache.org/jira/browse/PIG-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599245#comment-13599245 ]

Jonathan Coveney commented on PIG-3215:

Sorry for the delay, was out of town. Will try to review in the next couple of days.

[piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files

Key: PIG-3215
URL: https://issues.apache.org/jira/browse/PIG-3215
Project: Pig
Issue Type: New Feature
Components: piggybank
Reporter: MIYAKAWA Taku
Assignee: MIYAKAWA Taku
Labels: piggybank
Attachments: LTSVLoader-6.html, LTSVLoader.html, PIG-3215-6.patch, PIG-3215.patch

LTSV, or Labeled Tab-separated Values format, is now getting popular in Japan for log files, especially those of web servers. The goal of this jira is to add LTSVLoader to PiggyBank to load LTSV files.
LTSV is based on TSV, so columns are separated by tab characters. Additionally, each column includes a label and a value, separated by a ":" character.
Read about LTSV on http://ltsv.org/.

h4. Example LTSV file (access.log)
Columns are separated by tab characters.
{noformat}
host:host1.example.org	req:GET /index.html	ua:Opera/9.80
host:host1.example.org	req:GET /favicon.ico	ua:Opera/9.80
host:pc.example.com	req:GET /news.html	ua:Mozilla/5.0
{noformat}

h4. Usage 1: Extract fields from each line
Users can specify an input schema and get columns as Pig fields.
This example loads the LTSV file shown in the previous section.
{code}
-- Parses the access log and counts the number of lines
-- for each pair of the host column and the ua column.
access = LOAD 'access.log' USING org.apache.pig.piggybank.storage.LTSVLoader('host:chararray, ua:chararray');
grouped_access = GROUP access BY (host, ua);
count_for_host_ua = FOREACH grouped_access GENERATE group.host, group.ua, COUNT(access);
DUMP count_for_host_ua;
{code}
The text below will be printed out.
{noformat}
(host1.example.org,Opera/9.80,2)
(pc.example.com,Mozilla/5.0,1)
{noformat}

h4. Usage 2: Extract a map from each line
Users can get a map for each LTSV line. The key of a map entry is the label of the LTSV column. The value comes from the characters after ":" in the LTSV column.
{code}
-- Parses the access log and projects the user agent field.
access = LOAD 'access.log' USING org.apache.pig.piggybank.storage.LTSVLoader() AS (m:map[]);
user_agent = FOREACH access GENERATE m#'ua' AS ua;
DUMP user_agent;
{code}
The text below will be printed out.
{noformat}
(Opera/9.80)
(Opera/9.80)
(Mozilla/5.0)
{noformat}
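The label/value rule described in this issue (columns split on tabs, label and value split on the first ":") can be sketched in a few lines of Java. This is an illustration of the format only, not the LTSVLoader implementation from the attached patches; the class name is hypothetical, and skipping columns that lack a label is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative parser for one LTSV line; not the LTSVLoader from the patch.
final class LtsvLineSketch {
    // Splits a line on tabs, then splits each column at its first ':' into label and value.
    static Map<String, String> parseLine(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String column : line.split("\t")) {
            int colon = column.indexOf(':');
            if (colon <= 0) {
                continue; // assumption: ignore columns without a non-empty label
            }
            // Everything after the first ':' is the value, so values may contain ':'.
            fields.put(column.substring(0, colon), column.substring(colon + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "host:host1.example.org\treq:GET /index.html\tua:Opera/9.80";
        System.out.println(parseLine(line).get("ua"));
    }
}
```

Splitting at the first colon only is what keeps values such as timestamps or URLs intact.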
[jira] [Updated] (PIG-3223) AvroStorage does not handle comma separated input paths
[ https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johnny Zhang updated PIG-3223:

Attachment: PIG-3223.patch.txt

AvroStorage does not handle comma separated input paths

Key: PIG-3223
URL: https://issues.apache.org/jira/browse/PIG-3223
Project: Pig
Issue Type: Bug
Components: piggybank
Affects Versions: 0.10.0, 0.11
Reporter: Michael Kramer
Assignee: Johnny Zhang
Attachments: AvroStorage.patch, AvroStorage.patch-2, AvroStorageUtils.patch, AvroStorageUtils.patch-2, PIG-3223.patch.txt

In Pig 0.11, a patch was issued to AvroStorage to support globs and comma-separated input paths (PIG-2492). While this function works fine for glob-formatted input paths, it fails when given a standard comma-separated list of paths. fs.globStatus does not seem to be able to parse such a list, and a java.net.URISyntaxException is thrown when toURI is called on the path.
I have a working fix for this, but it's extremely ugly (basically checking if the string of input paths is globbed, otherwise splitting on ","). I'm sure there's a more elegant solution. I'd be happy to post the relevant methods and fixes if necessary.
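The workaround the reporter describes (check whether the input string is a glob, otherwise split on ",") might look like the Java sketch below. It is not taken from any of the attached patches: the class name, the method names and the set of characters treated as glob syntax are all assumptions, and it works on plain strings rather than Hadoop Path objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper sketching the reporter's workaround; not from the patches.
final class InputPathSketch {
    // Assumption: any of these characters marks the location as a glob pattern.
    private static final String GLOB_CHARS = "*?[]{}";

    static boolean looksLikeGlob(String location) {
        for (int i = 0; i < location.length(); i++) {
            if (GLOB_CHARS.indexOf(location.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Returns the glob unchanged as a single entry, or the comma-separated parts.
    static List<String> splitInputPaths(String location) {
        List<String> paths = new ArrayList<>();
        if (looksLikeGlob(location)) {
            paths.add(location); // leave expansion to the file system's glob support
            return paths;
        }
        for (String part : location.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        System.out.println(splitInputPaths("/data/a.avro,/data/b.avro"));
        System.out.println(splitInputPaths("/data/{2012,2013}/*.avro"));
    }
}
```

A known limitation of this approach is that an input mixing both forms, such as a comma-separated list of globs, is treated as one pattern, which may be part of why the reporter calls the fix ugly.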
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (35 issues)
Subscriber: pigdaily

Key         Summary
PIG-3244    Make PIG_HOME configurable
            https://issues.apache.org/jira/browse/PIG-3244
PIG-3238    Pig current releases lack a UDF Stuff(). This UDF deletes a specified length of characters and inserts another set of characters at a specified starting point.
            https://issues.apache.org/jira/browse/PIG-3238
PIG-3237    Pig current releases lack a UDF MakeSet(). This UDF returns a set value (a string containing substrings separated by , characters) consisting of the strings that have the corresponding bit in the first argument
            https://issues.apache.org/jira/browse/PIG-3237
PIG-3235    Enable DEBUG log messages in unit tests by default
            https://issues.apache.org/jira/browse/PIG-3235
PIG-3233    Deploy a Piggybank Jar
            https://issues.apache.org/jira/browse/PIG-3233
PIG-3215    [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
            https://issues.apache.org/jira/browse/PIG-3215
PIG-3210    Pig fails to start when it cannot write log to log files
            https://issues.apache.org/jira/browse/PIG-3210
PIG-3208    [zebra] TFile should not set io.compression.codec.lzo.buffersize
            https://issues.apache.org/jira/browse/PIG-3208
PIG-3205    Passing arguments to python script does not work with -f option
            https://issues.apache.org/jira/browse/PIG-3205
PIG-3198    Let users use any function from PigType - PigType as if it were builtlin
            https://issues.apache.org/jira/browse/PIG-3198
PIG-3194    Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
            https://issues.apache.org/jira/browse/PIG-3194
PIG-3190    Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
            https://issues.apache.org/jira/browse/PIG-3190
PIG-3183    rm or rmf commands should respect globbing/regex of path
            https://issues.apache.org/jira/browse/PIG-3183
PIG-3172    Partition filter push down does not happen when there is a non partition key map column filter
            https://issues.apache.org/jira/browse/PIG-3172
PIG-3166    Update eclipse .classpath according to ivy library.properties
            https://issues.apache.org/jira/browse/PIG-3166
PIG-3164    Pig current releases lack a UDF endsWith.This UDF tests if a given string ends with the specified suffix.
            https://issues.apache.org/jira/browse/PIG-3164
PIG-3141    Giving CSVExcelStorage an option to handle header rows
            https://issues.apache.org/jira/browse/PIG-3141
PIG-3123    Simplify Logical Plans By Removing Unneccessary Identity Projections
            https://issues.apache.org/jira/browse/PIG-3123
PIG-3122    Operators should not implicitly become reserved keywords
            https://issues.apache.org/jira/browse/PIG-3122
PIG-3114    Duplicated macro name error when using pigunit
            https://issues.apache.org/jira/browse/PIG-3114
PIG-3105    Fix TestJobSubmission unit test failure.
            https://issues.apache.org/jira/browse/PIG-3105
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
            https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
            https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
            https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2643    Use bytecode generation to make a performance replacement for InvokeForLong, InvokeForString, etc
            https://issues.apache.org/jira/browse/PIG-2643
PIG-2641    Create toJSON function for all complex types: tuples, bags and maps
            https://issues.apache.org/jira/browse/PIG-2641
PIG-2591    Unit tests should not write to /tmp but respect java.io.tmpdir
            https://issues.apache.org/jira/browse/PIG-2591
PIG-1914    Support load/store JSON data in Pig
            https://issues.apache.org/jira/browse/PIG-1914

You may edit
[jira] [Commented] (PIG-3239) Unable to return multiple values from a macro using SPLIT
[ https://issues.apache.org/jira/browse/PIG-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13599553#comment-13599553 ]

Johnny Zhang commented on PIG-3239:
-----------------------------------

[~luibelgo], please correct me if I am wrong, but I think 'SPLIT' doesn't work with 'OTHERWISE'; 'SPLIT' only works with 'IF':
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT

Actually, the script below works very well for me on your example:
{noforamt}
DEFINE my_macro(seq) RETURNS valid, invalid {
    added = FOREACH $seq GENERATE $0 * 2, $1;
    SPLIT added INTO $valid IF $1 == true, $invalid IF $1 ==false;
}
data = LOAD 'case.csv' USING PigStorage(',') AS (value: int, valid: boolean);
P, Q = my_macro(data);
DUMP P;
DUMP Q;
{noformat}

Unable to return multiple values from a macro using SPLIT
---------------------------------------------------------

Key: PIG-3239
URL: https://issues.apache.org/jira/browse/PIG-3239
Project: Pig
Issue Type: Bug
Affects Versions: 0.10.0
Environment: Apache Pig version 0.10.0-cdh4.2.0 (rexported) compiled Feb 15 2013, 12:19:17
             Linux 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Luis Belloch
Priority: Minor

Hi, I'm unable to return multiple values from a macro when the values come from a SPLIT. Here is a small example:
{code}
DEFINE my_macro(seq) RETURNS valid, invalid {
    added = FOREACH $seq GENERATE $0 * 2, $1;
    SPLIT added INTO $valid IF $1 == true, $invalid OTHERWISE;
}
data = LOAD 'case.csv' USING PigStorage(',') AS (value: int, valid: boolean);
P, Q = my_macro(data);
DUMP P;
DUMP Q;
{code}
Pig is unable to recognize the {{OTHERWISE}} side. The error is:
{{ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: at case.pig, line 3 Invalid macro definition: . Reason: Macro 'my_macro' missing return alias: invalid}}

A simple workaround is to force {{$invalid}} to be returned as a {{FOREACH}} result:
{code}
SPLIT added INTO $valid IF $1 == true, tmp_invalid OTHERWISE;
$invalid = FOREACH tmp_invalid GENERATE *;
{code}
Samples and logs are attached to the issue.
[jira] [Assigned] (PIG-3239) Unable to return multiple values from a macro using SPLIT
[ https://issues.apache.org/jira/browse/PIG-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johnny Zhang reassigned PIG-3239:
---------------------------------

Assignee: Johnny Zhang
[jira] [Commented] (PIG-3239) Unable to return multiple values from a macro using SPLIT
[ https://issues.apache.org/jira/browse/PIG-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13599554#comment-13599554 ]

Johnny Zhang commented on PIG-3239:
-----------------------------------

Sorry for the typo, I mean
{noformat}
DEFINE my_macro(seq) RETURNS valid, invalid {
    added = FOREACH $seq GENERATE $0 * 2, $1;
    SPLIT added INTO $valid IF $1 == true, $invalid IF $1 ==false;
}
data = LOAD 'case.csv' USING PigStorage(',') AS (value: int, valid: boolean);
P, Q = my_macro(data);
DUMP P;
DUMP Q;
{noformat}
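For reference, the reporter's workaround in the PIG-3239 description can be read as the complete script below. This is only a sketch that places the two workaround lines inside the macro from the report; it assumes the same 'case.csv' input and schema as the original example and has not been run here, so its behavior on 0.10.0 rests on the reporter's description.

```pig
-- Sketch: macro from the PIG-3239 report with the reporter's workaround applied.
DEFINE my_macro(seq) RETURNS valid, invalid {
    added = FOREACH $seq GENERATE $0 * 2, $1;
    -- Send the OTHERWISE branch to an ordinary alias instead of a return alias.
    SPLIT added INTO $valid IF $1 == true, tmp_invalid OTHERWISE;
    -- Assign the return alias explicitly so the macro defines $invalid.
    $invalid = FOREACH tmp_invalid GENERATE *;
}

data = LOAD 'case.csv' USING PigStorage(',') AS (value: int, valid: boolean);
P, Q = my_macro(data);
DUMP P;
DUMP Q;
```

Compared with the two-IF variant in the comments, this keeps the OTHERWISE semantics, so rows whose second field is null still land in the second output rather than being dropped by both conditions.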
Re: Contribute to PIG-3225
+ Gianmarco

On Mon, Mar 11, 2013 at 11:20 AM, Sadari Jayawardena sjayawardena...@gmail.com wrote:

I am a final-year undergraduate in Computer Science Engineering. I have good experience in Java programming and am interested in mathematics and statistics. I would like to contribute to this project through GSoC 2013 (https://issues.apache.org/jira/browse/PIG-3225).

I went through the Wikipedia link provided. Could I be provided with additional references and study materials?

Thanks in advance

--
Sadari Jayawardena
Undergraduate
Department of Computer Science Engineering
University of Moratuwa