[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546714#comment-13546714 ]

Scott Carey commented on PIG-3015:
----------------------------------

Try corrupting the file at a point inside the data block instead of inside the sync marker. The ability to recover from a corrupted file was added in response to corrupted data, not corrupted sync.

Rewrite of AvroStorage
----------------------
Key: PIG-3015
URL: https://issues.apache.org/jira/browse/PIG-3015
Project: Pig
Issue Type: Improvement
Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, TestInput.java, Test.java

The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage).

I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3015:
-------------------------------

Attachment: good.avro
            bad.avro
            Test.java
            TestInput.java

Hi Scott,

Thank you very much. That makes sense. After some trial and error, I managed to corrupt a data block correctly and was able to verify the recovery.

The output from 'java-tool.jar tojson bad.avro' is as follows:
{code}
Caused by: java.io.IOException: Block read partially, the data may be corrupt
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:194)
    ... 3 more
{code}

The output from my test program is as follows:
{code}
next(): 685 tell(): 8196
next(): 686 tell(): 8196
hasNext() or next() failed tell(): 8240
next(): 2656 tell(): 16432
next(): 2657 tell(): 16432
{code}

The data are sequential integers (0 ~ 1M). Here is the number of integers lost to a single corrupted data block with different sync intervals:

||Sync interval in bytes||Num. of lost values||
|32|1970|
|16,000|5389|

In summary,
* Avro can recover from a corrupted data block but not from a corrupted sync marker.
* The amount of data lost depends on the sync interval. By default, it's 16KB, but it can vary from 32 to 2^30 bytes. The larger the sync interval, the more data is lost.

I am attaching my test program and input files if anyone's interested.

Thanks!
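The recovery behaviour described in the comment above can be illustrated with a toy model. The following is a minimal sketch in Python, not Avro's file format or API: the block layout, the per-block CRC32 used to detect corruption, the marker value SYNC, and the names write_blocks and read_blocks are all assumptions made for illustration. It only models the data-block case (a damaged block is dropped and reading resumes after the next marker); it does not reproduce Avro's behaviour for a damaged sync marker.

```python
import struct
import zlib

# Hypothetical 16-byte marker; a real container file picks its own marker.
SYNC = bytes(range(240, 256))


def write_blocks(values, per_block):
    """Serialize ints as blocks: count, crc32, payload, then SYNC."""
    out = bytearray()
    for i in range(0, len(values), per_block):
        chunk = values[i:i + per_block]
        payload = struct.pack(">%di" % len(chunk), *chunk)
        out += struct.pack(">II", len(chunk), zlib.crc32(payload))
        out += payload + SYNC
    return bytes(out)


def read_blocks(data):
    """Yield values, skipping past the next SYNC whenever a block is bad."""
    pos = 0
    while pos < len(data):
        try:
            count, crc = struct.unpack_from(">II", data, pos)
            end = pos + 8 + 4 * count
            payload = data[pos + 8:end]
            if zlib.crc32(payload) != crc or data[end:end + 16] != SYNC:
                raise ValueError("corrupt block")
            yield from struct.unpack(">%di" % count, payload)
            pos = end + 16
        except (ValueError, struct.error):
            nxt = data.find(SYNC, pos)
            if nxt < 0:
                return  # no further marker, nothing left to recover
            pos = nxt + 16


values = list(range(100))
good = write_blocks(values, per_block=10)
bad = bytearray(good)
bad[210] ^= 0xFF  # flip one byte inside the payload of the fourth block
recovered = list(read_blocks(bytes(bad)))
print("lost values:", len(values) - len(recovered))
```

In this model the loss is always one whole block, which matches the observation that a larger sync interval (more values per block) means more values lost per corruption.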
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3015:
-------------------------------

Attachment: (was: Test.java)
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3015:
-------------------------------

Attachment: (was: TestInput.java)
[jira] [Commented] (PIG-2433) Jython import module not working if module path is in classpath
[ https://issues.apache.org/jira/browse/PIG-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547050#comment-13547050 ]

Cheolsoo Park commented on PIG-2433:
------------------------------------

+1. Thanks for the fix.

The test passes for me too. I also ran the e2e tests and found no failures.

Minor comment: when you commit the patch, can you remove the tab character in the following line?
{code}
+ <!-- Remove jython jar from mrapp-generated-classpath -->
{code}

Jython import module not working if module path is in classpath
---------------------------------------------------------------
Key: PIG-2433
URL: https://issues.apache.org/jira/browse/PIG-2433
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.10.0
Reporter: Daniel Dai
Assignee: Rohini Palaniswamy
Fix For: 0.12
Attachments: bad.log, good.log, PIG-2433-1.patch, PIG-2433.patch, TEST-org.apache.pig.test.TestScriptUDF.txt

This is a hole in PIG-1824. If the path of a Python module is in the classpath, the job dies with the message "could not instantiate 'org.apache.pig.scripting.jython.JythonFunction'".

Here is my observation: if the path of the Python module is in the classpath, the fileEntry we get in JythonScriptEngine:236 is __pyclasspath__/script$py.class instead of the script itself. Thus we cannot locate the script, and we skip the script in job.xml.

For example:
{code}
register 'scriptB.py' using org.apache.pig.scripting.jython.JythonScriptEngine as pig;
A = LOAD 'table_testPythonNestedImport' as (a0:long, a1:long);
B = foreach A generate pig.square(a0);
dump B;

scriptB.py:
#!/usr/bin/python
import scriptA
@outputSchema("x:{t:(num:double)}")
def sqrt(number):
    return (number ** .5)
@outputSchema("x:{t:(num:long)}")
def square(number):
    return long(scriptA.square(number))

scriptA.py:
#!/usr/bin/python
def square(number):
    return (number * number)
{code}

When we register scriptB.py, we use the jython library to figure out the modules scriptB depends on, in this case scriptA. However, if the current directory is in the classpath, instead of scriptA.py we get __pyclasspath__/scriptA.class. Then we try to put __pyclasspath__/script$py.class into job.jar, and Pig complains that __pyclasspath__/script$py.class does not exist.

This is exactly what TestScriptUDF.testPythonNestedImport is doing. In hadoop 20.x, the test still succeeds because MiniCluster takes the local classpath, so it can still find scriptA.py even if it is not in job.jar. However, the script will fail on a real cluster and on the MiniMRYarnCluster of hadoop 23.
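To make the mismatch described in the issue concrete, here is a small Python sketch of mapping a compiled classpath entry back to a script file. It is purely illustrative and is not the change made by the attached patches: the helper name source_for_entry, the search_dirs parameter, and the idea of resolving by module name are all assumptions.

```python
import os

PYCLASSPATH_PREFIX = "__pyclasspath__/"
COMPILED_SUFFIX = "$py.class"


def source_for_entry(file_entry, search_dirs):
    """Map a module file entry to a .py source path, or None if not found.

    Hypothetical helper: entries that already name a script are returned
    unchanged; compiled classpath entries are resolved by module name.
    """
    if not file_entry.startswith(PYCLASSPATH_PREFIX):
        return file_entry
    name = file_entry[len(PYCLASSPATH_PREFIX):]
    if name.endswith(COMPILED_SUFFIX):
        name = name[:-len(COMPILED_SUFFIX)]
    for directory in search_dirs:
        candidate = os.path.join(directory, name + ".py")
        if os.path.isfile(candidate):
            return candidate
    return None
```

With scriptA.py present in one of search_dirs, the entry __pyclasspath__/scriptA$py.class resolves to that file; an entry such as scriptB.py is passed through untouched.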
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547085#comment-13547085 ]

Alan Gates commented on PIG-2769:
---------------------------------

When I do a clean first as Cheolsoo advises it works, though I don't fully understand that since I started out with a clean checkout.

In the system tests, NegForeach_7, NegForeach_9, SyntaxErrors_4, and Macro_Error_4 all fail because the error messages have changed. You can find these in test/e2e/pig/tests/negative.conf and macro.conf. Search on each of the group names (NegForeach, ...) and then find the test number under that. In each case you can run the query and change the expected error message to match the new one.

Other than that, +1, patch looks good.

a simple logic causes very long compiling time on pig 0.10.0
------------------------------------------------------------
Key: PIG-2769
URL: https://issues.apache.org/jira/browse/PIG-2769
Project: Pig
Issue Type: Bug
Components: build
Affects Versions: 0.10.0
Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported)
Reporter: Dan Li
Assignee: Nick White
Fix For: 0.12
Attachments: case1.tar, PIG-2769.0.patch, TEST-org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TestInputSizeReducerEstimator.txt

We found that the following simple logic causes a very long compile time on pig 0.10.0, while with pig 0.8.1 everything is fine.

A = load 'A.txt' using PigStorage() AS (m: int);
B = FOREACH A {
    days_str = (chararray) (m == 1 ? 31: (m == 2 ? 28: (m == 3 ? 31: (m == 4 ? 30: (m == 5 ? 31: (m == 6 ? 30: (m == 7 ? 31: (m == 8 ? 31: (m == 9 ? 30: (m == 10 ? 31: (m == 11 ? 30:31)))))))))));
    GENERATE days_str as days_str;
}
store B into 'B';

and here's a simple input file example:

A.txt
1
2
3

The pig version we used in the test:
Apache Pig version 0.10.0-SNAPSHOT (rexported)
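When reproducing a report like this one, it can help to generate the nested conditional rather than type it, so the nesting depth can be varied and the parentheses stay balanced. The following is a minimal Python sketch under that assumption; the function name nested_bincond and the DAYS list are illustrative and do not come from the issue or its patches.

```python
DAYS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def nested_bincond(field, values):
    """Build a right-nested Pig bincond over `values`.

    The last entry of `values` is the final else branch, so twelve
    values give eleven nested conditions.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    expr = str(values[-1])
    # Wrap from the innermost condition outwards.
    for i in range(len(values) - 1, 0, -1):
        expr = "(%s == %d ? %s : %s)" % (field, i, values[i - 1], expr)
    return expr


print("days_str = (chararray) " + nested_bincond("m", DAYS) + ";")
```

Passing a shorter or longer list gives a shallower or deeper expression, which makes it easy to see how compile time changes with depth.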
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547105#comment-13547105 ]

Rohini Palaniswamy commented on PIG-2769:
-----------------------------------------

bq. When I do a clean first as Cheolsoo advises it works, though I don't fully understand that since I started out with a clean checkout.

The order in which the tests ran must have been the cause. It might have passed after doing a clean because you might have run just TestInputSizeReducerEstimator without running other tests. TestInputSizeReducerEstimator needs to be fixed to use new Configuration(false) instead of new Configuration(); the latter picks up the hadoop-site.xml left behind by a previously run test.
[jira] [Updated] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick White updated PIG-2769:
---------------------------

Attachment: PIG-2769.1.patch

Thanks! I've attached a version of the patch which fixes the e2e tests you mentioned.
[jira] [Updated] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-2769:
---------------------------

Attachment: PIG-2769.2.patch

Two small changes from the last patch. I fixed one issue in negative.conf where there was a instead of . Also changed TestInputSizeReducerEstimator as suggested by Rohini which fixed the unit test issue (thanks Rohini).
[jira] [Updated] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-2769:
---------------------------

Resolution: Fixed
Status: Resolved (was: Patch Available)

Patch checked in. Thanks Nick.
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (35 issues)

Subscriber: pigdaily

Key         Summary
PIG-3115    Distinct Build-in Function Doesn't Handle Null Bags
            https://issues.apache.org/jira/browse/PIG-3115
PIG-3108    HBaseStorage returns empty maps when mixing wildcard- with other columns
            https://issues.apache.org/jira/browse/PIG-3108
PIG-3105    Fix TestJobSubmission unit test failure.
            https://issues.apache.org/jira/browse/PIG-3105
PIG-3098    Add another test for the self join case
            https://issues.apache.org/jira/browse/PIG-3098
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3086    Allow A Prefix To Be Added To URIs In PigUnit Tests
            https://issues.apache.org/jira/browse/PIG-3086
PIG-3078    Make a UDF that, given a string, returns just the columns prefixed by that string
            https://issues.apache.org/jira/browse/PIG-3078
PIG-3073    POUserFunc creating log spam for large scripts
            https://issues.apache.org/jira/browse/PIG-3073
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3057    make readField protected to be able to override it if we extend PigStorage
            https://issues.apache.org/jira/browse/PIG-3057
PIG-3029    TestTypeCheckingValidatorNewLP has some path reference issues for cross-platform execution
            https://issues.apache.org/jira/browse/PIG-3029
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
            https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
            https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3025    TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline script needs simplification
            https://issues.apache.org/jira/browse/PIG-3025
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
            https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2957    TetsScriptUDF fail due to volume prefix in jar
            https://issues.apache.org/jira/browse/PIG-2957
PIG-2956    Invalid cache specification for some streaming statement
            https://issues.apache.org/jira/browse/PIG-2956
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2878    Pig current releases lack a UDF equalIgnoreCase.This function returns a Boolean value indicating whether string left is equal to string right. This check is case insensitive.
            https://issues.apache.org/jira/browse/PIG-2878
PIG-2873    Converting bin/pig shell script to python
            https://issues.apache.org/jira/browse/PIG-2873
PIG-2834    MultiStorage requires unused constructor argument
            https://issues.apache.org/jira/browse/PIG-2834
PIG-2824    Pushing checking number of fields into LoadFunc
            https://issues.apache.org/jira/browse/PIG-2824
PIG-2788    improved string interpolation of variables
            https://issues.apache.org/jira/browse/PIG-2788
PIG-2661    Pig uses an extra job for loading data in Pigmix L9
            https://issues.apache.org/jira/browse/PIG-2661
PIG-2645    PigSplit does not handle the case where SerializationFactory returns null
            https://issues.apache.org/jira/browse/PIG-2645
PIG-2507    Semicolon in paramenters for UDF results in parsing error
            https://issues.apache.org/jira/browse/PIG-2507
PIG-2433    Jython import module not working if module path is in classpath
            https://issues.apache.org/jira/browse/PIG-2433
PIG-2417    Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
            https://issues.apache.org/jira/browse/PIG-2417
PIG-2312    NPE when relation and column share the same name and used in Nested Foreach
            https://issues.apache.org/jira/browse/PIG-2312
PIG-1942    script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
            https://issues.apache.org/jira/browse/PIG-1942
PIG-1237    Piggybank MutliStorage - specify field to write in output
            https://issues.apache.org/jira/browse/PIG-1237

You may edit this subscription at:
[jira] [Updated] (PIG-2433) Jython import module not working if module path is in classpath
[ https://issues.apache.org/jira/browse/PIG-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-2433:
-----------------------------------

Resolution: Fixed
Status: Resolved (was: Patch Available)

Thanks for the review Cheolsoo. Removed the tab before committing. Committed to trunk.
[jira] [Commented] (PIG-3117) A debug mode in which pig does not delete temporary files
[ https://issues.apache.org/jira/browse/PIG-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547519#comment-13547519 ]

Daniel Dai commented on PIG-3117:
---------------------------------

Pig's intermediate files are not Snappy. By default they use InterStorage. For your request:
1. It should be fairly easy to retain temp files: just don't call FileLocalizer.deleteTempFiles() in Main.
2. To retain plain text, you may need to change Utils.getTmpFileCompressorName; I'm not sure if that's enough. Another approach is to write a decoder which invokes InterStorage to decode the tmp files.

A debug mode in which pig does not delete temporary files
---------------------------------------------------------
Key: PIG-3117
URL: https://issues.apache.org/jira/browse/PIG-3117
Project: Pig
Issue Type: Wish
Affects Versions: 0.10.0
Reporter: Ido Hadanny

When we debug our pig jobs on pre-production data, we usually find bugs we couldn't detect in our UT, as the env and data are not quite the same. When the final output of a script is not quite what we expect, we start to divide and conquer, running it line by line and inspecting the intermediate output of each stage.

It would be great if we could simply configure pig not to delete the intermediate MR outputs, and to store them as plaintext instead of snappy format.
[jira] [Commented] (PIG-3114) Duplicated macro name error when using pigunit
[ https://issues.apache.org/jira/browse/PIG-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547534#comment-13547534 ]

Daniel Dai commented on PIG-3114:
---------------------------------

Runs fine for me with 0.10.1. Which version are you using?

Duplicated macro name error when using pigunit
----------------------------------------------
Key: PIG-3114
URL: https://issues.apache.org/jira/browse/PIG-3114
Project: Pig
Issue Type: Bug
Components: parser
Reporter: Chetan Nadgire

I'm using PigUnit to test a pig script in which a macro is defined. The script runs fine on the cluster, but I get a parsing error with PigUnit. So I tried a very basic pig script with a macro and got a similar error.

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. line 9 null. Reason: Duplicated macro name 'my_macro_1'
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:988)
    at org.apache.pig.pigunit.pig.GruntParser.processPig(GruntParser.java:61)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
    at org.apache.pig.pigunit.pig.PigServer.registerScript(PigServer.java:56)
    at org.apache.pig.pigunit.PigTest.registerScript(PigTest.java:160)
    at org.apache.pig.pigunit.PigTest.assertOutput(PigTest.java:231)
    at org.apache.pig.pigunit.PigTest.assertOutput(PigTest.java:261)
    at FirstPigTest.MyPigTest.testTop2Queries(MyPigTest.java:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:176)
    at junit.framework.TestCase.runBare(TestCase.java:141)
    at junit.framework.TestResult$1.protect(TestResult.java:122)
    at junit.framework.TestResult.runProtected(TestResult.java:142)
    at junit.framework.TestResult.run(TestResult.java:125)
    at junit.framework.TestCase.run(TestCase.java:129)
    at junit.framework.TestSuite.runTest(TestSuite.java:255)
    at junit.framework.TestSuite.run(TestSuite.java:250)
    at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: Failed to parse: line 9 null. Reason: Duplicated macro name 'my_macro_1'
    at org.apache.pig.parser.QueryParserDriver.makeMacroDef(QueryParserDriver.java:406)
    at org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java:277)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:178)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1599)
    ... 30 more

The pig script which is failing:
{code:title=test.pig|borderStyle=solid}
DEFINE my_macro_1 (QUERY, A) RETURNS C {
    $C = ORDER $QUERY BY total DESC, $A;
} ;

data = LOAD 'input' AS (query:CHARARRAY);
queries_group = GROUP data BY query;
queries_count = FOREACH queries_group GENERATE group AS query, COUNT(data) AS total;
queries_ordered = my_macro_1(queries_count, query);
queries_limit = LIMIT queries_ordered 2;
STORE queries_limit INTO 'output';
{code}

If I remove the macro, PigUnit works fine. Even just defining the macro without using it results in the parsing error.
[jira] [Commented] (PIG-3115) Distinct Build-in Function Doesn't Handle Null Bags
[ https://issues.apache.org/jira/browse/PIG-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547543#comment-13547543 ]

Daniel Dai commented on PIG-3115:
---------------------------------

Looks good. I only have one question: in getDistinctFromNestedBags, do we need the null check if Initial always generates a legitimate bag? Is it just for bulletproofing?

Distinct Build-in Function Doesn't Handle Null Bags
---------------------------------------------------
Key: PIG-3115
URL: https://issues.apache.org/jira/browse/PIG-3115
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.10.0
Reporter: Nick White
Assignee: Nick White
Attachments: PIG-3115.1.patch

Calling Distinct(NULL) throws NPEs - it should handle this more gracefully. The attached patch makes Distinct(NULL) == {}, although it could return NULL.
[jira] [Commented] (PIG-3115) Distinct Build-in Function Doesn't Handle Null Bags
[ https://issues.apache.org/jira/browse/PIG-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547597#comment-13547597 ] Nick White commented on PIG-3115: - Yes, it's just defensive programming; I hit one of the NullPointerExceptions when writing a Pig script, so when I wrote the unit test to track down where it came from I added tests for all three static classes of Distinct...and so came across it then. Distinct Build-in Function Doesn't Handle Null Bags --- Key: PIG-3115 URL: https://issues.apache.org/jira/browse/PIG-3115 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.10.0 Reporter: Nick White Assignee: Nick White Attachments: PIG-3115.1.patch Calling Distinct(NULL) throws NPEs - it should handle this more gracefully. The attached patch makes Distinct(NULL) == {}, although it could return NULL.
[jira] [Updated] (PIG-3115) Distinct Build-in Function Doesn't Handle Null Bags
[ https://issues.apache.org/jira/browse/PIG-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3115: Fix Version/s: 0.12 Distinct Build-in Function Doesn't Handle Null Bags --- Key: PIG-3115 URL: https://issues.apache.org/jira/browse/PIG-3115 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.10.0 Reporter: Nick White Assignee: Nick White Fix For: 0.12 Attachments: PIG-3115.1.patch Calling Distinct(NULL) throws NPEs - it should handle this more gracefully. The attached patch makes Distinct(NULL) == {}, although it could return NULL.
[jira] [Updated] (PIG-3115) Distinct Build-in Function Doesn't Handle Null Bags
[ https://issues.apache.org/jira/browse/PIG-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3115: Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Patch committed to trunk. Thanks Nick! Distinct Build-in Function Doesn't Handle Null Bags --- Key: PIG-3115 URL: https://issues.apache.org/jira/browse/PIG-3115 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.10.0 Reporter: Nick White Assignee: Nick White Attachments: PIG-3115.1.patch Calling Distinct(NULL) throws NPEs - it should handle this more gracefully. The attached patch makes Distinct(NULL) == {}, although it could return NULL.
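For readers following the PIG-3115 thread, the behavior described in the issue (Distinct(NULL) == {} instead of an NPE) can be illustrated with a small plain-Java sketch. This is not the code from PIG-3115.1.patch: the class NullSafeDistinct and its distinct method are hypothetical, and java.util collections stand in for Pig bags.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for a distinct function; this is not Pig's Distinct class. */
class NullSafeDistinct {

    /** Returns the distinct elements of bag, or an empty set when bag is null. */
    static <T> Set<T> distinct(Collection<T> bag) {
        if (bag == null) {
            // Defensive check: treat a null bag as an empty one instead of throwing an NPE.
            return Collections.emptySet();
        }
        // LinkedHashSet drops duplicates while keeping first-seen order.
        return new LinkedHashSet<>(bag);
    }

    public static void main(String[] args) {
        Set<Object> fromNull = distinct(null);
        Set<Integer> fromList = distinct(Arrays.asList(1, 2, 2, 3));
        System.out.println(fromNull);
        System.out.println(fromList);
    }
}
```

Returning an empty set rather than null means callers never need their own null check; as the issue notes, returning NULL would also have been a defensible choice.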
[jira] [Commented] (PIG-3114) Duplicated macro name error when using pigunit
[ https://issues.apache.org/jira/browse/PIG-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547657#comment-13547657 ] Chetan Nadgire commented on PIG-3114: - Hi Daniel, I am building pig.jar and pigunit.jar from branch-0.11. When I tried branch-0.10, I got a different error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. null at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549) at org.apache.pig.PigServer.registerQuery(PigServer.java:549) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:968) at org.apache.pig.pigunit.pig.GruntParser.processPig(GruntParser.java:61) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190) at org.apache.pig.pigunit.pig.PigServer.registerScript(PigServer.java:53) at org.apache.pig.pigunit.PigTest.registerScript(PigTest.java:160) at org.apache.pig.pigunit.PigTest.assertOutput(PigTest.java:251) at FirstPigTest.MyPigTest.testTop2Queries(MyPigTest.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at junit.framework.TestCase.runTest(TestCase.java:168) at junit.framework.TestCase.runBare(TestCase.java:134) at junit.framework.TestResult$1.protect(TestResult.java:110) at junit.framework.TestResult.runProtected(TestResult.java:128) at junit.framework.TestResult.run(TestResult.java:113) at junit.framework.TestCase.run(TestCase.java:124) at junit.framework.TestSuite.runTest(TestSuite.java:232) at junit.framework.TestSuite.run(TestSuite.java:227) at
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:79) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.NullPointerException at java.io.File.<init>(File.java:222) at org.apache.pig.parser.QueryParserUtils.getFileFromImportSearchPath(QueryParserUtils.java:205) at org.apache.pig.parser.QueryParserDriver.getMacroFile(QueryParserDriver.java:352) at org.apache.pig.parser.QueryParserDriver.makeMacroDef(QueryParserDriver.java:411) at org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java:270) at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:171) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598) ... 29 more Thanks, Chetan Duplicated macro name error when using pigunit -- Key: PIG-3114 URL: https://issues.apache.org/jira/browse/PIG-3114 Project: Pig Issue Type: Bug Components: parser Reporter: Chetan Nadgire I'm using PigUnit to test a Pig script within which a macro is defined. Pig runs fine on the cluster, but I get a parsing error with PigUnit. So I tried a very basic Pig script with a macro and got a similar error. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. line 9 null.
Reason: Duplicated macro name 'my_macro_1' at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546) at org.apache.pig.PigServer.registerQuery(PigServer.java:516) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:988) at org.apache.pig.pigunit.pig.GruntParser.processPig(GruntParser.java:61) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194) at org.apache.pig.pigunit.pig.PigServer.registerScript(PigServer.java:56) at org.apache.pig.pigunit.PigTest.registerScript(PigTest.java:160) at
[jira] [Commented] (PIG-3113) Shell command execution hangs job
[ https://issues.apache.org/jira/browse/PIG-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547669#comment-13547669 ] Daniel Dai commented on PIG-3113: - From the post, it seems we can process the input and output streams before calling waitFor? Shell command execution hangs job - Key: PIG-3113 URL: https://issues.apache.org/jira/browse/PIG-3113 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.1 Reporter: James Executing a shell command inside a Pig script has the potential to deadlock the job. For example, the following statement will block when somebigfile.txt is sufficiently large: {code} %declare input `cat /path/to/somebigfile.txt` {code} This happens because PreprocessorContext.executeShellCommand(String) incorrectly uses Runtime.exec(). The sub-process's stderr and stdout streams should be read in a separate thread to prevent p.waitFor() from hanging when the sub-process's output is larger than the output buffer. Per the Java Process class javadoc: "Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock." See http://www.javaworld.com/jw-12-2000/jw-1229-traps.html for a correct solution.
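To make the suggestion in this comment concrete, here is a minimal sketch of draining the sub-process's output before calling waitFor(), so the child can never block on a full pipe buffer. It is not Pig's PreprocessorContext code: the class ShellCapture and its run method are hypothetical. As a simplifying assumption it merges stderr into stdout with ProcessBuilder.redirectErrorStream(true), which lets a single reader suffice in place of the separate threads the issue describes, and the usage line assumes a POSIX sh is available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; this is not Pig's PreprocessorContext. */
class ShellCapture {

    /** Runs a command and returns everything it wrote to stdout and stderr. */
    static String run(String... command) {
        try {
            ProcessBuilder builder = new ProcessBuilder(command);
            // Assumption: merging stderr into stdout is acceptable, so one reader is enough.
            builder.redirectErrorStream(true);
            Process process = builder.start();
            // We send no input, so close the child's stdin right away.
            process.getOutputStream().close();

            ByteArrayOutputStream captured = new ByteArrayOutputStream();
            try (InputStream out = process.getInputStream()) {
                byte[] buffer = new byte[8192];
                int n;
                // Read to end-of-stream BEFORE waiting, so the pipe buffer never fills up.
                while ((n = out.read(buffer)) != -1) {
                    captured.write(buffer, 0, n);
                }
            }

            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("command exited with status " + exitCode);
            }
            return new String(captured.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for command", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(run("sh", "-c", "echo hello"));
    }
}
```

If stdout and stderr must stay separate, each stream needs its own reader thread, as the issue description suggests; the single-reader version above only works because the two streams are merged.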
[jira] [Updated] (PIG-842) PigStorage should support multi-byte delimiters
[ https://issues.apache.org/jira/browse/PIG-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-842: --- Assignee: Jeff Markham PigStorage should support multi-byte delimiters --- Key: PIG-842 URL: https://issues.apache.org/jira/browse/PIG-842 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Assignee: Jeff Markham Attachments: PigMultiByteJsonMetadata.java, PigMultiByteStorage.java, PigMultiByteTextOutputFormat.java Currently, PigStorage supports single-byte delimiters. Users have requested multi-byte delimiters. There are performance implications with multi-byte delimiters, i.e., instead of looking for a single byte, PigStorage should look for a pattern à la BinStorage.
[jira] [Commented] (PIG-842) PigStorage should support multi-byte delimiters
[ https://issues.apache.org/jira/browse/PIG-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547679#comment-13547679 ] Daniel Dai commented on PIG-842: I prefer a single PigStorage implementation for clarity and code maintenance reasons. For performance, we can optimize the single-character case, which could bring the performance with a single-character delimiter nearly on par with the current implementation. PigStorage should support multi-byte delimiters --- Key: PIG-842 URL: https://issues.apache.org/jira/browse/PIG-842 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Assignee: Jeff Markham Attachments: PigMultiByteJsonMetadata.java, PigMultiByteStorage.java, PigMultiByteTextOutputFormat.java Currently, PigStorage supports single-byte delimiters. Users have requested multi-byte delimiters. There are performance implications with multi-byte delimiters, i.e., instead of looking for a single byte, PigStorage should look for a pattern à la BinStorage.
[jira] [Commented] (PIG-842) PigStorage should support multi-byte delimiters
[ https://issues.apache.org/jira/browse/PIG-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547680#comment-13547680 ] Jeff Markham commented on PIG-842: -- Agreed. After doing something separate, there's too much copy/paste code. PigStorage should support multi-byte delimiters --- Key: PIG-842 URL: https://issues.apache.org/jira/browse/PIG-842 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Assignee: Jeff Markham Attachments: PigMultiByteJsonMetadata.java, PigMultiByteStorage.java, PigMultiByteTextOutputFormat.java Currently, PigStorage supports single-byte delimiters. Users have requested multi-byte delimiters. There are performance implications with multi-byte delimiters, i.e., instead of looking for a single byte, PigStorage should look for a pattern à la BinStorage.
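The single-implementation idea discussed in the PIG-842 comments can be sketched roughly as follows: one split routine that takes a fast path when the delimiter is a single byte and falls back to pattern matching otherwise. This is an illustration only, not code from the attached PigMultiByteStorage.java or from PigStorage; the class DelimiterSplitter and its methods are hypothetical, and a real loader would work on Hadoop input records instead of plain byte arrays.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical splitter; this is not PigStorage or PigMultiByteStorage. */
class DelimiterSplitter {

    /** Splits line on delimiter, which may be one byte or several bytes long. */
    static List<byte[]> split(byte[] line, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<byte[]> fields = new ArrayList<>();
        int fieldStart = 0;
        if (delimiter.length == 1) {
            // Fast path: a plain byte comparison, as with a single-byte delimiter today.
            byte d = delimiter[0];
            for (int i = 0; i < line.length; i++) {
                if (line[i] == d) {
                    fields.add(Arrays.copyOfRange(line, fieldStart, i));
                    fieldStart = i + 1;
                }
            }
        } else {
            // General path: look for the whole delimiter pattern at each position.
            int i = 0;
            while (i <= line.length - delimiter.length) {
                if (matchesAt(line, i, delimiter)) {
                    fields.add(Arrays.copyOfRange(line, fieldStart, i));
                    i += delimiter.length;
                    fieldStart = i;
                } else {
                    i++;
                }
            }
        }
        // Whatever follows the last delimiter is the final field (possibly empty).
        fields.add(Arrays.copyOfRange(line, fieldStart, line.length));
        return fields;
    }

    private static boolean matchesAt(byte[] line, int offset, byte[] pattern) {
        for (int j = 0; j < pattern.length; j++) {
            if (line[offset + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    /** Convenience wrapper for experiments: UTF-8 strings in, strings out. */
    static List<String> splitToStrings(String line, String delimiter) {
        List<String> result = new ArrayList<>();
        for (byte[] field : split(line.getBytes(StandardCharsets.UTF_8),
                                  delimiter.getBytes(StandardCharsets.UTF_8))) {
            result.add(new String(field, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitToStrings("a||b||c", "||"));
        System.out.println(splitToStrings("a,b,,c", ","));
    }
}
```

Keeping both paths behind one method avoids the copy/paste duplication mentioned above, while the single-byte branch does no more work per byte than a dedicated single-byte loader would.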
[jira] [Commented] (PIG-3110) pig corrupts chararrays with trailing whitespace when converting them to long
[ https://issues.apache.org/jira/browse/PIG-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547697#comment-13547697 ] Daniel Dai commented on PIG-3110: - Here is the code that introduces the issue:
{code}
try {
    return Long.parseLong(str);
} catch (NumberFormatException e) {
    try {
        Double d = Double.valueOf(str);
        // Need to check for an overflow error
        if (d.doubleValue() > mMaxLong.doubleValue() + 1.0) {
            LogUtils.warn(CastUtils.class, "Value " + d + " too large for long", PigWarning.TOO_LARGE_FOR_INT, mLog);
            return null;
        }
        return Long.valueOf(d.longValue());
    }
}
{code}
Not sure why we still try double; it seems to add more confusion. Shall we change it? pig corrupts chararrays with trailing whitespace when converting them to long - Key: PIG-3110 URL: https://issues.apache.org/jira/browse/PIG-3110 Project: Pig Issue Type: Bug Components: data Affects Versions: 0.10.0 Reporter: Ido Hadanny
When trying to convert the following string into a long, Pig corrupts it.
data:
1703598819951657279 ,44081037
data1 = load 'data' using CSVLoader as (a: chararray ,b: int);
data2 = foreach data1 generate (long)a as a;
dump data2;
(1703598819951657216) --- last 2 digits are corrupted
data2 = foreach data1 generate (long)TRIM(a) as a;
dump data2;
(1703598819951657279) --- correct
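The corruption reported in PIG-3110 can be reproduced outside Pig with plain JDK calls. Long.parseLong rejects the trailing space, so the quoted code falls back to Double.valueOf, which ignores surrounding whitespace but can only represent values of this magnitude in steps of 256. The sketch below shows both the lossy path and one possible alternative, trimming before parsing. The class LongCastDemo and its methods are hypothetical and are not Pig's CastUtils; whether trimming is the right fix for Pig is a separate question for the issue.

```java
/** Hypothetical demo class; this is not Pig's CastUtils. */
class LongCastDemo {

    /** Mimics the fallback path in the quoted code: parse as double, then narrow to long. */
    static Long viaDouble(String str) {
        Double d = Double.valueOf(str); // ignores leading and trailing whitespace
        return Long.valueOf(d.longValue()); // loses precision above 2^53
    }

    /** One possible alternative: trim first, so the exact long parse succeeds. */
    static Long trimmed(String str) {
        try {
            return Long.parseLong(str.trim());
        } catch (NumberFormatException e) {
            return null; // not a valid long even after trimming
        }
    }

    public static void main(String[] args) {
        String input = "1703598819951657279 "; // note the trailing space
        System.out.println(viaDouble(input)); // prints 1703598819951657216
        System.out.println(trimmed(input));   // prints 1703598819951657279
    }
}
```

The first result matches the corrupted value in the report: between 2^60 and 2^61 adjacent doubles are 256 apart, and 1703598819951657216 is the multiple of 256 nearest to the input.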