[jira] [Commented] (PIG-2828) DataType.compare null
[ https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672880#comment-13672880 ]

Aniket Mokashi commented on PIG-2828:
-------------------------------------

The DataType compare API is a little broken.

public static int compare(Object o1, Object o2)
    uses reflection to infer the datatypes of o1 and o2.

public static int compare(Object o1, Object o2, byte dt1, byte dt2)
    doesn't use reflection; however, callers of this API use reflection and also deal with NULLs.

Currently, callers of the second API handle NULLs in similar but not consistent ways. We can refactor the API to avoid reflection and handle NULLs consistently in a separate JIRA.

Right now, TOP, which uses the second API directly, fails with an NPE if o1 or o2 holds null data. We should fix that with NULL non-NULL semantics.

DataType.compare null
---------------------
Key: PIG-2828
URL: https://issues.apache.org/jira/browse/PIG-2828
Project: Pig
Issue Type: Bug
Reporter: Haitao Yao
Attachments: DataType.patch, test.patch

While using TOP, if the DataBag contains a null value to compare, the following exception is thrown:

Caused by: java.lang.NullPointerException
    at org.apache.pig.data.DataType.compare(DataType.java:427)
    at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:97)
    at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:1)
    at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
    at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
    at java.util.PriorityQueue.offer(PriorityQueue.java:329)
    at java.util.PriorityQueue.add(PriorityQueue.java:306)
    at org.apache.pig.builtin.TOP.updateTop(TOP.java:141)
    at org.apache.pig.builtin.TOP.exec(TOP.java:116)

code: (TOP.java, starts with line 91)

    Object field1 = o1.get(fieldNum);
    Object field2 = o2.get(fieldNum);
    if (!typeFound) {
        datatype = DataType.findType(field1);
        typeFound = true;
    }
    return DataType.compare(field1, field2, datatype, datatype);

The reason is that if typeFound is true, the datatype is not null, and field1 is null, the script fails. So we need to check whether field1 is null.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
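The null handling discussed in the PIG-2828 message above can be sketched as a comparator that never hands a null field to the typed comparison. This is a minimal illustration only, not Pig's TOP.TupleComparator or the attached patch: the class name NullSafeFieldComparator, the use of Object[] rows in place of Pig tuples, and the choice to sort nulls before non-nulls are all assumptions made for the example.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/**
 * Hypothetical stand-in for a per-field tuple comparator.
 * Rows are plain Object[] here; Pig's Tuple type is not used.
 */
final class NullSafeFieldComparator implements Comparator<Object[]> {
    private final int fieldNum;

    NullSafeFieldComparator(int fieldNum) {
        this.fieldNum = fieldNum;
    }

    @Override
    public int compare(Object[] row1, Object[] row2) {
        Object field1 = row1[fieldNum];
        Object field2 = row2[fieldNum];
        // Assumption for this sketch: a null field sorts before any non-null field.
        if (field1 == null) {
            return (field2 == null) ? 0 : -1;
        }
        if (field2 == null) {
            return 1;
        }
        // Only non-null operands reach the typed comparison.
        return compareNonNull(field1, field2);
    }

    @SuppressWarnings("unchecked")
    private static int compareNonNull(Object a, Object b) {
        // Assumes both fields are mutually Comparable (same runtime type).
        return ((Comparable<Object>) a).compareTo(b);
    }

    public static void main(String[] args) {
        PriorityQueue<Object[]> queue =
                new PriorityQueue<>(4, new NullSafeFieldComparator(0));
        queue.add(new Object[] {3});
        queue.add(new Object[] {null});
        queue.add(new Object[] {1});
        // The row holding the null field is the smallest element.
        System.out.println(queue.peek()[0]); // prints "null"
    }
}
```

Putting the null checks ahead of the typed comparison keeps the queue's sift operations from ever dereferencing a null field, whichever ordering for nulls is finally chosen.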
[jira] [Updated] (PIG-2828) DataType.compare null
[ https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-2828:
    Status: Patch Available (was: Open)
[jira] [Updated] (PIG-2828) DataType.compare null
[ https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-2828:
    Attachment: PIG-2828.patch
[jira] [Commented] (PIG-2828) DataType.compare null
[ https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672896#comment-13672896 ]

Julien Le Dem commented on PIG-2828:

Sounds good to me.
Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11333/#review21312
---

Just minor comments on the naming of the variables. Java variable names should be camel case.

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
https://reviews.apache.org/r/11333/#comment44210
    goldenOutput

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
https://reviews.apache.org/r/11333/#comment44209
    output

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
https://reviews.apache.org/r/11333/#comment44211
    golden output

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
https://reviews.apache.org/r/11333/#comment44212
    fileOutput

- Rohini Palaniswamy

On May 29, 2013, 11:07 p.m., Viraj Bhat wrote:

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11333/
---
(Updated May 29, 2013, 11:07 p.m.)

Review request for pig and Rohini Palaniswamy.

Description
---
Null pointer exception when loading a union with null in its schema. The tests were also updated with a sample test case.

Diffs
---
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java 1485358
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java 1485358
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java 1485358
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testLoadAvrowithNulls.txt PRE-CREATION

Diff: https://reviews.apache.org/r/11333/diff/

Testing
---
Yes, all tests pass in the piggybank.

Thanks,
Viraj Bhat
Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11333/#review21316
---

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
https://reviews.apache.org/r/11333/#comment44214
    Isn't a load and store enough to reproduce the test case? Why such a long pig script?

- Rohini Palaniswamy
Re: Review Request: PIG-3331 Default values not written to Schema when specified in the output schema
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11355/#review21315
---

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigSchema2Avro.java
https://reviews.apache.org/r/11355/#comment44213
    Initialize defaultValue in a variable and pass defaultValue instead of using an if-else condition.

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
https://reviews.apache.org/r/11355/#comment44215
    Isn't a load and store enough to reproduce the test case? Why such a long pig script? Please try to keep the unit tests simple.

- Rohini Palaniswamy

On May 30, 2013, 2:29 a.m., Viraj Bhat wrote:

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11355/
---
(Updated May 30, 2013, 2:29 a.m.)

Review request for pig and Rohini Palaniswamy.

Description
---
Patch to write default values to the schema when the writer schema specified for AvroStorage contains them.

Diffs
---
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigSchema2Avro.java 1485826
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java 1485826
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/numbers.txt PRE-CREATION

Diff: https://reviews.apache.org/r/11355/diff/

Testing
---
Yes, against the Piggybank in Pig trunk/Pig 0.12.

Thanks,
Viraj Bhat
[jira] [Commented] (PIG-3341) Improving performance of loading datetime values
[ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673122#comment-13673122 ]

Rohini Palaniswamy commented on PIG-3341:
-----------------------------------------

bq. Before making the fix, I think there needs to be a little more clarity around exactly what formats are supported. For example, pig 0.11.1 currently supports datetime strings with no date - T00:00:00 produces a date in 1970. Is this intentional?

I don't think anyone is looking for such behaviour. It is not intuitive. I think we can go with option 1 (more is better) but also state which of the supported formats are not part of the W3C profile. We also need to return null if the value does not conform to the format instead of throwing an error.

Improving performance of loading datetime values
------------------------------------------------
Key: PIG-3341
URL: https://issues.apache.org/jira/browse/PIG-3341
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.11.1
Reporter: pat chan
Priority: Minor
Fix For: 0.12, 0.11.2

The performance of loading datetime values can be improved by about 25% by moving a single line in ToDate.java:

    public static DateTimeZone extractDateTimeZone(String dtStr) {
        Pattern pattern = Pattern.compile("(Z|(?=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

should become:

    static Pattern pattern = Pattern.compile("(Z|(?=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
    public static DateTimeZone extractDateTimeZone(String dtStr) {

There is no need to recompile the regular expression for every value. I'm not sure if this function is ever called concurrently, but Pattern objects are thread-safe anyway.

As a test, I created a file of 10M timestamps:

    for i in 0..1000
      puts '2000-01-01T00:00:00+23'
    end

I then ran this script:

    grunt> A = load 'data' as (a:datetime);
           B = filter A by a is null;
           dump B;

Before the change it took 160s. After the change, the script took 120s.

Another performance improvement can be made for invalid datetime values. If a datetime value is invalid, an exception is created and thrown, which is a costly way to fail a validity check. To test the performance impact, I created 10M invalid datetime values:

    for i in 0..1000
      puts '2000-99-01T00:00:00+23'
    end

In this test, the regex pattern was always recompiled. I then ran this script:

    grunt> A = load 'data' as (a:datetime);
           B = filter A by a is not null;
           dump B;

The script took 190s. I understand this could be considered an edge case and might not be worth changing. However, if there are use cases where invalid dates are part of normal processing, then you might consider fixing this.
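The two ideas in the PIG-3341 report above (compile the pattern once, and reject invalid values without throwing) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in ToDate.java: the class name ZoneSuffix, both method names, and both regular expressions are simplified stand-ins invented for the example, and the plausibility check covers only the month and day ranges.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the patterns are simplified, not Pig's. */
final class ZoneSuffix {
    // Compiled once when the class loads. Pattern is immutable and safe to share
    // across threads; the Matcher created per call is not shared.
    private static final Pattern ZONE =
            Pattern.compile("T[0-9.:]{0,12}(Z|[+-]\\d{2}(:?\\d{2})?)$");

    // Cheap shape check: four-digit year, month 01-12, day 01-31, then 'T'.
    private static final Pattern PLAUSIBLE_DATE =
            Pattern.compile("^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])T");

    /** Returns the trailing zone designator, or null when there is none. */
    static String extractZone(String dtStr) {
        if (dtStr == null) {
            return null;
        }
        Matcher m = ZONE.matcher(dtStr);
        return m.find() ? m.group(1) : null;
    }

    /** Returns false for obviously invalid input instead of throwing. */
    static boolean looksValid(String dtStr) {
        return dtStr != null && PLAUSIBLE_DATE.matcher(dtStr).find();
    }

    public static void main(String[] args) {
        System.out.println(extractZone("2000-01-01T00:00:00+23")); // prints "+23"
        System.out.println(extractZone("2000-01-01T00:00:00"));    // prints "null"
        System.out.println(looksValid("2000-99-01T00:00:00+23"));  // prints "false"
    }
}
```

A pre-check like looksValid lets a loader return null for malformed rows on the common path and reserve exception handling for cases the pattern cannot rule out.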
[jira] [Updated] (PIG-3327) Pig hits OOM when fetching task Reports
[ https://issues.apache.org/jira/browse/PIG-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-3327:
    Resolution: Fixed
    Status: Resolved (was: Patch Available)

Committed to trunk (0.12). Thanks Cheolsoo

Pig hits OOM when fetching task Reports
---------------------------------------
Key: PIG-3327
URL: https://issues.apache.org/jira/browse/PIG-3327
Project: Pig
Issue Type: Bug
Affects Versions: 0.10.1
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
Fix For: 0.12
Attachments: PIG-3327-1.patch

java.lang.OutOfMemoryError: GC overhead limit exceeded is hit with hadoop 23 by the pig script when a launched job has 80K+ maps. The TaskReport[] array is causing OOM.
[jira] [Commented] (PIG-3337) Fix remaining Window e2e tests
[ https://issues.apache.org/jira/browse/PIG-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673234#comment-13673234 ]

Hudson commented on PIG-3337:
-----------------------------

Integrated in Hive-trunk-h0.21 #2125 (See [https://builds.apache.org/job/Hive-trunk-h0.21/2125/])
PIG-3337: Fix remaining Window e2e tests (Revision 1487967)

Result = FAILURE
daijy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1487967
Files :
* /pig/trunk/CHANGES.txt
* /pig/trunk/test/e2e/harness/TestDriver.pm
* /pig/trunk/test/e2e/pig/drivers/TestDriverPig.pm

Fix remaining Window e2e tests
------------------------------
Key: PIG-3337
URL: https://issues.apache.org/jira/browse/PIG-3337
Project: Pig
Issue Type: Sub-task
Components: e2e harness
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.12
Attachments: PIG-3337-1.patch
[jira] [Commented] (PIG-3337) Fix remaining Window e2e tests
[ https://issues.apache.org/jira/browse/PIG-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673254#comment-13673254 ]

Rohini Palaniswamy commented on PIG-3337:
-----------------------------------------

[~daijy], Any idea why hive hudson messages are appearing here? Saw this before in PIG-2955 and PIG-3069 also
[jira] [Created] (PIG-3343) Refactor DataType.compare api to handle NULLs and reflection
Aniket Mokashi created PIG-3343:
--------------------------------
Summary: Refactor DataType.compare api to handle NULLs and reflection
Key: PIG-3343
URL: https://issues.apache.org/jira/browse/PIG-3343
Project: Pig
Issue Type: Bug
Components: data
Reporter: Aniket Mokashi
[jira] [Commented] (PIG-2828) DataType.compare null
[ https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673318#comment-13673318 ]

Aniket Mokashi commented on PIG-2828:
-------------------------------------

I have created https://issues.apache.org/jira/browse/PIG-3343 to track the API refactor.
[jira] [Created] (PIG-3344) Add a spatial datatype to Pig
Ahmed Eldawy created PIG-3344:
------------------------------
Summary: Add a spatial datatype to Pig
Key: PIG-3344
URL: https://issues.apache.org/jira/browse/PIG-3344
Project: Pig
Issue Type: New Feature
Components: parser
Reporter: Ahmed Eldawy

This issue is about adding a new datatype to Pig that abstracts a spatial attribute. Following OGC [http://www.opengeospatial.org/], we will add a new datatype called 'Geometry' that abstracts all standard shapes (e.g., Point, Polygon and Linestring).

This datatype is automatically parsed from either a Well-Known Text (WKT) or a Well-Known Binary (WKB) represented as a Hex string. These two types are the standard export formats for OGC shapes and they are supported by many existing tools including PostGIS [http://postgis.net/]. Exporting through PigStorage should default to a WKB represented as a Hex string and there will be additional functions to convert to WKT.

This new datatype maps internally to the class OGCGeometry [https://github.com/Esri/geometry-api-java/blob/master/src/com/esri/core/geometry/ogc/OGCGeometry.java] licensed under the Apache license. This class contains functionality to import/export to the WKT and WKB formats. Data manipulation functions for the new datatype will all be done through UDFs.

Currently, there is a spatial extension to Pig (called Pigeon) [https://github.com/aseldawy/pigeon] that provides basic spatial functionality via UDFs powered by the aforementioned library. Currently, it automatically converts WKB and WKT fields to the OGCGeometry class, performs the spatial operation, and produces the result back as WKB. Once the Geometry datatype is added, it will natively use it to avoid the conversion.
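To make "WKB represented as a Hex string" from the PIG-3344 description above concrete, here is a small sketch that encodes and decodes a 2D point by hand with only the JDK. It is an illustration, not Pigeon or the Esri library: the class WkbPointHex and its two methods are hypothetical names, and it relies only on the standard WKB point layout (one byte-order byte, a 4-byte geometry type of 1, then two 8-byte doubles).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper covering only WKB Point, for illustration. */
final class WkbPointHex {

    /** Encodes POINT(x y) as little-endian WKB, rendered as upper-case hex. */
    static String pointToWkbHex(double x, double y) {
        ByteBuffer buf = ByteBuffer.allocate(21).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 1);   // byte-order flag: 1 = little-endian
        buf.putInt(1);       // geometry type: 1 = Point
        buf.putDouble(x);
        buf.putDouble(y);
        StringBuilder hex = new StringBuilder(42);
        for (byte b : buf.array()) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    /** Decodes a hex-encoded WKB Point back into {x, y}. */
    static double[] wkbHexToPoint(String hex) {
        byte[] raw = new byte[hex.length() / 2];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        ByteBuffer buf = ByteBuffer.wrap(raw);
        // The first byte says how the remaining fields are laid out.
        buf.order(buf.get() == 1 ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        int type = buf.getInt();
        if (type != 1) {
            throw new IllegalArgumentException("not a WKB Point: type " + type);
        }
        return new double[] {buf.getDouble(), buf.getDouble()};
    }

    public static void main(String[] args) {
        String hex = pointToWkbHex(1.0, 2.0);
        System.out.println(hex);
        double[] xy = wkbHexToPoint(hex);
        System.out.println("POINT(" + xy[0] + " " + xy[1] + ")");
    }
}
```

For POINT(1 2) the hex string is the concatenation of "01", "01000000", "000000000000F03F" and "0000000000000040". A full implementation would delegate to a geometry library rather than hand-rolling each shape type.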
Re: A major addition to Pig. Working with spatial data
I've just created a new JIRA issue for the spatial functionality.
https://issues.apache.org/jira/browse/PIG-3344

This issue is all about the new datatype, which is the only thing that needs to be changed internally in Pig in this phase. Pigeon is already working with the ESRI library but it converts back and forth between the binary representation and the Geometry class. Once the new datatype is added, we can change Pigeon to work with this datatype too. We can still keep the current conversion functionality, as it allows the system to automatically perform the conversion from the bytearray datatype and so adds autodetect functionality when a column is not given a type in the schema.

I don't know if I should provide a patch for this issue myself or whether there is someone else who can work on it. I can of course do it, but I think it will take me some time to finish as I'm not yet familiar with the internals of Pig. Someone who is familiar with the parser would definitely do a better job here. I can focus on Pigeon and add more spatial functions there so that we have plenty of functions once the new datatype is added. I'm open to both solutions but I'm just checking with you.

Thanks
Ahmed

Best regards,
Ahmed Eldawy

On Wed, May 29, 2013 at 12:17 PM, Russell Jurney russell.jur...@gmail.com wrote:

Awesome. This would be a great addition to Pig. Please create a JIRA.

Russell Jurney http://datasyndrome.com

On May 29, 2013, at 8:51 AM, Ahmed Eldawy aseld...@gmail.com wrote:

Hi all,
Nick has pointed out to me an alternative GIS package that can replace JTS. ESRI has recently released a GIS package https://github.com/Esri/geometry-api-java under the Apache license. I changed Pigeon to work with that new package. I think it could be easier now to integrate this work with the main branch of Apache Pig. I will go on with the current project and add more spatial functionality. We can then add a new datatype to Apache and link it to those functions. The ESRI package contains a class OGCGeometry http://esri.github.io/geometry-api-java/javadoc/com/esri/core/geometry/ogc/OGCGeometry.html which can be linked to a new datatype 'Geometry'. Do you think we can rely on the new package and integrate the work with Apache Pig?

On May 23, 2013 11:40 PM, Ahmed Eldawy aseld...@gmail.com wrote:

Hi all,
Thanks for your help. I've started the project with minimal functionality as a start. It's currently hosted on GitHub. It is licensed under the Apache public license to make it easier to merge with Pig. Currently it has only a very few functions. I implemented one function from each of several types of functions (e.g., aggregate and create). I'll keep adding functions, and any contributions to the project are welcome. To begin with, I need an ANT build file that runs the tests, compiles and generates a jar file. I'm not familiar with ANT so any help with this is encouraged. Here's the project home page:
https://github.com/aseldawy/pigeon
If you have any comments or suggestions please contact me.

Best regards,
Ahmed Eldawy

On Mon, May 6, 2013 at 3:09 PM, Jonathan Coveney jcove...@gmail.com wrote:

Nick: the only issue is that the way types are implemented in Pig doesn't allow us to easily plug in types externally. Adding support for that would be cool, but a fair bit of work.

2013/5/6 Nick Dimiduk ndimi...@gmail.com

I'm not a lawyer, but I see no reason why this cannot be an external extension to Pig. It would behave the same way PostGIS is an external extension to Postgres. Any Apache issues would be toward general purpose enhancements, not specific to your project. Good on you!
-n

On Mon, May 6, 2013 at 10:12 AM, Ahmed Eldawy aseld...@gmail.com wrote:

I contacted the solr developers to see how JTS can be included in an Apache project. See
http://mail-archives.apache.org/mod_mbox/lucene-dev/201305.mbox/raw/%3C1367815102914-4060969.post%40n3.nabble.com%3E/
As far as I understand, they did not include it in the main solr project; rather, they created a separate project (spatial 4j) which is still licensed under the Apache license and refers to JTS. Users will have to download the JTS libraries separately to make it run. That's pretty much the same plan that Jonathan mentioned. We will still have the overhead of serializing/deserializing the shapes each time a function is called. Also, we will have to use the ugly bytearray data type for spatial data instead of creating its own data type (e.g., Geometry). I think using spatial 4j instead of JTS will not be sufficient for our case as we need to provide access to all spatial functions of JTS such as Union, Intersection, Difference, ... etc. This way we can claim conformity with OGC standards, which gives visibility and appreciation in the spatial community. I think also that this means I will not add any issues to JIRA as it is now a
Re: A major addition to Pig. Working with spatial data
Those JIRAs do best that are completed by one person driving them.
[jira] [Updated] (PIG-2828) DataType.compare null
[ https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheolsoo Park updated PIG-2828:
---
    Assignee: Aniket Mokashi
+1 to PIG-2828.patch. Looks good to me. [~aniket486], can you please replace all the tabs with 4 spaces when committing your patch? Thanks!
DataType.compare null
-
    Key: PIG-2828
    URL: https://issues.apache.org/jira/browse/PIG-2828
    Project: Pig
    Issue Type: Bug
    Reporter: Haitao Yao
    Assignee: Aniket Mokashi
    Attachments: DataType.patch, PIG-2828.patch, test.patch
While using TOP, if the DataBag contains a null value to compare, it generates the following exception:
Caused by: java.lang.NullPointerException
    at org.apache.pig.data.DataType.compare(DataType.java:427)
    at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:97)
    at org.apache.pig.builtin.TOP$TupleComparator.compare(TOP.java:1)
    at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
    at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
    at java.util.PriorityQueue.offer(PriorityQueue.java:329)
    at java.util.PriorityQueue.add(PriorityQueue.java:306)
    at org.apache.pig.builtin.TOP.updateTop(TOP.java:141)
    at org.apache.pig.builtin.TOP.exec(TOP.java:116)
code: (TOP.java, starts with line 91)
    Object field1 = o1.get(fieldNum);
    Object field2 = o2.get(fieldNum);
    if (!typeFound) {
        datatype = DataType.findType(field1);
        typeFound = true;
    }
    return DataType.compare(field1, field2, datatype, datatype);
The reason is that if typeFound is true, the datatype is not null, and field1 is null, the script fails. So we need to check whether field1 is null.
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
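The null check asked for in PIG-2828 can be sketched as a small wrapper around the real comparison. This is only an illustration, not the attached patch: the class name NullSafeCompare and the delegate parameter are hypothetical, and the choice to sort null before every non-null value is an assumption (the issue only asks that nulls be handled consistently). In Pig the delegate would presumably be the call to DataType.compare(field1, field2, datatype, datatype) quoted above.

```java
import java.util.Comparator;

// Hypothetical helper, not part of Pig: guards a comparison against null operands.
final class NullSafeCompare {
    private NullSafeCompare() {}

    /**
     * Compares two possibly-null fields.
     * Assumed ordering: two nulls are equal, and null sorts before any non-null value.
     * Only when both sides are non-null is the delegate (the real comparison) invoked.
     */
    static int compare(Object o1, Object o2, Comparator<Object> delegate) {
        if (o1 == null && o2 == null) {
            return 0;
        }
        if (o1 == null) {
            return -1;
        }
        if (o2 == null) {
            return 1;
        }
        return delegate.compare(o1, o2);
    }
}
```

With a guard like this in front of the comparison, the type lookup in the quoted TOP.java snippet would also need to wait for the first non-null field, since inferring the type from a null field1 would record the wrong type for later rows.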
[jira] [Updated] (PIG-3279) Support nested RANK
[ https://issues.apache.org/jira/browse/PIG-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Johnny Zhang updated PIG-3279:
--
    Attachment: PIG-3279-3.patch.txt
Thanks a lot for your comments, [~daijy]! Much appreciated. I changed LogToPhyTranslationVisitor.java:
1. For the RANK BY operation, only include POSort -> POCounter -> PORank -> POForEach. The current physical plan looks like:
c: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-42
|
|---c: New For Each(true)[bag] - scope-41
    |   |
    |   RelationToExpressionProject[bag][*] - scope-32
    |   |
    |   |---New For Each(false,true)[tuple] - scope-40
    |       |   |
    |       |   Project[long][0] - scope-38
    |       |   |
    |       |   Project[bag][2] - scope-39
    |       |
    |       |---d: PORank[tuple] - scope-37
    |           |   |
    |           |   Project[int][0] - scope-34
    |           |
    |           |---d: POCounter[tuple] - scope-36
    |               |   |
    |               |   Project[int][0] - scope-34
    |               |
    |               |---d: POSort[tuple]() - scope-35
    |                   |   |
    |                   |   Project[int][0] - scope-34
    |                   |
    |                   |---Project[bag][1] - scope-33
    |
    |---b: Package[tuple]{chararray} - scope-29
        |
        |---b: Global Rearrange[tuple] - scope-28
            |
            |---b: Local Rearrange[tuple]{chararray}(false) - scope-30
                |   |
                |   Project[chararray][1] - scope-31
                |
                |---a: New For Each(false,false,false)[bag] - scope-27
                    |   |
                    |   Cast[chararray] - scope-19
                    |   |
                    |   |---Project[bytearray][0] - scope-18
                    |   |
                    |   Cast[chararray] - scope-22
                    |   |
                    |   |---Project[bytearray][1] - scope-21
                    |   |
                    |   Cast[int] - scope-25
                    |   |
                    |   |---Project[bytearray][2] - scope-24
                    |
                    |---a: Load(file:///home/xiaoyuz/PIG-new/pig/input1:org.apache.pig.builtin.PigStorage) - scope-17
2. For the RANK operation, there is no difference between nested and non-nested RANK, since there is no POPackage or global rearrange for non-nested RANK anyway.
However, I still get an exception for the RANK BY and RANK operations:
{noformat}
Caused by: java.lang.RuntimeException: Unable to read counter pig.counters.counter_2415405541993583480_-1
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PORank.addRank(PORank.java:165)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PORank.getNextTuple(PORank.java:134)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:281)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
    ... 13 more
{noformat}
Things are getting closer, but it is still not complete. Thanks.
Support nested RANK
---
    Key: PIG-3279
    URL: https://issues.apache.org/jira/browse/PIG-3279
    Project: Pig
    Issue Type: Improvement
    Reporter: Gianmarco De Francisci Morales
    Assignee: Johnny Zhang
    Attachments: PIG-3279-1.patch.txt, PIG-3279-2.patch.txt, PIG-3279-3.patch.txt
[jira] [Updated] (PIG-2828) DataType.compare null
[ https://issues.apache.org/jira/browse/PIG-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-2828: Attachment: PIG-2828-format.patch DataType.compare null - Key: PIG-2828 URL: https://issues.apache.org/jira/browse/PIG-2828 Project: Pig Issue Type: Bug Reporter: Haitao Yao Assignee: Aniket Mokashi Attachments: DataType.patch, PIG-2828-format.patch, PIG-2828.patch, test.patch
[jira] [Updated] (PIG-3345) Handle null in DateTime functions
[ https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Palaniswamy updated PIG-3345: Attachment: PIG-3345-1.patch Handle null in DateTime functions - Key: PIG-3345 URL: https://issues.apache.org/jira/browse/PIG-3345 Project: Pig Issue Type: Bug Affects Versions: 0.11.1 Reporter: Rohini Palaniswamy Assignee: Rohini Palaniswamy Fix For: 0.12 Attachments: PIG-3345-1.patch NPE is thrown in date time functions when a null value is passed.
[jira] [Updated] (PIG-3345) Handle null in DateTime functions
[ https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Palaniswamy updated PIG-3345: Status: Patch Available (was: Open) Handle null in DateTime functions - Key: PIG-3345 URL: https://issues.apache.org/jira/browse/PIG-3345 Project: Pig Issue Type: Bug Affects Versions: 0.11.1 Reporter: Rohini Palaniswamy Assignee: Rohini Palaniswamy Fix For: 0.12 Attachments: PIG-3345-1.patch NPE is thrown in date time functions when a null value is passed.
[jira] [Updated] (PIG-3285) Jobs using HBaseStorage fail to ship dependency jars
[ https://issues.apache.org/jira/browse/PIG-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-3285:
    Status: Open (was: Patch Available)
Canceling patch for now so that it does not show in Patch Available list.
Jobs using HBaseStorage fail to ship dependency jars
    Key: PIG-3285
    URL: https://issues.apache.org/jira/browse/PIG-3285
    Project: Pig
    Issue Type: Bug
    Reporter: Nick Dimiduk
    Assignee: Nick Dimiduk
    Fix For: 0.11.1
    Attachments: 0001-PIG-3285-Add-HBase-dependency-jars.patch, 0001-PIG-3285-Add-HBase-dependency-jars.patch, 1.pig, 1.txt, 2.pig
Launching a job consuming {{HBaseStorage}} fails out of the box. The user must specify {{-Dpig.additional.jars}} for HBase and all of its dependencies. Exceptions look something like this:
{noformat}
2013-04-19 18:58:39,360 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoClassDefFoundError: com/google/protobuf/Message
    at org.apache.hadoop.hbase.io.HbaseObjectWritable.<clinit>(HbaseObjectWritable.java:266)
    at org.apache.hadoop.hbase.ipc.Invocation.write(Invocation.java:139)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:612)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:975)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:84)
    at $Proxy7.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:136)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
{noformat}
[jira] [Commented] (PIG-3342) Allow conditions in case statement
[ https://issues.apache.org/jira/browse/PIG-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673690#comment-13673690 ]
Rohini Palaniswamy commented on PIG-3342:
-
Since it is slightly big, can you upload it in review board?
Allow conditions in case statement
--
    Key: PIG-3342
    URL: https://issues.apache.org/jira/browse/PIG-3342
    Project: Pig
    Issue Type: Improvement
    Components: parser
    Reporter: Cheolsoo Park
    Assignee: Cheolsoo Park
    Fix For: 0.12
    Attachments: PIG-3342.patch
PIG-3268 added case statement support. But conditions are currently not allowed in when branches. For example,
{code}
CASE
    WHEN i % 5 == 0 THEN '5n'
    WHEN i % 5 == 1 THEN '5n+1'
    WHEN i % 5 == 2 THEN '5n+2'
    WHEN i % 5 == 3 THEN '5n+3'
    ELSE '5n+4'
END
{code}
This is invalid now. However, it will be useful if it's allowed.
[jira] [Commented] (PIG-3341) Improving performance of loading datetime values
[ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673742#comment-13673742 ]
pat chan commented on PIG-3341:
---
Hi, you bring up two good design points.
1. Are more formats better for this use case? Some possible cons:
a) the spec becomes more complicated for formats that are probably unused. The simplest spec would be to conform to the W3C profile.
b) you will have to support all these formats forever
c) there could be a performance overhead to support the possibly unused formats
d) ToDate(s,f) and UDFs already give users the ability to handle any format that's needed.
e) asymmetry: it seems cleaner if the default parseable format is exactly the default printed format
2. What is the design philosophy for invalid conversions? Quietly turning invalid values into null seems like a possibly dangerous default, since it would be really hard to know whether your query on terabytes of data is encountering problems that are quietly being ignored. A safer philosophy would be to make the default as strict with the data as possible and then, if the user finds a legitimate case for null conversions, provide a way to enable them explicitly in the script.
cheers
Improving performance of loading datetime values
    Key: PIG-3341
    URL: https://issues.apache.org/jira/browse/PIG-3341
    Project: Pig
    Issue Type: Improvement
    Components: impl
    Affects Versions: 0.11.1
    Reporter: pat chan
    Priority: Minor
    Fix For: 0.12, 0.11.2
The performance of loading datetime values can be improved by about 25% by moving a single line in ToDate.java:
    public static DateTimeZone extractDateTimeZone(String dtStr) {
        Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
should become:
    static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
    public static DateTimeZone extractDateTimeZone(String dtStr) {
There is no need to recompile the regular expression for every value. I'm not sure if this function is ever called concurrently, but Pattern objects are thread-safe anyway.
As a test, I created a file of 10M timestamps:
    for i in 0..1000
        puts '2000-01-01T00:00:00+23'
    end
I then ran this script:
    grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
Before the change it took 160s. After the change, the script took 120s.
Another performance improvement can be made for invalid datetime values. If a datetime value is invalid, an exception is created and thrown, which is a costly way to fail a validity check. To test the performance impact, I created 10M invalid datetime values:
    for i in 0..1000
        puts '2000-99-01T00:00:00+23'
    end
In this test, the regex pattern was always recompiled. I then ran this script:
    grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;
The script took 190s. I understand this could be considered an edge case and might not be worth changing. However, if there are use cases where invalid dates are part of normal processing, then you might consider fixing this.
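The change proposed in PIG-3341 (compile the regular expression once instead of on every call) can be sketched as below. This is a stand-alone illustration, not Pig's ToDate.java: the class name TimeZoneSuffix and the method extract are hypothetical, and the method returns the matched zone text rather than a Joda DateTimeZone so that the example has no external dependencies. The lookbehind `(?<=` is an assumed reconstruction of the pattern, since the archived text appears to have lost its `<` characters and quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the hoisted pattern described in PIG-3341.
final class TimeZoneSuffix {
    // Compiled once when the class is loaded. Pattern instances are immutable
    // and safe to share between threads; only Matcher objects are per-call state.
    private static final Pattern TZ_PATTERN =
        Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    private TimeZoneSuffix() {}

    /** Returns the trailing zone designator (for example "Z" or "+23"), or null if there is none. */
    static String extract(String dtStr) {
        Matcher m = TZ_PATTERN.matcher(dtStr);
        return m.find() ? m.group(1) : null;
    }
}
```

For example, TimeZoneSuffix.extract("2000-01-01T00:00:00+23") yields "+23", while a date-only string such as "2000-01-01" yields null because the lookbehind requires a preceding time part.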
[jira] [Commented] (PIG-3341) Improving performance of loading datetime values
[ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673760#comment-13673760 ] Rohini Palaniswamy commented on PIG-3341: - The current behavior returns null if there is an invalid value while loading as datetime. As far as I have seen, Pig does not fail loading when there are invalid values, but UDFs do fail. Asking the old timers.. [~alangates]/[~daijy]/[~dvryaboy]/[~julienledem]/[~thejas], how should we handle invalid dates? Improving performance of loading datetime values Key: PIG-3341 URL: https://issues.apache.org/jira/browse/PIG-3341 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.11.1 Reporter: pat chan Priority: Minor Fix For: 0.12, 0.11.2
[jira] [Commented] (PIG-3345) Handle null in DateTime functions
[ https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673761#comment-13673761 ] Prashant Kommireddi commented on PIG-3345: -- Hi [~rohini], patch looks good. Would you like to add tests for ToDate* functions too (under testConversionBetweenDateTimeAndString())? Handle null in DateTime functions - Key: PIG-3345 URL: https://issues.apache.org/jira/browse/PIG-3345 Project: Pig Issue Type: Bug Affects Versions: 0.11.1 Reporter: Rohini Palaniswamy Assignee: Rohini Palaniswamy Fix For: 0.12 Attachments: PIG-3345-1.patch NPE is thrown in date time functions when a null value is passed.
[jira] [Commented] (PIG-3341) Improving performance of loading datetime values
[ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673780#comment-13673780 ] Dmitriy V. Ryaboy commented on PIG-3341: I don't think we are completely consistent, but turning invalid into null has been pretty standard. My personal preference is also to increment a counter for # of such conversions, and to log the first N occurrences (when N errors are encountered, log something to the effect of not logging this error any more because there's so much of it.) Improving performance of loading datetime values Key: PIG-3341 URL: https://issues.apache.org/jira/browse/PIG-3341 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.11.1 Reporter: pat chan Priority: Minor Fix For: 0.12, 0.11.2
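The "count every invalid conversion, log only the first N" preference described in the comment above could look roughly like the following. This is a sketch under stated assumptions, not Pig code: InvalidValueTracker, record, and count are hypothetical names, the logger is a plain Consumer rather than Pig's logging or Hadoop counters, and the message wording is made up for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical helper: counts all invalid values but logs only the first maxLogged of them.
final class InvalidValueTracker {
    private final long maxLogged;
    private final Consumer<String> logger;
    private final AtomicLong invalid = new AtomicLong();

    InvalidValueTracker(long maxLogged, Consumer<String> logger) {
        this.maxLogged = maxLogged;
        this.logger = logger;
    }

    /** Records one failed conversion; the caller is expected to substitute null for the value. */
    void record(String rawValue) {
        long n = invalid.incrementAndGet();
        if (n <= maxLogged) {
            logger.accept("Invalid value converted to null: " + rawValue);
        }
        if (n == maxLogged) {
            // Emitted exactly once, right after the Nth logged occurrence.
            logger.accept("Further invalid values will be counted but not logged");
        }
    }

    /** Total number of invalid values seen so far, including those that were not logged. */
    long count() {
        return invalid.get();
    }
}
```

Keeping the total in a counter means a user can still see afterwards how many values were silently turned into null, which addresses the concern raised earlier in the thread about problems being quietly ignored on large inputs.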
Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11333/
---
(Updated June 4, 2013, 12:15 a.m.)
Review request for pig and Rohini Palaniswamy.
Changes
---
Using MockStorage instead of PigStorage and comparing results inline for 4 records.
Description
---
Null pointer exception when loading a union with null in its schema. The test case was also updated with a sample test case.
Diffs (updated)
-
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java 1485358
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java 1485358
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java 1485358
Diff: https://reviews.apache.org/r/11333/diff/
Testing
---
Yes, all tests pass in the piggybank.
Thanks, Viraj Bhat
Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase
On June 3, 2013, 1:03 p.m., Rohini Palaniswamy wrote: Just minor comments on the naming of the variable. Java variable names should be camel case. Thanks, but now the verifyTxtResults method is not used any more - Viraj --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11333/#review21312 --- On June 4, 2013, 12:15 a.m., Viraj Bhat wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11333/ --- (Updated June 4, 2013, 12:15 a.m.) Review request for pig and Rohini Palaniswamy. Description --- Null pointer exception when loading a union with null in its schema. The test case was also updated with a sample test case. Diffs - http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java 1485358 http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java 1485358 http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java 1485358 Diff: https://reviews.apache.org/r/11333/diff/ Testing --- Yes, all tests pass in the piggybank. Thanks, Viraj Bhat
Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase
On June 2, 2013, 9:27 p.m., Cheolsoo Park wrote: http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java, line 1104 https://reviews.apache.org/r/11333/diff/5/?file=298357#file298357line1104 If you use mock.Storage here instead of PigStorage, you won't need the verifyTextResults method and extra output file. Can you please update your test? Please see org.apache.pig.builtin.mock.Storage.java. Added Mock Storage - Viraj --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11333/#review21305 --- On June 4, 2013, 12:15 a.m., Viraj Bhat wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11333/ --- (Updated June 4, 2013, 12:15 a.m.) Review request for pig and Rohini Palaniswamy. Description --- Null pointer exception when loading a union with null in its schema. The test case was also updated with a sample test case. Diffs - http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java 1485358 http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java 1485358 http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java 1485358 Diff: https://reviews.apache.org/r/11333/diff/ Testing --- Yes, all tests pass in the piggybank. Thanks, Viraj Bhat
[jira] [Updated] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema
[ https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Bhat updated PIG-3322:
    Attachment: test_loadavrowithnulls.avro
AVRO: AvroStorage give NPE on reading file with union as top level schema
-
    Key: PIG-3322
    URL: https://issues.apache.org/jira/browse/PIG-3322
    Project: Pig
    Issue Type: Bug
    Components: piggybank
    Affects Versions: 0.11.2
    Reporter: Egil Sorensen
    Assignee: Viraj Bhat
    Labels: patch
    Fix For: 0.12
    Attachments: PIG-3322_3.patch, test_loadavrowithnulls.avro
I am getting an NPE when using AvroStorage to load a file that has a schema like:
{code}
["null",{"type":"record","name":"TUPLE_0","fields":[{"name":"name","type":["null","string"],"doc":"autogenerated from Pig Field Schema"},{"name":"age","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"gpa","type":["null","double"],"doc":"autogenerated from Pig Field Schema"}]}]
{code}
E.g. see the e2e style test, which fails on this:
{code}
{
'num' => 4,
# storing file with Pig type tuple relying on conversion to record
# loading using stored schemas
'notmq' => 1,
'pig' => q\
a = load ':INPATH:/singlefile/studentcomplextab10k' using PigStorage() as (m:[], t:(name:chararray, age:int, gpa:double), b:{t:(name:chararray, age:int, gpa:double)});
b = foreach a generate t;
describe b;
store b into ':OUTPATH:.intermediate' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
exec;
-- Read back what was stored with Avro
u = load ':OUTPATH:.intermediate' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
describe u;
store u into ':OUTPATH:';
\,
'verify_pig_script' => q\
a = load ':INPATH:/singlefile/studentcomplextab10k' using PigStorage() as (m:[], t:(name:chararray, age:int, gpa:double), b:{t:(name:chararray, age:int, gpa:double)});
b = foreach a generate t;
describe b;
store b into ':OUTPATH:';
\,
},
{code}
[jira] [Updated] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema
[ https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-3322: Attachment: (was: expected_testLoadAvrowithNulls.txt) AVRO: AvroStorage give NPE on reading file with union as top level schema - Key: PIG-3322 URL: https://issues.apache.org/jira/browse/PIG-3322 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11.2 Reporter: Egil Sorensen Assignee: Viraj Bhat Labels: patch Fix For: 0.12 Attachments: PIG-3322_3.patch, test_loadavrowithnulls.avro
[jira] [Updated] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema
[ https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-3322: Attachment: (was: PIG-3322_2.patch) AVRO: AvroStorage give NPE on reading file with union as top level schema - Key: PIG-3322 URL: https://issues.apache.org/jira/browse/PIG-3322 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11.2 Reporter: Egil Sorensen Assignee: Viraj Bhat Labels: patch Fix For: 0.12 Attachments: PIG-3322_3.patch, test_loadavrowithnulls.avro
[jira] [Updated] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema
[ https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-3322: Attachment: (was: test_loadavrowithnulls.avro)
[jira] [Updated] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema
[ https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-3322: Attachment: PIG-3322_3.patch
Re: Review Request: PIG-3322 Fix the issue where NPE is thrown when reading a union which has nulls and add a testcase
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11333/#review21383 ---
Ship it!
Ship It!
- Rohini Palaniswamy
On June 4, 2013, 12:15 a.m., Viraj Bhat wrote:
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11333/ --- (Updated June 4, 2013, 12:15 a.m.)
Review request for pig and Rohini Palaniswamy.
Description --- Null pointer exception when loading a union with null in its schema. The tests were also updated with a sample test case.
Diffs -
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java 1485358
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java 1485358
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java 1485358
Diff: https://reviews.apache.org/r/11333/diff/
Testing --- Yes, all tests pass in the piggybank.
Thanks, Viraj Bhat
[jira] [Commented] (PIG-3341) Improving performance of loading datetime values
[ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673876#comment-13673876 ] pat chan commented on PIG-3341: ---
I was looking in the docs for any documentation on this topic. I found the following in http://wiki.apache.org/pig/UDFManual:
"The first thing to decide is what to do with invalid data. This depends on the format of the data. If the data is of type bytearray it means that it has not yet been converted to its proper type. In this case, if the format of the data does not match the expected type, a null value should be returned. If, on the other hand, the input data is of another type, this means that the conversion has already happened and the data should be in the correct format. This is the case with our example and that's why it throws an error (line 16.) Note that WrappedIOException is a helper class to convert the actual exception to an IOException. Also, note that lines 10-11 check if the input data is null or empty and if so returns null."
If I'm reading this correctly, it says that if the type of the input doesn't match the signature of the UDF, a null should be returned. However, I get this:
grunt> A = load 'o' as (a:bytearray);
grunt> B = foreach A generate ToDate(a); dump B;
2013-06-03 17:15:09,253 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1046: line 2, column 23 Multiple matching functions for org.apache.pig.builtin.ToDate with input schema: ({long}, {chararray}). Please use an explicit cast.
It also seems to be saying that if the types are right and the format is invalid, an error should be thrown. I just checked and yes, I get an error. However, this doesn't match Rohini's proposal to return a null instead. Also, as Dmitriy hinted, it's not philosophically consistent with loading behavior where invalid things turn into nulls.
2013-06-03 17:25:12,977 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2013-06-03 17:25:12,981 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B
BTW, the note about lines 10-11 isn't quite right. The code in the example doesn't have a check for null and so a null would cause an exception.
Improving performance of loading datetime values
Key: PIG-3341 URL: https://issues.apache.org/jira/browse/PIG-3341 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.11.1 Reporter: pat chan Priority: Minor Fix For: 0.12, 0.11.2
The performance of loading datetime values can be improved by about 25% by moving a single line in ToDate.java:
public static DateTimeZone extractDateTimeZone(String dtStr) {
    Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
should become:
static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
public static DateTimeZone extractDateTimeZone(String dtStr) {
There is no need to recompile the regular expression for every value. I'm not sure if this function is ever called concurrently, but Pattern objects are thread-safe anyway. As a test, I created a file of 10M timestamps:
for i in 0..1000 puts '2000-01-01T00:00:00+23' end
I then ran this script:
grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
Before the change it took 160s. After the change, the script took 120s.
Another performance improvement can be made for invalid datetime values. If a datetime value is invalid, an exception is created and thrown, which is a costly way to fail a validity check. To test the performance impact, I created 10M invalid datetime values:
for i in 0..1000 puts '2000-99-01T00:00:00+23' end
In this test, the regex pattern was always recompiled.
I then ran this script:
grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;
The script took 190s. I understand this could be considered an edge case and might not be worth changing. However, if there are use cases where invalid dates are part of normal processing, then you might consider fixing this.
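The "compile once" change proposed for ToDate.java can be illustrated with a small standalone sketch. This is not the Pig source: the class name TimeZoneSuffix and the method extractZoneSuffix are hypothetical, the method returns the matched suffix as a String rather than a Joda DateTimeZone, and the regular expression is written here with a lookbehind, `(?<=...)`, which is an assumption about the original source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the relevant part of ToDate.java.
class TimeZoneSuffix {
    // Compiled once when the class is loaded and reused for every value.
    // Pattern instances are immutable and safe to share between threads;
    // each call below creates its own Matcher, which is not shared.
    private static final Pattern TZ_PATTERN =
        Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    // Returns the trailing zone designator (for example "Z" or "+23"),
    // or null when the string carries no zone information.
    static String extractZoneSuffix(String dtStr) {
        Matcher matcher = TZ_PATTERN.matcher(dtStr);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractZoneSuffix("2000-01-01T00:00:00+23"));
        System.out.println(extractZoneSuffix("2000-01-01T00:00:00Z"));
        System.out.println(extractZoneSuffix("2000-01-01"));
    }
}
```

Only the Pattern is hoisted into a static field; the Matcher stays local to the call because Matcher objects carry per-match state and are not thread-safe.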
[jira] [Created] (PIG-3346) New property that controls the number of combined splits
Cheolsoo Park created PIG-3346: -- Summary: New property that controls the number of combined splits Key: PIG-3346 URL: https://issues.apache.org/jira/browse/PIG-3346 Project: Pig Issue Type: Improvement Components: impl Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.12
Currently, the size of combined splits can be configured by the {{pig.maxCombinedSplitSize}} property. Although this works fine most of the time, it can lead to an undesired situation where a single mapper ends up loading a lot of combined splits. This is particularly bad if Pig uploads them from S3. So it would be useful if the maximum number of combined splits could be configured via a property such as {{pig.maxCombinedSplitNum}}.
[jira] Subscription: PIG patch available
Issue Subscription Filter: PIG patch available (19 issues)
Subscriber: pigdaily
Key Summary
PIG-3345 Handle null in DateTime functions https://issues.apache.org/jira/browse/PIG-3345
PIG-3342 Allow conditions in case statement https://issues.apache.org/jira/browse/PIG-3342
PIG- Fix remaining Windows core unit test failures https://issues.apache.org/jira/browse/PIG-
PIG-3318 AVRO: 'default value' not honored when merging schemas on load with AvroStorage https://issues.apache.org/jira/browse/PIG-3318
PIG-3295 Casting from bytearray failing after Union (even when each field is from a single Loader) https://issues.apache.org/jira/browse/PIG-3295
PIG-3288 Kill jobs if the number of output files is over a configurable limit https://issues.apache.org/jira/browse/PIG-3288
PIG-3280 Document IN operator and CASE expression https://issues.apache.org/jira/browse/PIG-3280
PIG-3257 Add unique identifier UDF https://issues.apache.org/jira/browse/PIG-3257
PIG-3247 Piggybank functions to mimic OVER clause in SQL https://issues.apache.org/jira/browse/PIG-3247
PIG-3210 Pig fails to start when it cannot write log to log files https://issues.apache.org/jira/browse/PIG-3210
PIG-3199 Expose LogicalPlan via PigServer API https://issues.apache.org/jira/browse/PIG-3199
PIG-3166 Update eclipse .classpath according to ivy library.properties https://issues.apache.org/jira/browse/PIG-3166
PIG-3123 Simplify Logical Plans By Removing Unneccessary Identity Projections https://issues.apache.org/jira/browse/PIG-3123
PIG-3088 Add a builtin udf which removes prefixes https://issues.apache.org/jira/browse/PIG-3088
PIG-3015 Rewrite of AvroStorage https://issues.apache.org/jira/browse/PIG-3015
PIG-2828 Handle nulls in DataType.compare https://issues.apache.org/jira/browse/PIG-2828
PIG-2248 Pig parser does not detect when a macro name masks a UDF name https://issues.apache.org/jira/browse/PIG-2248
PIG-2244 Macros cannot be passed relation names https://issues.apache.org/jira/browse/PIG-2244
PIG-1914 Support load/store JSON data in Pig https://issues.apache.org/jira/browse/PIG-1914
You may edit this subscription at: https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384
[jira] [Resolved] (PIG-3322) AVRO: AvroStorage give NPE on reading file with union as top level schema
[ https://issues.apache.org/jira/browse/PIG-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Palaniswamy resolved PIG-3322. - Resolution: Fixed Committed to trunk (0.12). Thanks Viraj and Cheolsoo.
Review Request: PIG-3342 Allow conditions in case statement
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/11613/ ---
Review request for pig.
Description --- Allows conditional expressions in the case statement. This addresses bug PIG-3342. https://issues.apache.org/jira/browse/PIG-3342
Diffs -
src/org/apache/pig/parser/AstPrinter.g c2abede
src/org/apache/pig/parser/AstValidator.g 2c6d4dc
src/org/apache/pig/parser/LogicalPlanGenerator.g 9375d60
src/org/apache/pig/parser/QueryParser.g 2b84c86
test/org/apache/pig/test/TestCase.java dbee495
Diff: https://reviews.apache.org/r/11613/diff/
Testing --- All unit tests pass.
Thanks, Cheolsoo Park
[jira] [Commented] (PIG-3342) Allow conditions in case statement
[ https://issues.apache.org/jira/browse/PIG-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673933#comment-13673933 ] Cheolsoo Park commented on PIG-3342:
Thanks Rohini for taking a look. Here is the RB request: https://reviews.apache.org/r/11613/
Allow conditions in case statement
Key: PIG-3342 URL: https://issues.apache.org/jira/browse/PIG-3342 Project: Pig Issue Type: Improvement Components: parser Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.12 Attachments: PIG-3342.patch
PIG-3268 added case statement support, but conditions are currently not allowed in when branches. For example,
{code}
CASE
WHEN i % 5 == 0 THEN '5n'
WHEN i % 5 == 1 THEN '5n+1'
WHEN i % 5 == 2 THEN '5n+2'
WHEN i % 5 == 3 THEN '5n+3'
ELSE '5n+4'
END
{code}
This is currently invalid, but it would be useful to allow it.
[jira] [Updated] (PIG-3346) New property that controls the number of combined splits
[ https://issues.apache.org/jira/browse/PIG-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3346: --- Attachment: PIG-3346.patch
The attached patch includes the following changes:
* Adds a new property {{pig.maxCombinedSplitNum}}. By default, it is set to Long.MAX_VALUE.
* Updates the logic of {{MapRedUtil.getCombinePigSplits()}} to take the number of combined splits into account.
* Adds a new test case to {{TestSplitCombine}}.
* Updates the documentation regarding the new property.
Tests done:
* ant test-commit
* ant test -Dtestcase=TestSplitCombine
Thanks!
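To make the effect of the two limits concrete, here is a minimal sketch of combining splits under both a size cap and a count cap. It is not the logic of MapRedUtil.getCombinePigSplits(), which is not shown in this thread; the class SplitCombiner, its greedy in-order packing, and the parameter names maxSize and maxNum are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; Pig's real split combination is more involved.
class SplitCombiner {
    // Greedily packs split sizes (in bytes) into groups, in input order.
    // The current group is closed when adding the next split would push it
    // past maxSize bytes, or when it already holds maxNum splits.
    static List<List<Long>> combine(List<Long> sizes, long maxSize, long maxNum) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : sizes) {
            boolean full = !current.isEmpty()
                && (currentSize + size > maxSize || current.size() >= maxNum);
            if (full) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            // A split larger than maxSize still gets a group of its own.
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(10L, 10L, 10L, 10L, 10L);
        // With no count cap (the stated default, Long.MAX_VALUE) only size matters.
        System.out.println(combine(sizes, 100L, Long.MAX_VALUE));
        // With a count cap of 2, no group holds more than two splits.
        System.out.println(combine(sizes, 100L, 2L));
    }
}
```

With many small inputs, a size cap alone lets one group absorb a very large number of splits; the count cap bounds that number regardless of how small the splits are.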
[jira] [Commented] (PIG-3329) RANK operator failed when working with SPLIT
[ https://issues.apache.org/jira/browse/PIG-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673936#comment-13673936 ] Johnny Zhang commented on PIG-3329: ---
[~xalan], are you working on this right now? I got a similar exception when I was working on another patch, so it would be very nice if I could understand how you will resolve this issue. Thanks a lot!
RANK operator failed when working with SPLIT
Key: PIG-3329 URL: https://issues.apache.org/jira/browse/PIG-3329 Project: Pig Issue Type: Bug Affects Versions: 0.11.1 Reporter: Redis Liu Assignee: Allan Avendaño Priority: Critical
input.txt: 1 2 3 4 5 6 7 8 9
script:
a = load 'input.txt' using PigStorage(' ') as (a:int, b:int, c:int);
SPLIT a into b if a 0, c if a 5;
d = RANK b;
dump d;
job will fail with error message:
java.lang.RuntimeException: Unable to read counter pig.counters.counter_4929375455335572575_-1
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PORank.addRank(PORank.java:161)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PORank.getNext(PORank.java:134)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:214)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:157)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
at org.apache.hadoop.mapred.Child$4.run(Child.java:275)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1340)
at org.apache.hadoop.mapred.Child.main(Child.java:269)
[jira] [Updated] (PIG-3346) New property that controls the number of combined splits
[ https://issues.apache.org/jira/browse/PIG-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3346: --- Status: Patch Available (was: Open)
[jira] [Comment Edited] (PIG-3345) Handle null in DateTime functions
[ https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674025#comment-13674025 ] Rohini Palaniswamy edited comment on PIG-3345 at 6/4/13 4:25 AM: -
Thanks Prashant. Added for all the udfs in testConversionBetweenDTAndString
was (Author: rohini): Thanks Prashant. Added for all the methods in testConversionBetweenDTAndString
Handle null in DateTime functions
Key: PIG-3345 URL: https://issues.apache.org/jira/browse/PIG-3345 Project: Pig Issue Type: Bug Affects Versions: 0.11.1 Reporter: Rohini Palaniswamy Assignee: Rohini Palaniswamy Fix For: 0.12 Attachments: PIG-3345-1.patch, PIG-3345-2.patch
An NPE is thrown in date time functions when a null value is passed.
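The kind of guard this issue asks for can be sketched in a few lines. This is an illustration of the "null in, null out" convention only, not the contents of the attached patches: the class NullSafeDateTime and the method getYear are hypothetical, and java.time is used here as a stand-in for the date library the Pig builtins actually depend on.

```java
import java.time.OffsetDateTime;

// Hypothetical example; not one of Pig's builtin DateTime UDFs.
class NullSafeDateTime {
    // Returns the year of an ISO-8601 timestamp with an offset,
    // or null when the input itself is null.
    static Integer getYear(String isoDateTime) {
        if (isoDateTime == null) {
            // Propagate the null instead of letting the parser throw
            // a NullPointerException.
            return null;
        }
        return OffsetDateTime.parse(isoDateTime).getYear();
    }

    public static void main(String[] args) {
        System.out.println(getYear("2013-06-04T04:25:00Z"));
        System.out.println(getYear(null));
    }
}
```

The guard covers only null inputs; a non-null but malformed string would still raise a parse exception in this sketch.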
[jira] [Updated] (PIG-3345) Handle null in DateTime functions
[ https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Palaniswamy updated PIG-3345: Attachment: PIG-3345-2.patch Thanks Prashant. Added for all the methods in testConversionBetweenDTAndString
[jira] [Commented] (PIG-3345) Handle null in DateTime functions
[ https://issues.apache.org/jira/browse/PIG-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674070#comment-13674070 ] Prashant Kommireddi commented on PIG-3345: -- LGTM +1 Thanks Rohini!