[jira] [Updated] (PIG-3526) Unions with Enums do not work with AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Adler updated PIG-3526:
--
Attachment: PIG-3526.patch

Patch for this issue.

Unions with Enums do not work with AvroStorage
--
Key: PIG-3526
URL: https://issues.apache.org/jira/browse/PIG-3526
Project: Pig
Issue Type: Bug
Components: internal-udfs
Affects Versions: 0.12.0
Reporter: Joseph Adler
Fix For: 0.12.1
Attachments: PIG-3526.patch

If you have an input schema with unions of enum types and nulls, AvroStorage can't read the data correctly. This patch will translate the enums to strings so that Pig can process them.

(Sorry for the short description and lack of a unit test; ran into this issue while working on a deadline for another project.)

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (PIG-3526) Unions with Enums do not work with AvroStorage
Joseph Adler created PIG-3526:
-
Summary: Unions with Enums do not work with AvroStorage
Key: PIG-3526
URL: https://issues.apache.org/jira/browse/PIG-3526
Project: Pig
Issue Type: Bug
Components: internal-udfs
Affects Versions: 0.12.0
Reporter: Joseph Adler
Fix For: 0.12.1
Attachments: PIG-3526.patch

If you have an input schema with unions of enum types and nulls, AvroStorage can't read the data correctly. This patch will translate the enums to strings so that Pig can process them.

(Sorry for the short description and lack of a unit test; ran into this issue while working on a deadline for another project.)
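The translation described in the ticket (enum symbols inside a nullable union become strings) can be sketched in plain Java. This is not the code from PIG-3526.patch: `EnumUnionSketch`, `Suit`, and `toPigValue` are hypothetical names, and a Java enum stands in for an Avro enum symbol.

```java
// Hypothetical sketch only; the real patch works on Avro objects inside AvroStorage.
final class EnumUnionSketch {
    // Stand-in for an Avro enum type.
    enum Suit { HEARTS, SPADES }

    // Convert a value read from a union of null and an enum into something
    // Pig can hold: null stays null, an enum symbol becomes its name.
    static Object toPigValue(Object unionValue) {
        if (unionValue == null) {
            return null;
        }
        if (unionValue instanceof Enum<?>) {
            return ((Enum<?>) unionValue).name();
        }
        // Any other union branch passes through unchanged.
        return unionValue;
    }

    public static void main(String[] args) {
        System.out.println(toPigValue(Suit.HEARTS));
        System.out.println(toPigValue(null));
    }
}
```

Handling null before the enum check keeps the null branch of the union intact instead of turning it into a string.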
[jira] [Commented] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag
[ https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788655#comment-13788655 ]

Joseph Adler commented on PIG-3377:
---

Working on this now...

New AvroStorage throws NPE when storing untyped map/array/bag
-
Key: PIG-3377
URL: https://issues.apache.org/jira/browse/PIG-3377
Project: Pig
Issue Type: Bug
Components: internal-udfs
Reporter: Cheolsoo Park
Assignee: Joseph Adler
Fix For: 0.12.1

The following example demonstrates the issue:
{code}
a = LOAD 'foo' AS (m:map[]);
STORE a INTO 'bar' USING AvroStorage();
{code}
This fails with the following error:
{code}
java.lang.NullPointerException
    at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceFieldSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:462)
    at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:335)
    at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:472)
{code}
Similarly, an untyped bag causes the following error:
{code}
Caused by: java.lang.NullPointerException
    at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:722)
    ...
    at org.apache.avro.Schema.getElementType(Schema.java:256)
    at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:491)
{code}
The problem is that AvroStorage cannot derive the output schema from an untyped map/bag/tuple. When the type is not defined, it should be assumed to be bytearray.
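The behaviour the ticket asks for (treat a missing element type as bytearray) could look roughly like the following. This is a toy sketch, not Pig's schema classes: `UntypedDefaultSketch` and `elementTypeOrDefault` are hypothetical names, and types are represented as plain strings.

```java
// Hypothetical sketch; Pig's real schema objects are not used here.
final class UntypedDefaultSketch {
    // Returns the element type to use for a map or bag declaration.
    // A declaration such as m:map[] carries no element type, which is
    // modelled here as null or an empty string.
    static String elementTypeOrDefault(String declaredElementType) {
        if (declaredElementType == null || declaredElementType.isEmpty()) {
            return "bytearray";
        }
        return declaredElementType;
    }

    public static void main(String[] args) {
        System.out.println(elementTypeOrDefault(null));
        System.out.println(elementTypeOrDefault("chararray"));
    }
}
```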
[jira] [Updated] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag
[ https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3377: -- Status: Patch Available (was: Open) New AvroStorage throws NPE when storing untyped map/array/bag - Key: PIG-3377 URL: https://issues.apache.org/jira/browse/PIG-3377 Project: Pig Issue Type: Bug Components: internal-udfs Reporter: Cheolsoo Park Assignee: Joseph Adler Fix For: 0.12.1 Attachments: PIG-3377.patch The following example demonstrates the issue: {code} a = LOAD 'foo' AS (m:map[]); STORE a INTO 'bar' USING AvroStorage(); {code} This fails with the following error: {code} java.lang.NullPointerException at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceFieldSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:462) at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:335) at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:472) {code} Similarly, untyped bag causes the following error: {code} Caused by: java.lang.NullPointerException at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:722) ... at org.apache.avro.Schema.getElementType(Schema.java:256) at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:491) {code} The problem is that AvroStorage cannot derive the output schema from untyped map/bag/tuple. When type is not defined, it should be assumed as bytearray. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag
[ https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3377: -- Attachment: PIG-3377.patch Patch for this issue (provides a meaningful error message) New AvroStorage throws NPE when storing untyped map/array/bag - Key: PIG-3377 URL: https://issues.apache.org/jira/browse/PIG-3377 Project: Pig Issue Type: Bug Components: internal-udfs Reporter: Cheolsoo Park Assignee: Joseph Adler Fix For: 0.12.1 Attachments: PIG-3377.patch The following example demonstrates the issue: {code} a = LOAD 'foo' AS (m:map[]); STORE a INTO 'bar' USING AvroStorage(); {code} This fails with the following error: {code} java.lang.NullPointerException at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceFieldSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:462) at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:335) at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:472) {code} Similarly, untyped bag causes the following error: {code} Caused by: java.lang.NullPointerException at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:722) ... at org.apache.avro.Schema.getElementType(Schema.java:256) at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:491) {code} The problem is that AvroStorage cannot derive the output schema from untyped map/bag/tuple. When type is not defined, it should be assumed as bytearray. -- This message was sent by Atlassian JIRA (v6.1#6144)
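The comment above says the patch provides a meaningful error message in place of the NullPointerException. A minimal sketch of that kind of fail-fast check follows; it is not taken from PIG-3377.patch, and `SchemaCheckSketch`, `requireElementType`, and the message wording are assumptions made for illustration.

```java
// Hypothetical sketch of failing early with a descriptive message.
final class SchemaCheckSketch {
    // Returns the element type, or throws an exception that names the
    // offending field instead of letting a null propagate further.
    static String requireElementType(String fieldName, String elementType) {
        if (elementType == null) {
            throw new IllegalArgumentException(
                "Cannot derive an Avro schema for field '" + fieldName
                + "': the element type is not defined");
        }
        return elementType;
    }

    public static void main(String[] args) {
        System.out.println(requireElementType("m", "chararray"));
    }
}
```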
[jira] [Commented] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag
[ https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711378#comment-13711378 ]

Joseph Adler commented on PIG-3377:
---

Want to assign this to me? I can take a look at this and submit a patch.

New AvroStorage throws NPE when storing untyped map/array/bag
-
Key: PIG-3377
URL: https://issues.apache.org/jira/browse/PIG-3377
Project: Pig
Issue Type: Bug
Components: internal-udfs
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Fix For: 0.12

The following example demonstrates the issue:
{code}
a = LOAD 'foo' AS (m:map[]);
STORE a INTO 'bar' USING AvroStorage();
{code}
This fails with the following error:
{code}
java.lang.NullPointerException
    at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceFieldSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:462)
    at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:335)
    at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:472)
{code}
Similarly, an untyped bag causes the following error:
{code}
Caused by: java.lang.NullPointerException
    at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:722)
    ...
    at org.apache.avro.Schema.getElementType(Schema.java:256)
    at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:491)
{code}
The problem is that AvroStorage cannot derive the output schema from an untyped map/bag/tuple. When the type is not defined, it should be assumed to be bytearray.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Review Request: PIG-3015 Rewrite of AvroStorage
On March 19, 2013, 4:40 p.m., Jonathan Coveney wrote:
src/org/apache/pig/builtin/AvroStorage.java, line 352
https://reviews.apache.org/r/8104/diff/4/?file=244837#file244837line352

I realize using Long's compareTo is convenient, but this seems like unnecessary boxing. Why not just compare them directly? I realize this isn't performance-critical code, it just stuck out to me, since you could just do a instead...

For sorting, you need to implement compare (which tests for <, ==, and >). I switched to com.google.common.primitives.Longs.compare

On March 19, 2013, 4:40 p.m., Jonathan Coveney wrote:
src/org/apache/pig/impl/util/avro/AvroTupleWrapper.java, line 66
https://reviews.apache.org/r/8104/diff/4/?file=244846#file244846line66

May want to throw an UnsupportedOperationException instead, as if this is being called, it's a more fundamental issue with Pig, separate from write-related issues.

Stuck with the exceptions in the existing Tuple interface... but yes, that would be more logical

On March 19, 2013, 4:40 p.m., Jonathan Coveney wrote:
src/org/apache/pig/impl/util/avro/AvroTupleWrapper.java, line 84
https://reviews.apache.org/r/8104/diff/4/?file=244846#file244846line84

Shouldn't this throw an error? Or is avroObject.put() doing something I don't expect, perhaps being 1-indexed instead of 0-indexed?

I think that write is never called; in the current version it just throws an error

- Joseph

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/#review18077
---

On Jan. 4, 2013, 7:22 p.m., Joseph Adler wrote:

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/
---

(Updated Jan. 4, 2013, 7:22 p.m.)

Review request for pig and Cheolsoo Park.

Description
---

The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated.
(One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni.

This is the latest version of the patch, complete with test cases and TrevniStorage. (Test cases for TrevniStorage are still missing.)

This addresses bug PIG-3015.
https://issues.apache.org/jira/browse/PIG-3015

Diffs
-

  .eclipse.templates/.classpath c7b83b8
  ivy.xml 70e8d50
  ivy/libraries.properties 7b07c7e
  src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION
  src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION
  src/org/apache/pig/impl/util/avro/AvroArrayReader.java PRE-CREATION
  src/org/apache/pig/impl/util/avro/AvroBagWrapper.java PRE-CREATION
  src/org/apache/pig/impl/util/avro/AvroMapWrapper.java PRE-CREATION
  src/org/apache/pig/impl/util/avro/AvroRecordReader.java PRE-CREATION
  src/org/apache/pig/impl/util/avro/AvroRecordWriter.java PRE-CREATION
  src/org/apache/pig/impl/util/avro/AvroStorageDataConversionUtilities.java PRE-CREATION
  src/org/apache/pig/impl/util/avro/AvroStorageSchemaConversionUtilities.java PRE-CREATION
  src/org/apache/pig/impl/util/avro/AvroTupleWrapper.java PRE-CREATION
  test/commit-tests 5081fbc
  test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/identity_blank_first_args.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/identity_just_ao2.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/namesWithDoubleColons.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/recursive_tests.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/trevni_to_avro.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/code/pig/trevni_to_trevni.pig PRE-CREATION
  test/org/apache/pig/builtin/avro/data/json/arrays.json PRE-CREATION
  test/org/apache/pig/builtin/avro/data/json/arraysAsOutputByPig.json PRE-CREATION
  test/org/apache/pig/builtin/avro/data/json/recordWithRepeatedSubRecords.json PRE-CREATION
  test/org/apache/pig/builtin/avro/data/json/records.json PRE-CREATION
  test/org/apache
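The first review comment in this thread concerns comparing two long values for sorting without boxing them. The reply says the patch switched to Guava's `Longs.compare`; the sketch below uses the JDK's `Long.compare` (Java 7 and later) as a stand-in with the same negative/zero/positive contract. `LongCompareSketch`, `Entry`, and the idea of sorting by modification time are hypothetical and not taken from AvroStorage.java.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a comparator over primitive long fields.
final class LongCompareSketch {
    // Toy record: a path plus a modification time in milliseconds.
    static final class Entry {
        final String path;
        final long modified;

        Entry(String path, long modified) {
            this.path = path;
            this.modified = modified;
        }
    }

    // Orders entries by modification time; neither long is boxed.
    static final Comparator<Entry> BY_MODIFIED = new Comparator<Entry>() {
        @Override
        public int compare(Entry a, Entry b) {
            return Long.compare(a.modified, b.modified);
        }
    };

    // Returns the paths ordered from oldest to newest.
    static List<String> sortedPaths(List<Entry> entries) {
        List<Entry> copy = new ArrayList<Entry>(entries);
        Collections.sort(copy, BY_MODIFIED);
        List<String> paths = new ArrayList<String>();
        for (Entry e : copy) {
            paths.add(e.path);
        }
        return paths;
    }
}
```

A three-way compare is used rather than casting `a.modified - b.modified` to int, because that subtraction can overflow or truncate and produce the wrong sign.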
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015-20May2013.diff I'm getting confused by the names of the diffs. This one is a diff from trunk, as of now. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-11.patch, PIG-3015-12.patch, PIG-3015-20May2013.diff, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3330) please fix the change that created a dependency on org.apache.pig.impl.PigImplConstants
Joseph Adler created PIG-3330: - Summary: please fix the change that created a dependency on org.apache.pig.impl.PigImplConstants Key: PIG-3330 URL: https://issues.apache.org/jira/browse/PIG-3330 Project: Pig Issue Type: Bug Reporter: Joseph Adler Priority: Blocker I can't build Pig from trunk because several source files (including org.apache.pig.Main.java) require org.apache.pig.impl.PigImplConstants, but that class isn't available. I'm assuming someone left out a file on a recent commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015-12.patch Incremental patch that adds support for push down projections, fixed some bugs with options, gets all the test cases working again Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-11.patch, PIG-3015-12.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645823#comment-13645823 ]

Joseph Adler commented on PIG-3015:
---

[~rohini]: Great question. I definitely implemented that interface in an earlier version; I'm not sure what happened to the code. Let me go through the patches to figure that one out.

Rewrite of AvroStorage
--
Key: PIG-3015
URL: https://issues.apache.org/jira/browse/PIG-3015
Project: Pig
Issue Type: Improvement
Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-11.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig

The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage).

I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645826#comment-13645826 ]

Joseph Adler commented on PIG-3015:
---

[~rohini] OK, looks like I implemented the helper functions, and implemented the functionality for Trevni, but didn't implement it for AvroStorage. Will follow up with a patch.

Rewrite of AvroStorage
--
Key: PIG-3015
URL: https://issues.apache.org/jira/browse/PIG-3015
Project: Pig
Issue Type: Improvement
Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-11.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig

The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage).

I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632347#comment-13632347 ]

Joseph Adler commented on PIG-3015:
---

Sorry to have taken so long to reply. I map any Pig type to a union of an Avro Type and Null. Here are the type mappings that I implemented:

Bag -> Array
Big Chararray -> String
Byte Array -> Bytes
Chararray -> String
Datetime -> Long
Double -> Double
Float -> Float
Integer -> Int
Map -> Map
Null -> Null
Tuple -> Record

Byte, Error, Generic Writable, Internal Map, Unknown aren't mapped to anything yet. Do we need to store these as well?

Rewrite of AvroStorage
--
Key: PIG-3015
URL: https://issues.apache.org/jira/browse/PIG-3015
Project: Pig
Issue Type: Improvement
Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-11.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig

The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage).

I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
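The type mappings listed in the comment above can be written down as a lookup table. The sketch below is an illustration only, not the code in AvroStorageSchemaConversionUtilities: `PigToAvroTypeSketch` is a hypothetical name, the lowercase keys are labels chosen for this sketch, and placing "null" first in the union is an assumption (the comment only says each Pig type maps to a union of an Avro type and Null).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the Pig-to-Avro type table from the comment.
final class PigToAvroTypeSketch {
    static final Map<String, String> PIG_TO_AVRO = new LinkedHashMap<String, String>();
    static {
        PIG_TO_AVRO.put("bag", "array");
        PIG_TO_AVRO.put("bigchararray", "string");
        PIG_TO_AVRO.put("bytearray", "bytes");
        PIG_TO_AVRO.put("chararray", "string");
        PIG_TO_AVRO.put("datetime", "long");
        PIG_TO_AVRO.put("double", "double");
        PIG_TO_AVRO.put("float", "float");
        PIG_TO_AVRO.put("int", "int");
        PIG_TO_AVRO.put("map", "map");
        PIG_TO_AVRO.put("null", "null");
        PIG_TO_AVRO.put("tuple", "record");
    }

    // Builds the JSON text of a nullable union for a Pig type that maps to an
    // Avro primitive. Complex types (array, map, record) need a full nested
    // schema, which this sketch does not attempt.
    static String nullableUnion(String pigType) {
        String avro = PIG_TO_AVRO.get(pigType);
        if (avro == null) {
            throw new IllegalArgumentException("No Avro mapping for Pig type: " + pigType);
        }
        if (avro.equals("array") || avro.equals("map")
                || avro.equals("record") || avro.equals("null")) {
            throw new UnsupportedOperationException(
                "Sketch only handles primitive Avro types, got: " + avro);
        }
        return "[\"null\", \"" + avro + "\"]";
    }
}
```

Types the comment lists as unmapped (Byte, Error, Generic Writable, Internal Map, Unknown) are absent from the table, so looking them up fails with an explicit message.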
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605579#comment-13605579 ]

Joseph Adler commented on PIG-3015:
---

I like the -tagsource option idea. Should we allow the user to provide a name for the tag source field? (If we picked a name like tagSource, and there was already a field in the Avro schema called tagSource, I'm concerned that we'd have to deal with that conflict. I think it would be cleaner to let the end user resolve the naming issue.)

Rewrite of AvroStorage
--
Key: PIG-3015
URL: https://issues.apache.org/jira/browse/PIG-3015
Project: Pig
Issue Type: Improvement
Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-11.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig

The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage).

I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: with_dates.pig Missing test file (not a patch) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-10.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581460#comment-13581460 ]

Joseph Adler commented on PIG-3015:
---

[~russell.jurney]: Reading through the stack trace that you posted, it does not look like the null pointer exception was occurring in TrevniStorage. (It looks like it was occurring in the Tokenizer.) Does your script work correctly if you use it with another format, like PigStorage?

Rewrite of AvroStorage
--
Key: PIG-3015
URL: https://issues.apache.org/jira/browse/PIG-3015
Project: Pig
Issue Type: Improvement
Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-doc.patch, TestInput.java, Test.java

The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage).

I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015-9.patch Added support for Pig dates to AvroStorage and TrevniStorage (they're translated to longs when storing values). Also added a new test case. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc.patch, TestInput.java, Test.java The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
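The comment above says Pig dates are translated to longs when stored. A minimal sketch of such a round trip follows. It assumes the long is milliseconds since the Unix epoch, which the message does not state, and it uses `java.time.Instant` as a stand-in for Pig's datetime type. `DateAsLongSketch`, `toStored`, and `fromStored` are hypothetical names.

```java
import java.time.Instant;

// Hypothetical sketch; the epoch-millisecond encoding is an assumption.
final class DateAsLongSketch {
    // Value written to the long field when storing.
    static long toStored(Instant when) {
        return when.toEpochMilli();
    }

    // Value rebuilt from the long field when loading.
    static Instant fromStored(long millis) {
        return Instant.ofEpochMilli(millis);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2013-02-19T12:00:00Z");
        System.out.println(fromStored(toStored(t)).equals(t));
    }
}
```

Because the stored value is a bare long, a reader of the file sees a number rather than a date unless the schema or the loading script says otherwise.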
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577185#comment-13577185 ]

Joseph Adler commented on PIG-3015:
---

I think the method setLocation for AvroStorage is marked as final. Does anyone object to removing the final modifier?

Rewrite of AvroStorage
--
Key: PIG-3015
URL: https://issues.apache.org/jira/browse/PIG-3015
Project: Pig
Issue Type: Improvement
Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-8.patch, TestInput.java, Test.java

The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage).

I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015-8.patch Added description of AvroStorage and TrevniStorage to documentation. (Not finished editing yet, but wanted to share what I'd written so far.) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-8.patch, TestInput.java, Test.java The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13569190#comment-13569190 ] Joseph Adler commented on PIG-3015: --- Let me know what help you need. I can work on the documentation as well. Is early next week enough time? (Also, check out Avro-1241. I couldn't get adequate performance without it.) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, TestInput.java, Test.java The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2266) bug with input file joining optimization in Pig
[ https://issues.apache.org/jira/browse/PIG-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564924#comment-13564924 ] Joseph Adler commented on PIG-2266: --- Thanks for adding this fix! bug with input file joining optimization in Pig --- Key: PIG-2266 URL: https://issues.apache.org/jira/browse/PIG-2266 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.9.0, 0.10.0 Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-2266.patch In src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java, the function hasTooManyInputFiles instantiates a LoadFunc, then calls setLocation before calling setUDFContextSignature. This is inconsistent with the LoadFunc documentation (see http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/LoadFunc.html#setUDFContextSignature(java.lang.String)). (We've written UDFs that assume that setUDFContextSignature is called first.) I think you can fix this by adding loader.setUDFContextSignature(ld.getSignature()); before loader.setLocation(location, job); -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
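The call-order problem described in PIG-2266 can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: SignatureAwareLoader is a hypothetical stand-in, not Pig's LoadFunc, and the static map only mimics per-signature state of the kind a UDF might keep in UDFContext.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a loader that keys its state by signature.
// It is not Pig's API; it only mimics the call-order dependency.
class SignatureAwareLoader {
    // Stand-in for per-signature properties (loosely like UDFContext).
    static final Map<String, String> CONTEXT = new HashMap<>();

    private String signature;

    void setUDFContextSignature(String signature) {
        this.signature = signature;
    }

    void setLocation(String location) {
        if (signature == null) {
            throw new IllegalStateException(
                "setLocation called before setUDFContextSignature");
        }
        CONTEXT.put(signature, location);
    }
}

class SignatureOrderDemo {
    // Mirrors the proposed fix: set the signature first, then the location.
    static String configure(String signature, String location) {
        SignatureAwareLoader loader = new SignatureAwareLoader();
        loader.setUDFContextSignature(signature);
        loader.setLocation(location);
        return SignatureAwareLoader.CONTEXT.get(signature);
    }

    // Mirrors the buggy order; returns true if the loader rejected the call.
    static boolean failsWithoutSignature(String location) {
        try {
            new SignatureAwareLoader().setLocation(location);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(configure("sig-1", "/data/input"));
        System.out.println(failsWithoutSignature("/data/input"));
    }
}
```

A loader written this way only works when the caller sets the signature first, which is why the order inside hasTooManyInputFiles matters for UDFs that rely on it.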
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564926#comment-13564926 ] Joseph Adler commented on PIG-3015: --- Sorry, didn't mean to submit a patch with Avro 1.7.4-SNAPSHOT. I added a couple optimizations to Trevni so that the performance was comparable with Avro. (I'll submit that patch to Avro.) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, TestInput.java, Test.java The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015-6.patch Some additional bug fixes:
- Now correctly identifies recursive schema definitions
- TrevniStorage was not correctly flushing output buffers before closing, causing files to be corrupted
Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, TestInput.java, Test.java The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3120) setStoreFuncUDFContextSignature called with null signature
Joseph Adler created PIG-3120: - Summary: setStoreFuncUDFContextSignature called with null signature Key: PIG-3120 URL: https://issues.apache.org/jira/browse/PIG-3120 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.12 Reporter: Joseph Adler Priority: Critical Fix For: 0.12 I'm currently working on PIG-3015 and am having trouble passing the UDFContextSignature to the store func. It looks like the store func on the head end is being set to a non-null value, but a null value is being passed to setStoreFuncUDFContextSignature on the back end. I'm opening this ticket to track this issue; I'll follow up with a reproducible test case when I have a clean one. I suspect this problem occurs when running on a real cluster, but may not occur in the standard unit tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3120) setStoreFuncUDFContextSignature called with null signature
[ https://issues.apache.org/jira/browse/PIG-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553352#comment-13553352 ] Joseph Adler commented on PIG-3120: --- OK, tracked down the issue. It looks like the UDFContextSignature is not propagated if there is a LIMIT statement in the Pig code. Specifically, in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.LimitAdjuster.adjust, it looks like Pig creates a new POStore object but does not copy the signature. Here is the offending code:
{code}
// this is line 132...
POStore st = new POStore(new OperatorKey(scope,nig.getNextNodeId(scope)));
st.setSFile(oldSpec);
st.setIsTmpStore(oldIsTmpStore);
st.setSchema(((POStore)mpLeaf).getSchema());
limitAdjustMROp.reducePlan.addAsLeaf(st);
{code}
This is easily fixable by inserting this statement at line 137:
{code}
st.setSignature(((POStore)mpLeaf).getSignature());
{code}
I'll follow up with a patch for this issue.
setStoreFuncUDFContextSignature called with null signature -- Key: PIG-3120 URL: https://issues.apache.org/jira/browse/PIG-3120 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.12 Reporter: Joseph Adler Priority: Critical Fix For: 0.12 I'm currently working on PIG-3015 and am having trouble passing the UDFContextSignature to the store func. It looks like the store func on the head end is being set to a non-null value, but a null value is being passed to setStoreFuncUDFContextSignature on the back end. I'm opening this ticket to track this issue; I'll follow up with a reproducible test case when I have a clean one. I suspect this problem occurs when running on a real cluster, but may not occur in the standard unit tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
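The PIG-3120 bug pattern (a replacement operator built field by field, with one field forgotten) can be sketched without any Pig classes. StoreOp below is a hypothetical, heavily simplified stand-in for POStore, and LimitAdjusterSketch is not the real LimitAdjuster; only the missing-copy mistake and the one-line fix are modeled.

```java
// Hypothetical, simplified stand-in for POStore; only the fields needed
// to illustrate the bug are modeled.
class StoreOp {
    private String file;
    private String signature;

    void setSFile(String file) { this.file = file; }
    String getSFile() { return file; }
    void setSignature(String signature) { this.signature = signature; }
    String getSignature() { return signature; }
}

class LimitAdjusterSketch {
    // Builds the replacement store the way the offending code did:
    // the file is copied, but the signature is left null.
    static StoreOp replaceWithoutSignature(StoreOp old) {
        StoreOp st = new StoreOp();
        st.setSFile(old.getSFile());
        return st;
    }

    // Same as above, plus the one-line fix proposed in the comment.
    static StoreOp replaceWithSignature(StoreOp old) {
        StoreOp st = replaceWithoutSignature(old);
        st.setSignature(old.getSignature());
        return st;
    }

    public static void main(String[] args) {
        StoreOp old = new StoreOp();
        old.setSFile("out");
        old.setSignature("sig-1");
        System.out.println(replaceWithoutSignature(old).getSignature());
        System.out.println(replaceWithSignature(old).getSignature());
    }
}
```

In this sketch the first replacement carries a null signature to whatever consumes it later, which matches the symptom reported on the back end.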
[jira] [Updated] (PIG-3120) setStoreFuncUDFContextSignature called with null signature
[ https://issues.apache.org/jira/browse/PIG-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3120: -- Status: Patch Available (was: Open) setStoreFuncUDFContextSignature called with null signature -- Key: PIG-3120 URL: https://issues.apache.org/jira/browse/PIG-3120 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.12 Reporter: Joseph Adler Priority: Critical Fix For: 0.12 I'm currently working on PIG-3015 and am having trouble passing the UDFContextSignature to the store func. It looks like the store func on the head end is being set to a non-null value, but a null value is being passed to setStoreFuncUDFContextSignature on the back end. I'm opening this ticket to track this issue; I'll follow up with a reproducible test case when I have a clean one. I suspect this problem occurs when running on a real cluster, but may not occur in the standard unit tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3120) setStoreFuncUDFContextSignature called with null signature
[ https://issues.apache.org/jira/browse/PIG-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3120: -- Status: Open (was: Patch Available) setStoreFuncUDFContextSignature called with null signature -- Key: PIG-3120 URL: https://issues.apache.org/jira/browse/PIG-3120 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.12 Reporter: Joseph Adler Priority: Critical Fix For: 0.12 I'm currently working on PIG-3015 and am having trouble passing the UDFContextSignature to the store func. It looks like the store func on the head end is being set to a non-null value, but a null value is being passed to setStoreFuncUDFContextSignature on the back end. I'm opening this ticket to track this issue; I'll follow up with a reproducible test case when I have a clean one. I suspect this problem occurs when running on a real cluster, but may not occur in the standard unit tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3120) setStoreFuncUDFContextSignature called with null signature
[ https://issues.apache.org/jira/browse/PIG-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3120: -- Attachment: PIG-3120.patch This patch resolves an issue with UDF StoreFunc signatures when using LIMIT statements setStoreFuncUDFContextSignature called with null signature -- Key: PIG-3120 URL: https://issues.apache.org/jira/browse/PIG-3120 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.12 Reporter: Joseph Adler Priority: Critical Fix For: 0.12 Attachments: PIG-3120.patch I'm currently working on PIG-3015 and am having trouble passing the UDFContextSignature to the store func. It looks like the store func on the head end is being set to a non-null value, but a null value is being passed to setStoreFuncUDFContextSignature on the back end. I'm opening this ticket to track this issue; I'll follow up with a reproducible test case when I have a clean one. I suspect this problem occurs when running on a real cluster, but may not occur in the standard unit tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551502#comment-13551502 ] Joseph Adler commented on PIG-3015: --- Just got bitten by PIG-2266 while doing some performance testing with this ticket. I'm going to add that fix to this patch so that AvroStorage and TrevniStorage actually work. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, TestInput.java, Test.java The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13546164#comment-13546164 ] Joseph Adler commented on PIG-3015: --- Hi Cheolsoo: What size file are you using? You can configure the sync interval with the parameter avro.mapred.sync.interval (defined in org.apache.avro.mapred.AvroOutputFormat), and implemented in my latest patch (the one from last week). -- Joe Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, Test.tar.gz The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
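For readers unfamiliar with the sync interval setting mentioned in this comment, here is a minimal sketch of how a storage function might read and validate it. This is not the code from the patch. java.util.Properties stands in for the Hadoop job configuration, and the default (64000) and the accepted range (32 to 2^30) are assumptions modeled on Avro's data file writer; check them against the Avro version in use.

```java
import java.util.Properties;

class SyncIntervalConfig {
    // Property name quoted in the comment above.
    static final String SYNC_INTERVAL_KEY = "avro.mapred.sync.interval";

    // Assumed default and bounds, in bytes; verify against your Avro version.
    static final int DEFAULT_SYNC_INTERVAL = 64000;
    static final int MIN_SYNC_INTERVAL = 32;
    static final int MAX_SYNC_INTERVAL = 1 << 30;

    // Returns the configured sync interval, or the default when unset.
    static int syncInterval(Properties conf) {
        String raw = conf.getProperty(SYNC_INTERVAL_KEY);
        if (raw == null) {
            return DEFAULT_SYNC_INTERVAL;
        }
        int value = Integer.parseInt(raw.trim());
        if (value < MIN_SYNC_INTERVAL || value > MAX_SYNC_INTERVAL) {
            throw new IllegalArgumentException(
                "Invalid sync interval: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(SYNC_INTERVAL_KEY, "1048576");
        System.out.println(syncInterval(conf));
    }
}
```

A larger interval means fewer sync markers and fewer split points per file, which is why the file size matters when tuning it.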
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015-5.patch Added fixes for compression (and other metadata) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: (was: PIG-3015.patch) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: (was: PIG-3015-5.patch) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015-5.patch Oops, this one contains the changes. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Review Request: PIG-3015 Rewrite of AvroStorage
/recordsWithDoubleUnderscores.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/simpleRecordsTrevni.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/testDirectory.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/testDirectoryCounts.avsc PRE-CREATION
test/unit-tests 7cede06

Diff: https://reviews.apache.org/r/8104/diff/

Testing
---

Thanks,
Joseph Adler
Re: Review Request: PIG-3015 Rewrite of AvroStorage
/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/simpleRecordsTrevni.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/testDirectory.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/testDirectoryCounts.avsc PRE-CREATION
test/unit-tests 7cede06

Diff: https://reviews.apache.org/r/8104/diff/

Testing
---

Thanks,
Joseph Adler
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544248#comment-13544248 ] Joseph Adler commented on PIG-3015: --- Hi Russ, I think you're right... it looks like you could do something like this in AvroRecordReader.nextKeyValue:
{code}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
  if (reader.pastSync(end)) {
    return false;
  }
  try {
    currentRecord = reader.next(new GenericData.Record(schema));
  } catch (NoSuchElementException e) {
    return false;
  } catch (IOException ioe) {
    reader.sync(reader.tell()+1);
    throw ioe;
  }
  return true;
}
{code}
Let me test this out to make sure it runs correctly on uncorrupted files. Would you mind creating a corrupted test file that I can use for testing?
Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
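The recovery idea in the nextKeyValue comment (on a read error, jump to the next sync point, then surface the error so the caller can decide whether to continue) can be tried out on an in-memory model. This is a sketch under stated assumptions, not Avro's reader: BlockReader is hypothetical, a null block stands for a corrupted block, and block boundaries stand for sync markers.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical in-memory model of a block-structured file. A null block
// stands for a corrupted block; block boundaries stand for sync markers.
class BlockReader {
    private final List<List<String>> blocks;
    private int block = 0;
    private int offset = 0;

    BlockReader(List<List<String>> blocks) {
        this.blocks = blocks;
    }

    boolean hasNext() {
        // Skip exhausted blocks so block/offset point at the next record.
        while (block < blocks.size()
                && blocks.get(block) != null
                && offset >= blocks.get(block).size()) {
            block++;
            offset = 0;
        }
        return block < blocks.size();
    }

    // Returns the next record. On a corrupted block, moves to the next
    // sync point before throwing so that the caller can keep reading.
    String next() throws IOException {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        List<String> current = blocks.get(block);
        if (current == null) {
            block++;      // plays the role of reader.sync(reader.tell()+1)
            offset = 0;
            throw new IOException("corrupted block");
        }
        return current.get(offset++);
    }
}

class SkipCorruptBlocksDemo {
    // Sample input: two good blocks with a corrupted block between them.
    static List<List<String>> sample() {
        List<List<String>> blocks = new ArrayList<>();
        blocks.add(Arrays.asList("a", "b"));
        blocks.add(null);
        blocks.add(Arrays.asList("c"));
        return blocks;
    }

    // Reads everything readable; badBlocks[0] counts the skipped blocks.
    static List<String> readAll(BlockReader reader, int[] badBlocks) {
        List<String> out = new ArrayList<>();
        while (reader.hasNext()) {
            try {
                out.add(reader.next());
            } catch (IOException e) {
                badBlocks[0]++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] bad = new int[1];
        System.out.println(readAll(new BlockReader(sample()), bad));
        System.out.println(bad[0]);
    }
}
```

Advancing before rethrowing is the important part: without it, a caller that swallows the exception and retries would hit the same corrupted position forever.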
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543455#comment-13543455 ] Joseph Adler commented on PIG-3015: --- Hi Cheolsoo, You're totally right; I don't check the compression properties. I know that the Avro mapred library does check those parameters (org.apache.avro.mapred.AvroOutputFormat), but I don't use that output format. Fixing and testing, will follow up with a patch. -- Joe Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3059) Global configurable minimum 'bad record' thresholds
[ https://issues.apache.org/jira/browse/PIG-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13542718#comment-13542718 ] Joseph Adler commented on PIG-3059: --- Sorry to take so long to get back to this. It was a long break from work... Thanks so much for taking this over. I like the way you've implemented this. Global configurable minimum 'bad record' thresholds --- Key: PIG-3059 URL: https://issues.apache.org/jira/browse/PIG-3059 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Cheolsoo Park Fix For: 0.12 Attachments: avro_test_files-2.tar.gz, PIG-3059-2.patch, PIG-3059.patch See PIG-2614. Pig dies when one record in a LOAD of a billion records fails to parse. This is almost certainly not the desired behavior. elephant-bird and some other storage UDFs have minimum thresholds in terms of percent and count that must be exceeded before a job will fail outright. We need these limits to be configurable for Pig, globally. I've come to realize what a major problem Pig's crashing on bad records is for new Pig users. I believe this feature can greatly improve Pig. An example of a config would look like: pig.storage.bad.record.threshold=0.01 pig.storage.bad.record.min=100 A thorough discussion of this issue is available here: http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
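One possible reading of the two settings in the PIG-3059 description is sketched below: a job fails only when the number of bad records exceeds the minimum count and the bad-record ratio also exceeds the threshold. BadRecordTracker is a hypothetical class written for illustration; it is not taken from the PIG-3059 patches, and the actual semantics there may differ.

```java
// Hypothetical tracker illustrating one reading of the two settings:
//   pig.storage.bad.record.threshold  (maximum tolerated ratio)
//   pig.storage.bad.record.min        (bad records tolerated regardless)
class BadRecordTracker {
    private final double threshold;
    private final long minErrors;
    private long total;
    private long bad;

    BadRecordTracker(double threshold, long minErrors) {
        this.threshold = threshold;
        this.minErrors = minErrors;
    }

    // Call once per input record; ok is false when parsing failed.
    void record(boolean ok) {
        total++;
        if (!ok) {
            bad++;
        }
    }

    // Fail only when both the count and the ratio limits are exceeded.
    boolean shouldFail() {
        if (bad <= minErrors) {
            return false;
        }
        return (double) bad / total > threshold;
    }

    // Convenience helper: evaluates the rule for given totals.
    static boolean wouldFail(double threshold, long minErrors,
                             long totalRecords, long badRecords) {
        BadRecordTracker tracker = new BadRecordTracker(threshold, minErrors);
        for (long i = 0; i < totalRecords; i++) {
            tracker.record(i >= badRecords);
        }
        return tracker.shouldFail();
    }

    public static void main(String[] args) {
        // 150 bad records out of a million: above the minimum count but far
        // below the 1% ratio, so the job keeps running.
        System.out.println(wouldFail(0.01, 100, 1_000_000, 150));
    }
}
```

Requiring both conditions keeps small inputs from failing on a handful of bad records while still stopping large jobs whose input is mostly unparseable.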
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: (was: PIG-3015.patch) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Review Request: PIG-3015 Rewrite of AvroStorage
/recordsOfArrays.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithDoubleUnderscores.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION
test/org/apache/pig/builtin/avro/schema/simpleRecordsTrevni.avsc PRE-CREATION
test/unit-tests 7cede06

Diff: https://reviews.apache.org/r/8104/diff/

Testing
---

Thanks,
Joseph Adler
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: (was: PIG-3015.patch) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015.patch Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13534212#comment-13534212 ] Joseph Adler commented on PIG-3015: --- My apologies; forgot to add those to the patch. Replaced the patch version. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Review Request: PIG-3015 Rewrite of AvroStorage
test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithDoubleUnderscores.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/simpleRecordsTrevni.avsc PRE-CREATION test/unit-tests 7cede06 Diff: https://reviews.apache.org/r/8104/diff/ Testing --- Thanks, Joseph Adler
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13529293#comment-13529293 ] Joseph Adler commented on PIG-3015: --- Ivy should be able to pull the jar from a maven repo. Do you need to build your own Avro jar from source? Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: (was: PIG-3015.patch) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015.patch Added test cases for TrevniStorage (and made sure the test cases all pass) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Description: The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. was: The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. 
In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage). I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13510629#comment-13510629 ] Joseph Adler commented on PIG-3015: ---

Hi Johannes,

As you probably know, the Avro specification limits the set of valid characters in names (see http://avro.apache.org/docs/current/spec.html#Names). Names must
- start with [A-Za-z_]
- subsequently contain only [A-Za-z0-9_]

So double colons aren't allowed. PIG-2684 proposes using namespaces as the solution. I think that's a poor choice; namespaces are often used for other purposes. Specifically, namespaces are essential if you are writing complicated data processing software that processes multiple types of Avro-serialized objects. In my experience, the Avro schema and protocol compilers produce much better, more usable code if you use namespaces.

There are two good workarounds:
- The Pig user can rename variables in a bag before storing the bag using AvroStorage
- The Pig user can manually specify the output schema before storing the bag with AvroStorage

So, here's a specific suggestion:
- By default, throw an exception if the Pig schema contains a name with a double colon and the user does not specify an output schema
- Add an option to AvroStorage to transform double colons to something else. (Maybe double underscores? Maybe storing them in the namespace?)

What do you think?

Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-2684) :: in field name causes AvroStorage to fail
[ https://issues.apache.org/jira/browse/PIG-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13510645#comment-13510645 ] Joseph Adler commented on PIG-2684: ---

I'm addressing this right now in PIG-3015. This isn't a bug; it's just a mismatch between the set of names that Avro allows and the names that Pig allows. (As a side note, there are good reasons why only some variable names are allowed in Avro: limiting the characters in names allows Avro to generate code to process Avro objects in a number of different languages. Colons in variable names would make it difficult to do this.)

First, there are two workarounds for this problem right now:
- The user can rename variables before storing the bag
- The user can manually specify the output schema

Second, I don't like the idea of using namespaces for this. Namespaces are important for specific record types in Avro; they are translated by the protocol and schema compilers into package names for Java classes.

To make AvroStorage easier to use, I think it would make sense to add an option to AvroStorage to translate names with colons in some reasonable way: maybe translating the double colons to double underscores.

:: in field name causes AvroStorage to fail --- Key: PIG-2684 URL: https://issues.apache.org/jira/browse/PIG-2684 Project: Pig Issue Type: Bug Components: piggybank Reporter: Fabian Alenius

There appears to be a bug in AvroStorage which causes it to fail when there are field names that contain ::

For example, the following will fail:

data = load 'test.txt' as (one, two);
grp = GROUP data by (one, two);
result = foreach grp generate FLATTEN(group);
store result into 'test.avro' using org.apache.pig.piggybank.storage.avro.AvroStorage();

ERROR 2999: Unexpected internal error. Illegal character in: group::one

While the following will succeed:

data = load 'test.txt' as (one, two);
grp = GROUP data by (one, two);
result = foreach grp generate FLATTEN(group) as (one,two);
store result into 'test.avro' using org.apache.pig.piggybank.storage.avro.AvroStorage();

Here is a minimal test case:

data = load 'test.txt' as (one::two, three);
store data into 'test.avro' using org.apache.pig.piggybank.storage.avro.AvroStorage();
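The renaming option discussed in the comment above could look roughly like the following. This is only a sketch in Python (the real AvroStorage is Java), and the function name `to_avro_name` and its `replacement` parameter are hypothetical, not part of any patch. The regular expression encodes the name rule quoted from the Avro specification: a first character in [A-Za-z_], followed only by [A-Za-z0-9_].

```python
import re

# Name rule from the Avro specification:
# first character [A-Za-z_], remaining characters [A-Za-z0-9_].
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def to_avro_name(pig_name, replacement=None):
    """Map a Pig field name to a legal Avro name.

    `replacement` is a hypothetical option: when it is None, a name
    containing '::' is rejected; otherwise each '::' is replaced by it.
    """
    if "::" in pig_name:
        if replacement is None:
            raise ValueError("illegal '::' in field name: " + pig_name)
        pig_name = pig_name.replace("::", replacement)
    # fullmatch ensures the whole name obeys the rule, not just a prefix.
    if AVRO_NAME.fullmatch(pig_name) is None:
        raise ValueError("not a valid Avro name: " + pig_name)
    return pig_name
```

With `replacement="__"`, the failing field `group::one` from the report becomes `group__one`; with the default, the same field raises an error, which matches the proposed "throw unless the user opts in" behaviour.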
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015.patch I added support for files that don't have records, added option for dealing with double colons in variable names. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Review Request: PIG-3015 Rewrite of AvroStorage
On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
Overall looks great! I haven't gone through the test cases yet, but here are my comments so far.

1) I noticed that I cannot load .avro files that are not record types. For example, I tried to load a .avro file whose schema is int as follows:

[cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar getschema foo2/test_int.avro
int
[cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar tojson foo2/test_int.avro
1

in = LOAD 'foo2/test_int.avro' USING AvroStorage('int');
DUMP in;

This gives me the following error:

Caused by: java.io.IOException: avroSchemaToResourceSchema only processes records

Can only Avro record type be loaded? Or am I doing something wrong?

2) TestAvroStorage needs to be more automated. To run it, I had to run the following commands:

ant clean compile-test
cd ./test/org/apache/pig/builtin/avro
python createests.py
cd -
ant clean test -Dtestcase=TestAvroStorage

Ideally, I should be able to run a single command: ant clean -Dtestcase=TestAvroStorage. Please let me know if you need help for this.

3) python createests.py fails with the following errors. I suppose that some files are missing:

creating data/avro/uncompressed/testDirectoryCounts.avro
Exception in thread "main" java.io.FileNotFoundException: data/json/testDirectoryCounts.json (No such file or directory)
...
creating evenFileNameTestDirectoryCounts.avro
Exception in thread "main" java.io.FileNotFoundException: data/json/evenFileNameTestDirectoryCounts.json (No such file or directory)
...

4) ant test -Dtestcase=TestAvroStorage fails with the following errors. I suppose that this is due to the missing files:

Testcase: testLoadDirectory took 0.005 sec FAILED
Testcase: testLoadGlob took 0.004 sec FAILED
Testcase: testPartialLoadGlob took 0.005 sec FAILED

5) Typo in the name of createests.py. It should be createtests.py.

6) Is createTests.bash needed at all? If not, can you remove it?

I have more comments inline:

Sounds like the python script isn't working completely correctly. I'll debug that script and make sure it generates all the required files. Can I take you up on your offer to help automate that build process? I'm not exactly sure what to modify to automatically run the python script to create the test files.

On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
src/org/apache/pig/builtin/AvroStorage.java, lines 296-305
https://reviews.apache.org/r/8104/diff/1/?file=191564#file191564line296

This won't work in the following case. Let's say p matches two dirs, and one dir is empty.

p = foo*
foo1
foo2/bar.avro

I would expect the schema of bar.avro is returned, but I get an IOException instead.

Added proper depth first search to find the first file. (I decided to sort by modification date, most recent first.)

- Joseph

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/8104/#review13962 ---

On Nov. 17, 2012, 5:28 a.m., Joseph Adler wrote:
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/8104/ --- (Updated Nov. 17, 2012, 5:28 a.m.)

Review request for pig and Cheolsoo Park.

Description
---
The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. This is the latest version of the patch, complete with test cases and TrevniStorage. (Test cases for TrevniStorage are still missing).

This addresses bug PIG-3015.
https://issues.apache.org/jira/browse/PIG-3015 Diffs - build.xml 7d468a0 ivy.xml 70e8d50 ivy/libraries.properties 317564f src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION src/org/apache/pig/impl/util/AvroBagWrapper.java PRE-CREATION src/org/apache/pig/impl/util/AvroMapWrapper.java PRE-CREATION src/org/apache/pig/impl/util/AvroRecordReader.java PRE-CREATION src/org/apache/pig/impl/util/AvroRecordWriter.java PRE-CREATION src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java PRE-CREATION src/org/apache/pig
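The reply above describes a depth-first search for the first file, visiting the most recently modified entries first. A minimal sketch of that idea follows, in Python against the local filesystem rather than the Hadoop FileSystem API the patch would use. The function name `first_file` is hypothetical, and skipping names that start with `.` or `_` is an assumption borrowed from the usual Hadoop hidden-file convention, not something the review states.

```python
import os


def first_file(path):
    """Return the first regular file found under `path`, or None.

    Directories are searched depth first; within a directory, entries
    are visited newest modification time first.
    """
    if os.path.isfile(path):
        return path
    if not os.path.isdir(path):
        return None
    entries = [
        os.path.join(path, name)
        for name in os.listdir(path)
        # Assumption: treat '.' and '_' prefixed names as hidden.
        if not name.startswith((".", "_"))
    ]
    entries.sort(key=os.path.getmtime, reverse=True)
    for entry in entries:
        found = first_file(entry)
        if found is not None:
            return found
    return None
```

In the reviewer's scenario (an empty `foo1` next to `foo2/bar.avro`), the search descends into both directories as needed and returns `foo2/bar.avro` instead of failing on the empty one.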
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13509992#comment-13509992 ] Joseph Adler commented on PIG-3015: ---

I think that approach makes sense; each object in a file should be wrapped in a Tuple. Suppose that a file example.avro contained the data:

{[1, 2, 3, 4, 5]}
{[6, 7, 8, 9, 10]}

and had this schema:

{"name" : "IntArray", "type" : "array", "items" : "int"}

and we loaded this as

A = LOAD 'example.avro' USING AvroStorage;

The bag A would have the Pig schema A:{(IntArray:{(int)})}; it would contain two tuples, which would in turn each contain one bag of integers. Does that sound correct? If so, I'll go implement that.

Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
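To make the proposed mapping concrete, here is a small Python model of it. It is a sketch only: Pig tuples are modelled as Python tuples and Pig bags as Python lists, and the helper names `avro_to_pig` and `load` are hypothetical. It does not use the Avro or Pig libraries.

```python
def avro_to_pig(datum):
    """Convert a decoded Avro value to a Pig-like value.

    An Avro array becomes a bag (list) whose elements are 1-tuples,
    mirroring the Pig schema {(int)} for an array of ints.
    """
    if isinstance(datum, list):
        return [(avro_to_pig(item),) for item in datum]
    return datum


def load(data):
    """Wrap every top-level datum in its own tuple, one tuple per datum."""
    return [(avro_to_pig(datum),) for datum in data]


# The two arrays from the example file.
A = load([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
```

Here `A` holds two tuples, and the single field of each tuple is a bag of five 1-tuples, which corresponds to the schema A:{(IntArray:{(int)})} given in the comment.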
Re: Review Request: PIG-3015 Rewrite of AvroStorage
On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
src/org/apache/pig/builtin/AvroStorage.java, lines 171-172
https://reviews.apache.org/r/8104/diff/1/?file=191564#file191564line171

Same problem as above.

Fixing this one within getAvroSchema

On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
src/org/apache/pig/builtin/AvroStorage.java, lines 382-388
https://reviews.apache.org/r/8104/diff/1/?file=191564#file191564line382

Is this needed? In the constructor, schema is supposed to be set. If not, there must be an error. Shouldn't we throw an exception instead of re-trying to set schema? Please correct me if I am wrong.

Pretty sure you're right about this one (and that this code is redundant).

On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
src/org/apache/pig/builtin/TrevniStorage.java, line 160
https://reviews.apache.org/r/8104/diff/1/?file=191565#file191565line160

AvroStorage accepts files that do not end .avro. Shouldn't TrevniStorage do the same?

Good point... though I realize that I've defined visible avro files and visible trevni files methods that are probably not useful. I should probably just drop the methods.

On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
src/org/apache/pig/impl/util/AvroRecordReader.java, lines 110-118
https://reviews.apache.org/r/8104/diff/1/?file=191568#file191568line110

I can't find where -ignoreErrors is used. I guess that error handling for bad files is not implemented yet?

No, I haven't implemented it yet. I suspect that the best way to implement the error ignoring functionality is from within Pig, and should apply to all file types (not just Avro)... I'll add that discussion to the right JIRA thread

On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java, lines 85-91
https://reviews.apache.org/r/8104/diff/1/?file=191571#file191571line85

How about a union type that contains a single data type (e.g. [string])? They're currently supported.

Good point; that's a trivial change. Adding that

On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
src/org/apache/pig/impl/util/AvroTupleWrapper.java, line 163
https://reviews.apache.org/r/8104/diff/1/?file=191572#file191572line163

Can you instead use log.debug(..., e)?

Just added the exception to the logging line above

- Joseph

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/8104/#review13962 ---

On Nov. 17, 2012, 5:28 a.m., Joseph Adler wrote:
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/8104/ --- (Updated Nov. 17, 2012, 5:28 a.m.)

Review request for pig and Cheolsoo Park.

Description
---
The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. This is the latest version of the patch, complete with test cases and TrevniStorage. (Test cases for TrevniStorage are still missing).

This addresses bug PIG-3015.
https://issues.apache.org/jira/browse/PIG-3015 Diffs - build.xml 7d468a0 ivy.xml 70e8d50 ivy/libraries.properties 317564f src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION src/org/apache/pig/impl/util/AvroBagWrapper.java PRE-CREATION src/org/apache/pig/impl/util/AvroMapWrapper.java PRE-CREATION src/org/apache/pig/impl/util/AvroRecordReader.java PRE-CREATION src/org/apache/pig/impl/util/AvroRecordWriter.java PRE-CREATION src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java PRE-CREATION src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java PRE-CREATION src/org/apache/pig/impl/util/AvroTupleWrapper.java PRE-CREATION test/commit-tests 5081fbc test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION test
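The single-type union case raised in the review above (e.g. [string]) can be handled alongside the nullable case by looking only at the non-null branches. The following Python sketch assumes the union is given as the list parsed from the schema JSON; `simplify_union` is a hypothetical helper, and rejecting every other union shape is an assumption about what Pig can represent, not a description of the patch.

```python
def simplify_union(union):
    """Reduce an Avro union to (branch_schema, nullable).

    Accepts [T] and unions of "null" with exactly one other type;
    raises ValueError for any union with zero or several non-null
    branches.
    """
    non_null = [branch for branch in union if branch != "null"]
    if len(non_null) != 1:
        raise ValueError("unsupported union: %r" % (union,))
    # The union is nullable when a "null" branch was filtered out.
    return non_null[0], len(non_null) < len(union)
```

Under these assumptions `["string"]` reduces to a plain, non-nullable string, `["null", "int"]` reduces to a nullable int, and `["int", "string"]` is rejected.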
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13509296#comment-13509296 ] Joseph Adler commented on PIG-3015: ---

I made most of the recommended changes (thanks for looking this over), and have a follow up question: I have always assumed that AvroStorage was designed to be used with Hadoop sequence files that contained a series of records, so I implemented AvroStorage to only work with a file in this format. Are there cases where the highest level schema for a file will be another type? If so... what does that mean for pig? Is there one record per file?

Here's a specific example: suppose that we have this schema:

{"name" : "IntArray", "type" : "array", "items" : "int"}

Suppose that we have 3 files to load, each with this schema, each containing an array of 10 integers. Should we load this into pig as a single bag with 30 integers? A bag containing three bags (each, in turn, containing 10 integers)? Or reject this file entirely?

Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error
[ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13509325#comment-13509325 ] Joseph Adler commented on PIG-2614: --- Could I propose an alternative? I like this functionality, but I don't think that this should be specific to Avro records. I think that it should be straightforward to modify org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader to implement this functionality for ALL LoadFunc types. Specifically, it should be possible to count the number of Exceptions thrown by the getNext method in the underlying load function (inside PigRecordReader.nextKeyValue). AvroStorage crashes on LOADING a single bad error - Key: PIG-2614 URL: https://issues.apache.org/jira/browse/PIG-2614 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.10.0, 0.11 Reporter: Russell Jurney Assignee: Jonathan Coveney Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism Fix For: 0.11, 0.10.1 Attachments: PIG-2614_0.patch, PIG-2614_1.patch, PIG-2614_2.patch, test_avro_files.tar.gz AvroStorage dies when a single bad record exists, such as one with missing fields. This is very bad on 'big data,' where bad records are inevitable. See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.
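The alternative proposed in the comment above, counting exceptions thrown by the load function's getNext in one place so that every loader benefits, can be sketched independently of Pig. The Python below is an illustration only: `read_tolerant`, `make_source` and the `max_errors` threshold are hypothetical names, and `get_next` stands in for LoadFunc.getNext, returning None at end of input.

```python
def read_tolerant(get_next, max_errors):
    """Yield records from get_next(), skipping records that raise.

    The error count is kept here, outside any particular loader, and
    the read aborts once more than `max_errors` records have failed.
    """
    errors = 0
    while True:
        try:
            record = get_next()
        except Exception as exc:
            errors += 1
            if errors > max_errors:
                raise RuntimeError("too many bad records: %d" % errors) from exc
            continue
        if record is None:  # end of input
            return
        yield record


def make_source(items):
    """Build a get_next() over `items`; Exception items are raised."""
    iterator = iter(items)

    def get_next():
        item = next(iterator, None)
        if isinstance(item, Exception):
            raise item
        return item

    return get_next
```

For example, `list(read_tolerant(make_source([1, ValueError("bad"), 2]), max_errors=1))` skips the bad record and returns the two good ones, while `max_errors=0` turns the same input into a failure.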
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Status: Open (was: Patch Available) replacing with revised patch Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: (was: PIG-3015.patch) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Status: Patch Available (was: Open) Revised patch; reflects comments and suggestions from review board Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015.patch Revised patch (compiles together all changes) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13506099#comment-13506099 ] Joseph Adler commented on PIG-3015: --- Hi Timothy: I have not tried the patch with Pig 0.10, but I don't know of any reason why it would not work. Give it a spin and let us know what happens. -- Joe Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error
[ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13506101#comment-13506101 ] Joseph Adler commented on PIG-2614: --- Repeating an old question: is there any reason that this patch is only for Avro? I think this could work for all storage types. AvroStorage crashes on LOADING a single bad error - Key: PIG-2614 URL: https://issues.apache.org/jira/browse/PIG-2614 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.10.0, 0.11 Reporter: Russell Jurney Assignee: Jonathan Coveney Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism Fix For: 0.11, 0.10.1 Attachments: PIG-2614_0.patch, PIG-2614_1.patch AvroStorage dies when a single bad record exists, such as one with missing fields. This is very bad on 'big data,' where bad records are inevitable. See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.
Re: LOAD multiple files with glob
It's a total rewrite, so it hasn't exactly made it in. But yes, file globs should work correctly. That's one of the unit tests. (All of the unit tests pass, incidentally.)

On Mon, Nov 26, 2012 at 10:23 AM, Russell Jurney russell.jur...@gmail.com wrote:

Is the globbing feature making it into the AvroStorage rewrite?

Russell Jurney twitter.com/rjurney

On Nov 26, 2012, at 7:50 AM, Bart Verwilst li...@verwilst.be wrote:

To answer myself again, I compiled Pig 0.11 and Piggybank, and it's working very well now, globbing seems to be fully supported!

Bart Verwilst wrote on 26.11.2012 15:33:

To answer myself, could this be part of the solution? : https://issues.apache.org/jira/browse/PIG-2492 Guess I'll have to wait for 0.11 then?

Bart Verwilst wrote on 26.11.2012 14:19:

14:16:08 centos6-hadoop-hishiru ~ $ cat avro-test.pig
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
avro = load '/test/*' USING AvroStorage();
describe avro;
14:16:09 centos6-hadoop-hishiru ~ $ pig avro-test.pig
Schema for avro unknown.
14:16:17 centos6-hadoop-hishiru ~ $ vim avro-test.pig
14:16:25 centos6-hadoop-hishiru ~ $ cat avro-test.pig
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
avro = load '/test/2012-11-25.avro' USING AvroStorage();
describe avro;
14:16:30 centos6-hadoop-hishiru ~ $ pig avro-test.pig
avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,heading: int,terminalid: int,customerid: chararray,mileage: int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id: long,value: chararray,pkey: chararray)}}
14:16:55 centos6-hadoop-hishiru ~ $ hadoop fs -ls /test/
Found 1 items
-rw-r--r-- 3 hdfs supergroup 63140500 2012-11-26 14:13 /test/2012-11-25.avro

Cheolsoo Park wrote on 26.11.2012 10:45:

Hi,

Invalid field projection. Projected field [tracetype] does not exist.

The error indicates that tracetype doesn't exist in the Pig schema of the relation avro. AvroStorage automatically converts the Avro schema to a Pig schema during the load. Although you have tracetype in your Avro schema, tracetype doesn't exist in the generated Pig schema for whatever reason. Can you please try to describe avro? You can replace the group and dump commands with describe in your Pig script. This will show you what the Pig schema of avro is. If tracetype indeed doesn't exist, you have to find out why it doesn't. It could be because the schema of the .avro files is not the same, or because there is a bug in AvroStorage, etc.

Maybe globbing with [] doesn't work, but wildcard works?

You're right. AvroStorage internally uses Hadoop path globbing, and Hadoop path globbing doesn't support '[ ]'. But the above error (Projected field [tracetype] does not exist) is not because of this. URISyntaxException is what you will get because of '[ ]'.

Thanks, Cheolsoo

On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst li...@verwilst.be wrote:

Just tried this:

----
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();
groups = group avro by tracetype;
dump groups;
----

gave me:

file avro-test.pig, line 10, column 23 Invalid field projection. Projected field [tracetype] does not exist.

Pig Stack Trace
---
ERROR 1025: file avro-test.pig, line 10, column 23 Invalid field projection. Projected field [tracetype] does not exist.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias groups
at org.apache.pig.PigServer.openIterator(PigServer.java:862)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:555)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error
[ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13501339#comment-13501339 ] Joseph Adler commented on PIG-2614: --- Just taking a look at this patch now. If I'm reading the code correctly, it should not be necessary for the author of a UDF (specifically a LoadFunc) to do anything special to take advantage of this functionality. Is that correct? AvroStorage crashes on LOADING a single bad error - Key: PIG-2614 URL: https://issues.apache.org/jira/browse/PIG-2614 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.10.0, 0.11 Reporter: Russell Jurney Assignee: Jonathan Coveney Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism Fix For: 0.11, 0.10.1 Attachments: PIG-2614_0.patch, PIG-2614_1.patch AvroStorage dies when a single bad record exists, such as one with missing fields. This is very bad on 'big data,' where bad records are inevitable. See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13501340#comment-13501340 ] Joseph Adler commented on PIG-3015: --- I just took a look at PIG-2614. It looks like the PIG-2614 patch will be compatible with this patch; PIG-2614 simply counts errors as values are read from a LoadFunc. Am I missing something? I'd be happy to drop the option to ignore bad records; I think that would make the options for this function cleaner and easier to understand. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
Review Request: PIG-3015 Rewrite of AvroStorage
-CREATION test/org/apache/pig/builtin/avro/data/json/recursiveRecord.json PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordWithRepeatedSubRecords.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/records.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsAsOutputByPig.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsOfArrays.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc PRE-CREATION test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION test/unit-tests 0f18a0e Diff: https://reviews.apache.org/r/8104/diff/ Testing --- Thanks, Joseph Adler
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13499348#comment-13499348 ] Joseph Adler commented on PIG-3015: --- I have made all the changes that you suggested (including rewriting the script that builds test cases in Python) and have uploaded the new version to the RB: https://reviews.apache.org/r/8104/ Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Work started] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on PIG-3015 started by Joseph Adler. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Status: Patch Available (was: In Progress) Here is a patch with a working implementation (plus new unit tests and a bash script to generate the test data files; just run the bash script in the test/org/apache/pig/builtin/avro directory to generate all the avro files needed for testing) Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Patch Info: Patch Available Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Adler updated PIG-3015: -- Attachment: PIG-3015.patch Here's the generated patch file. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler Attachments: PIG-3015.patch The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496730#comment-13496730 ] Joseph Adler commented on PIG-3015: --- Just TestAvroStorage, yes. I'm not trying to rewrite the whole test system, just clean up the AvroStorage tests. And yes, I'd want to either make an exception for corrupted Avro files or have a job that corrupts the files. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493550#comment-13493550 ] Joseph Adler commented on PIG-3015: --- I hate breaking backwards compatibility. (One of the reasons for doing the rewrite is that Avro broke backwards compatibility.) But I think we have some good reasons to do so here: - Options for AvroStorage are very different than options for other storage functions in Pig. In moving AvroStorage to builtin, it makes sense for AvroStorage to behave as close as possible to PigStorage, etc. - The huge number of crazy options makes the code slow and complicated. - There are good workarounds for many changes in the options. For example, all the weird stuff about selecting a schema using an index could be easily changed to explicit schema definitions. - It gets harder to make changes with time. This is probably the best opportunity to make the options simpler and clearer. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13491741#comment-13491741 ] Joseph Adler commented on PIG-3015: --- I put the code in o.a.impl.util. Not a big deal to move it later if that's the preferred style. Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler Assignee: Joseph Adler The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13487954#comment-13487954 ] Joseph Adler commented on PIG-3015: --- Before addressing the questions, I wanted to propose a naming scheme for the load and store functions. To be consistent with other Pig UDFs, I think it makes more sense to use different function names rather than passing different types of arguments to the UDF. Can I propose something like this:

LoadFuncs:
- AvroStorage. May be instantiated with zero, one, or two arguments. If called with no arguments, the function will load the schema from the most recent data file found in the specified path and use that schema. If called with one argument, the argument will be a String that specifies the input schema. The String may contain the schema definition, may be a URI that refers to the location of the input schema in a file, or may be an example data file from which to read the schema. If two arguments are specified, the first argument refers to the type of the output records (the name of the type) and the second argument may be a JSON string, a URI for a schema definition file, or a URI for an example file that contains the definition of that type. This function does not check schema compatibility of input files or allow recursive schema definitions. It fails when corrupted files are encountered.
- AvroStorage.AllowRecursive. Same as above, except this function does not check schema compatibility of input files but does allow recursive schema definitions. Recursively defined records are just defined as schemaless tuples in the Pig Schema.
- AvroStorage.IgnoreCorrupted. Same as above, except this function will not allow recursive schema definitions, but will not fail on corrupted input files.
- AvroStorage.AllowRecursiveAndIgnoreCorrupted. Same as above, except this function allows recursive definitions and does not fail on corrupted input files.

StoreFunc:
- AvroStorage. May be instantiated with zero, one, or two arguments; the meaning of the arguments can be inferred from how they are specified. If called with no arguments, the function will translate the pig schema to an Avro schema, use a default name for the record types, and not assign a namespace to the records. If called with one argument, the argument will be a String that may specify the output schema, or may specify the record name for the output records. If the String specifies the schema, it may contain the schema definition, may be a URI that refers to the location of the schema in a file, or may be an example data file from which to reuse the schema. If two arguments are specified, they may refer to the name and namespace for the output records. Alternately, the first argument may refer to the type of the output records (the name of the schema), and the second argument may be a JSON string, a URI for a schema definition file, or a URI for an example file that contains the definition of that type.

Answers to questions:

LoadFunc 1a: Yes, the storage function will convert avro schemas to pig schemas, and vice versa. I haven't tried to convert multiple compatible but different schemas to one pig schema. I believe that if you manually supply a schema to the function that is a superset of all the schemas in the input data, the underlying Avro libraries will take care of this for you... though this brings up another question: what does compatible mean in this case? Personally, I do not think that the core Pig library should attempt to resolve this problem for users; I think it is best for users to load files with different load functions, cast and rename fields as appropriate in pig code, then take a union of the values. It's possible to miss real (and important) errors if Pig does a lot of type conversions and manipulations under the covers.

LoadFunc 2: I think this is necessary for a few reasons: It's faster to supply a schema manually (the Pig run time doesn't have to read files from HDFS at planning time to detect the schema). By specifying the schema, you can also specify a subset of fields to de-serialize, reducing the size of the input data. Finally, by specifying a schema manually, you can read a set of files with compatible but different schemas.

I think PIG-2875 is a design mistake. If I had been involved in the project, I would have argued hard against this. You can't specify a recursive schema in Pig, so why allow users to load files with recursive schemas in Pig? It is possible to load recursively defined records into pig, but that seems like a recipe for confusion and errors. By default, recursive schema definitions should result in an error, or at least a warning message. I'd propose that this be allowed only as an option.

Storefunc 2a: I don't think it's hard to specify those three options. It's probably OK for the StoreFunc
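The one-argument form proposed above accepts a schema definition, a URI for a schema file, or an example data file in the same String, so the function has to decide which one it was given. The following is only an illustrative sketch of such a decision in Python: `classify_schema_argument` is a hypothetical name, and both rules (valid JSON means an inline schema, a `.avsc` suffix means a schema file) are assumptions, not behaviour described in the comment or taken from the patch.

```python
import json


def classify_schema_argument(arg):
    """Guess how a single string argument should be interpreted.

    Returns "inline-schema", "schema-file", or "example-data-file".
    The rules are illustrative assumptions, not AvroStorage behaviour.
    """
    text = arg.strip()
    # An inline Avro schema is JSON: an object, a union (array), or a quoted type name.
    if text[:1] in ("{", "[", '"'):
        try:
            json.loads(text)
            return "inline-schema"
        except ValueError:
            pass  # not valid JSON; fall through and treat it as a location
    # Assumed convention: schema definition files end in .avsc.
    if text.lower().endswith(".avsc"):
        return "schema-file"
    # Anything else is treated as an example data file to read the schema from.
    return "example-data-file"
```

A sketch like this also shows the cost of overloading one argument: the interpretation depends on heuristics the user cannot see, which is part of the argument for separate function names.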
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13487179#comment-13487179 ] Joseph Adler commented on PIG-3015: --- Just reading through the discussion on the user list. I'll check out trunk, refactor/rename as needed, make sure it passes existing tests, fix bugs, then submit the patches. That will probably take me a few days to do. Additionally, I'd like to get a few things correct the first time. Specifically, I'm trying to figure out how to deal with the plethora of possible options for load/store functions. I want to make sure that I cover all the important use cases regarding schemas. Here's the list that I came up with: LoadFunc: (1) Read the schema from the input file(s) (a) Just pick the schema from the most recent file (b) Check all the files to make sure the schemas are compatible (2) Use a schema manually provided by the user StoreFunc: (1) Automatically translate the Pig schema to an Avro Schema (2) Use a schema manually provided by the user (a) Allow the user to name the records and name space (b) Automatically pick a record and namespace name Rewrite of AvroStorage -- Key: PIG-3015 URL: https://issues.apache.org/jira/browse/PIG-3015 Project: Pig Issue Type: Improvement Components: piggybank Reporter: Joseph Adler The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.) I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni. I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Created] (PIG-3015) Rewrite of AvroStorage
Joseph Adler created PIG-3015:
------------------------------

Summary: Rewrite of AvroStorage
Key: PIG-3015
URL: https://issues.apache.org/jira/browse/PIG-3015
Project: Pig
Issue Type: Improvement
Components: piggybank
Reporter: Joseph Adler

The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni.

I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.
[jira] [Commented] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486489#comment-13486489 ]

Joseph Adler commented on PIG-3015:
-----------------------------------

Here's the working version: https://github.com/josephadler/fast-avro-storage

I can break that up into multiple Jira tickets, though that feels like a lot of extra work; I threw away all the existing code and started from scratch. I do think it's reasonable to separate AvroStorage and TrevniStorage for now (though they are very closely related).
Re: [VOTE] Release Pig 0.10.0 (candidate 0)
Can you guys please fix https://issues.apache.org/jira/browse/PIG-2266

Without that, I can guarantee that AvroStorage will fail on large files.

On Mon, Apr 23, 2012 at 8:07 PM, Joseph Adler joseph.ad...@me.com wrote:

I will do it tomorrow on one of my workflows. Could take some trial and error to get it working.

-- Joe

On Apr 23, 2012, at 6:53 PM, Russell Jurney russell.jur...@gmail.com wrote:

Can someone from LinkedIn try this release candidate? It may break your AvroStorage, so that would be good to know.

Russell Jurney
http://datasyndrome.com

On Apr 23, 2012, at 6:36 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

+1

Verified several jobs using Elephant-Bird loaders.
Tested correctness with pig.exec.mapPartAgg both true and false.
Verified license.
Verified release notes.
Ran test-commit

D

On Sat, Apr 21, 2012 at 12:27 PM, Daniel Dai da...@hortonworks.com wrote:

We should do sanity check of the package, such as unit tests, e2e tests, piggybank tests, package integrity, package signature, license, etc. However, if we find a new bug, usually we will push it to the next release at this stage unless it is a critical one.

Thanks,
Daniel

On Sat, Apr 21, 2012 at 12:48 AM, Prashant Kommireddi prash1...@gmail.com wrote:

Hi Daniel,

What is required other than running the regular tests for testing release candidate? I can think of running a few existing scripts against candidate build and making sure outputs look fine.

Thanks,
Prashant

On Fri, Apr 20, 2012 at 12:39 AM, Daniel Dai da...@hortonworks.com wrote:

Hi,

I have created a candidate build for Pig 0.10.0.

Keys used to sign the release are available at http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup.

Please download, test, and try it out: http://people.apache.org/~daijy/pig-0.10.0-candidate-0/

Should we release this? Vote closes on next Tuesday, Apr 24th.

Daniel
Re: [VOTE] Release Pig 0.10.0 (candidate 0)
I will do it tomorrow on one of my workflows. Could take some trial and error to get it working.

-- Joe

On Apr 23, 2012, at 6:53 PM, Russell Jurney russell.jur...@gmail.com wrote:

Can someone from LinkedIn try this release candidate? It may break your AvroStorage, so that would be good to know.

Russell Jurney
http://datasyndrome.com
[jira] [Created] (PIG-2378) macros don't accept references to items within tuples as arguments
macros don't accept references to items within tuples as arguments
------------------------------------------------------------------

Key: PIG-2378
URL: https://issues.apache.org/jira/browse/PIG-2378
Project: Pig
Issue Type: Improvement
Affects Versions: 0.9.1
Reporter: Joseph Adler

I'd like to be able to pass a reference to an item within a parameter to a Pig Macro. For example, suppose that I had a relation A with the schema A:{id:long, header:(time:long, type:chararray)}. I'd like to call a macro by typing:

    B = MY_MACRO(A, header.time);

but this does not currently work. Obviously, I could define a new relation as a workaround. For example, I could use some Pig code like:

    AA = FOREACH A GENERATE *, header.time as time;
    B = MY_MACRO(AA, time);

But that's ugly and clunky.
[jira] [Commented] (PIG-2266) bug with input file joining optimization in Pig
[ https://issues.apache.org/jira/browse/PIG-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098408#comment-13098408 ]

Joseph Adler commented on PIG-2266:
-----------------------------------

Index: MRCompiler.java
===================================================================
--- MRCompiler.java (revision 1165764)
+++ MRCompiler.java (working copy)
@@ -1353,7 +1353,8 @@
                     .instantiateFuncFromSpec(ld.getLFile()
                             .getFuncSpec());
             Job job = new Job(conf);
-            loader.setLocation(location, job);
+            loader.setUDFContextSignature(ld.getSignature());
+            loader.setLocation(location, job);
             InputFormat inf = loader.getInputFormat();
             List<InputSplit> splits = inf.getSplits(HadoopShims.cloneJobContext(job));
             List<List<InputSplit>> results = MapRedUtil

bug with input file joining optimization in Pig
-----------------------------------------------

Key: PIG-2266
URL: https://issues.apache.org/jira/browse/PIG-2266
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.9.0
Reporter: Joseph Adler

In src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java, the function hasTooManyInputFiles instantiates a LoadFunc instance, then calls setLocation before calling setUDFContextSignature. This is inconsistent with the documentation for the LoadFunc interface (see http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/LoadFunc.html#setUDFContextSignature(java.lang.String)). (We've written UDFs that assume that setUDFContextSignature is called first.)

I think you can fix this by adding

    loader.setUDFContextSignature(ld.getSignature());

before

    loader.setLocation(location, job);
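
To illustrate why the call order in the report above matters, here is a minimal, self-contained sketch. It does not use the real Pig API: SignatureKeyedLoader is a hypothetical stand-in for a LoadFunc that keys its per-instance state on the UDF context signature, which is the kind of loader the report says assumes setUDFContextSignature runs first. The class name, the map-based state, and the sample path and signature are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class LoaderCallOrder {

    /** Hypothetical stand-in for a Pig LoadFunc; NOT the real Pig API. */
    static class SignatureKeyedLoader {
        // Per-signature state, standing in for properties a loader might keep per UDF context.
        private final Map<String, String> stateBySignature = new HashMap<>();
        private String signature; // stays null until setUDFContextSignature is called

        void setUDFContextSignature(String signature) {
            this.signature = signature;
        }

        void setLocation(String location) {
            // A loader written against the documented order relies on the signature here.
            if (signature == null) {
                throw new IllegalStateException(
                        "setLocation called before setUDFContextSignature");
            }
            stateBySignature.put(signature, location);
        }

        String locationFor(String sig) {
            return stateBySignature.get(sig);
        }
    }

    /** Mirrors the order described in the bug report: setLocation first. */
    public static boolean locationFirstFails() {
        SignatureKeyedLoader loader = new SignatureKeyedLoader();
        try {
            loader.setLocation("input/part-00000.avro");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** Mirrors the order proposed in the patch: signature first, then location. */
    public static String signatureFirst() {
        SignatureKeyedLoader loader = new SignatureKeyedLoader();
        loader.setUDFContextSignature("load-1");
        loader.setLocation("input/part-00000.avro");
        return loader.locationFor("load-1");
    }

    public static void main(String[] args) {
        System.out.println("location first fails: " + locationFirstFails());
        System.out.println("signature first stores: " + signatureFirst());
    }
}
```

A real loader may fail less loudly than this sketch, for example by storing state under the wrong key instead of throwing, so the exception here is only a way to make the ordering dependency visible.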