[jira] [Updated] (PIG-3526) Unions with Enums do not work with AvroStorage

2013-10-17 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3526:
--

Attachment: PIG-3526.patch

Patch for this issue.

 Unions with Enums do not work with AvroStorage
 --

 Key: PIG-3526
 URL: https://issues.apache.org/jira/browse/PIG-3526
 Project: Pig
  Issue Type: Bug
  Components: internal-udfs
Affects Versions: 0.12.0
Reporter: Joseph Adler
 Fix For: 0.12.1

 Attachments: PIG-3526.patch


 If you have an input schema with unions of enum types and nulls, AvroStorage 
 can't read the data correctly. This patch will translate the enums to strings 
 so that Pig can process them.
 (Sorry for the short description and lack of a unit test; ran into this issue 
 while working on a deadline for another project.) 
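
The translation described above can be illustrated with a minimal sketch. It does not reproduce the attached patch and deliberately avoids the Avro API: the `Color` enum and the `toPigValue` helper are hypothetical stand-ins for an Avro enum symbol and for the conversion step, assuming the field's schema is a union of `null` and one enum type.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: models a nullable enum field without using Avro classes. */
final class EnumUnionSketch {
    /** Hypothetical stand-in for an Avro enum with symbols RED, GREEN, BLUE. */
    enum Color { RED, GREEN, BLUE }

    /** Converts a value read from a ["null", enum] union to something Pig can hold. */
    static Object toPigValue(Object value) {
        if (value == null) {
            return null;                       // the "null" branch of the union
        }
        if (value instanceof Enum<?>) {
            return ((Enum<?>) value).name();   // enum symbol becomes a string (chararray)
        }
        return value;                          // other branches pass through unchanged
    }

    public static void main(String[] args) {
        List<Object> column = Arrays.asList(Color.RED, null, Color.BLUE);
        for (Object v : column) {
            System.out.println(toPigValue(v));
        }
    }
}
```

The point of the design is that Pig has no enum type, so a string is the closest type that keeps the symbol readable.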



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (PIG-3526) Unions with Enums do not work with AvroStorage

2013-10-17 Thread Joseph Adler (JIRA)
Joseph Adler created PIG-3526:
-

 Summary: Unions with Enums do not work with AvroStorage
 Key: PIG-3526
 URL: https://issues.apache.org/jira/browse/PIG-3526
 Project: Pig
  Issue Type: Bug
  Components: internal-udfs
Affects Versions: 0.12.0
Reporter: Joseph Adler
 Fix For: 0.12.1
 Attachments: PIG-3526.patch

If you have an input schema with unions of enum types and nulls, AvroStorage 
can't read the data correctly. This patch will translate the enums to strings 
so that Pig can process them.

(Sorry for the short description and lack of a unit test; ran into this issue 
while working on a deadline for another project.) 






[jira] [Commented] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag

2013-10-07 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788655#comment-13788655
 ] 

Joseph Adler commented on PIG-3377:
---

Working on this now...

 New AvroStorage throws NPE when storing untyped map/array/bag
 -

 Key: PIG-3377
 URL: https://issues.apache.org/jira/browse/PIG-3377
 Project: Pig
  Issue Type: Bug
  Components: internal-udfs
Reporter: Cheolsoo Park
Assignee: Joseph Adler
 Fix For: 0.12.1


 The following example demonstrates the issue:
 {code}
 a = LOAD 'foo' AS (m:map[]);
 STORE a INTO 'bar' USING AvroStorage();
 {code}
 This fails with the following error:
 {code}
 java.lang.NullPointerException
 at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceFieldSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:462)
 at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:335)
 at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:472)
 {code}
 Similarly, an untyped bag causes the following error:
 {code}
 Caused by: java.lang.NullPointerException
 at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:722)
 ...
 at org.apache.avro.Schema.getElementType(Schema.java:256)
 at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:491)
 {code}
 The problem is that AvroStorage cannot derive the output schema from an untyped 
 map/bag/tuple. When the type is not defined, it should be assumed to be bytearray.
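
The fallback suggested in the description can be sketched as follows. This only illustrates the suggestion and is not necessarily what any attached patch does; `PigType`, `valueTypeOrDefault`, and `mapSchemaJson` are hypothetical names, and the schema is built as plain JSON text rather than through the Avro API.

```java
/** Illustrative only: default an undeclared map value type to bytearray. */
final class UntypedMapSketch {
    /** Hypothetical subset of Pig value types; not Pig's DataType constants. */
    enum PigType { BYTEARRAY, CHARARRAY, INT, LONG }

    /** Falls back to BYTEARRAY when the script declared no value type (m:map[]). */
    static PigType valueTypeOrDefault(PigType declared) {
        return declared == null ? PigType.BYTEARRAY : declared;
    }

    /** Avro primitive name for each type in this sketch. */
    static String avroName(PigType type) {
        switch (type) {
            case BYTEARRAY: return "bytes";
            case CHARARRAY: return "string";
            case INT:       return "int";
            case LONG:      return "long";
            default:        throw new IllegalArgumentException("unmapped type: " + type);
        }
    }

    /** Builds the JSON text of an Avro map schema with nullable values. */
    static String mapSchemaJson(PigType declared) {
        String values = avroName(valueTypeOrDefault(declared));
        return "{\"type\":\"map\",\"values\":[\"null\",\"" + values + "\"]}";
    }

    public static void main(String[] args) {
        System.out.println(mapSchemaJson(null));              // untyped map
        System.out.println(mapSchemaJson(PigType.CHARARRAY)); // typed map
    }
}
```

Resolving the default before any schema is built means the conversion code never sees a null type, which is the condition behind the NullPointerException shown above.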





[jira] [Updated] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag

2013-10-07 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3377:
--

Status: Patch Available  (was: Open)

 New AvroStorage throws NPE when storing untyped map/array/bag
 -

 Key: PIG-3377
 URL: https://issues.apache.org/jira/browse/PIG-3377
 Project: Pig
  Issue Type: Bug
  Components: internal-udfs
Reporter: Cheolsoo Park
Assignee: Joseph Adler
 Fix For: 0.12.1

 Attachments: PIG-3377.patch







[jira] [Updated] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag

2013-10-07 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3377:
--

Attachment: PIG-3377.patch

Patch for this issue (provides a meaningful error message)

 New AvroStorage throws NPE when storing untyped map/array/bag
 -

 Key: PIG-3377
 URL: https://issues.apache.org/jira/browse/PIG-3377
 Project: Pig
  Issue Type: Bug
  Components: internal-udfs
Reporter: Cheolsoo Park
Assignee: Joseph Adler
 Fix For: 0.12.1

 Attachments: PIG-3377.patch







[jira] [Commented] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag

2013-07-17 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711378#comment-13711378
 ] 

Joseph Adler commented on PIG-3377:
---

Want to assign this to me? I can take a look at this and submit a patch.

 New AvroStorage throws NPE when storing untyped map/array/bag
 -

 Key: PIG-3377
 URL: https://issues.apache.org/jira/browse/PIG-3377
 Project: Pig
  Issue Type: Bug
  Components: internal-udfs
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.12



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Review Request: PIG-3015 Rewrite of AvroStorage

2013-05-20 Thread Joseph Adler


 On March 19, 2013, 4:40 p.m., Jonathan Coveney wrote:
  src/org/apache/pig/builtin/AvroStorage.java, line 352
  https://reviews.apache.org/r/8104/diff/4/?file=244837#file244837line352
 
  I realize using Long's compareTo is convenient, but this seems like 
  unnecessary boxing. Why not just compare them directly? I realize this 
  isn't performance critical code, it just stuck out to me, since you could 
  just do a  instead...

For sorting, you need to implement compare (which tests for <, ==, and >). I 
switched to com.google.common.primitives.Longs.compare.
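
As a minimal sketch of the point about boxing, the example below compares primitive longs directly. It uses only the JDK: `Long.compare(long, long)` (Java 7 and later) has the same contract as Guava's `Longs.compare`. The `Entry` class, `compareKeys`, and `BY_KEY` are hypothetical names and are not taken from AvroStorage.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: three-way comparison of primitive longs without boxing. */
final class LongCompareSketch {
    /** Hypothetical value holder with a primitive long key. */
    static final class Entry {
        final long key;
        Entry(long key) { this.key = key; }
    }

    /** Hand-written three-way comparison; tests <, > and falls through to ==. */
    static int compareKeys(long a, long b) {
        return a < b ? -1 : (a > b ? 1 : 0);
    }

    /** Comparator usable for sorting; Long.compare(long, long) needs Java 7+. */
    static final Comparator<Entry> BY_KEY = new Comparator<Entry>() {
        @Override
        public int compare(Entry x, Entry y) {
            return Long.compare(x.key, y.key);
        }
    };

    public static void main(String[] args) {
        Entry[] entries = { new Entry(3L), new Entry(Long.MIN_VALUE), new Entry(1L) };
        Arrays.sort(entries, BY_KEY);
        for (Entry e : entries) {
            System.out.println(e.key);
        }
    }
}
```

A single `<` test is not enough for a sort comparator, and returning `(int) (a - b)` is not a safe shortcut because the subtraction can overflow for values far apart.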


 On March 19, 2013, 4:40 p.m., Jonathan Coveney wrote:
  src/org/apache/pig/impl/util/avro/AvroTupleWrapper.java, line 66
  https://reviews.apache.org/r/8104/diff/4/?file=244846#file244846line66
 
  May want to throw an UnsupportedOperationException instead, as if this 
  is being called, it's a more fundamental issue with Pig, separate from 
  write related issues.

Stuck with the exceptions in the existing Tuple interface... but yes, that 
would be more logical


 On March 19, 2013, 4:40 p.m., Jonathan Coveney wrote:
  src/org/apache/pig/impl/util/avro/AvroTupleWrapper.java, line 84
  https://reviews.apache.org/r/8104/diff/4/?file=244846#file244846line84
 
  shouldn't this throw an error? Or is avroObject.put() doing something I 
  don't expect, perhaps being 1-indexed instead of 0-indexed?

I think that write is never called; in the current version it just throws an 
error


- Joseph


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/#review18077
---


On Jan. 4, 2013, 7:22 p.m., Joseph Adler wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/8104/
 ---
 
 (Updated Jan. 4, 2013, 7:22 p.m.)
 
 
 Review request for pig and Cheolsoo Park.
 
 
 Description
 ---
 
 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni.
 
 This is the latest version of the patch, complete with test cases and 
 TrevniStorage. (Test cases for TrevniStorage are still missing).
 
 
 This addresses bug PIG-3015.
 https://issues.apache.org/jira/browse/PIG-3015
 
 
 Diffs
 -
 
   .eclipse.templates/.classpath c7b83b8 
   ivy.xml 70e8d50 
   ivy/libraries.properties 7b07c7e 
   src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION 
   src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION 
   src/org/apache/pig/impl/util/avro/AvroArrayReader.java PRE-CREATION 
   src/org/apache/pig/impl/util/avro/AvroBagWrapper.java PRE-CREATION 
   src/org/apache/pig/impl/util/avro/AvroMapWrapper.java PRE-CREATION 
   src/org/apache/pig/impl/util/avro/AvroRecordReader.java PRE-CREATION 
   src/org/apache/pig/impl/util/avro/AvroRecordWriter.java PRE-CREATION 
   src/org/apache/pig/impl/util/avro/AvroStorageDataConversionUtilities.java PRE-CREATION
   src/org/apache/pig/impl/util/avro/AvroStorageSchemaConversionUtilities.java PRE-CREATION
   src/org/apache/pig/impl/util/avro/AvroTupleWrapper.java PRE-CREATION
   test/commit-tests 5081fbc
   test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/identity_blank_first_args.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/identity_just_ao2.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/namesWithDoubleColons.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/recursive_tests.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/trevni_to_avro.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/code/pig/trevni_to_trevni.pig PRE-CREATION
   test/org/apache/pig/builtin/avro/data/json/arrays.json PRE-CREATION
   test/org/apache/pig/builtin/avro/data/json/arraysAsOutputByPig.json PRE-CREATION
   test/org/apache/pig/builtin/avro/data/json/recordWithRepeatedSubRecords.json PRE-CREATION
   test/org/apache/pig/builtin/avro/data/json/records.json PRE-CREATION 
   test/org/apache

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-05-20 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015-20May2013.diff

I'm getting confused by the names of the diffs. This one is a diff from trunk, 
as of now.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-10.patch, 
 PIG-3015-11.patch, PIG-3015-12.patch, PIG-3015-20May2013.diff, 
 PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, 
 PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc-2.patch, 
 PIG-3015-doc.patch, TestInput.java, Test.java, with_dates.pig


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Created] (PIG-3330) please fix the change that created a dependency on org.apache.pig.impl.PigImplConstants

2013-05-17 Thread Joseph Adler (JIRA)
Joseph Adler created PIG-3330:
-

 Summary: please fix the change that created a dependency on 
org.apache.pig.impl.PigImplConstants
 Key: PIG-3330
 URL: https://issues.apache.org/jira/browse/PIG-3330
 Project: Pig
  Issue Type: Bug
Reporter: Joseph Adler
Priority: Blocker


I can't build Pig from trunk because several source files (including 
org.apache.pig.Main.java) require org.apache.pig.impl.PigImplConstants, but 
that class isn't available.

I'm assuming someone left out a file on a recent commit.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-05-17 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015-12.patch

Incremental patch that adds support for pushing down projections, fixes some 
bugs with options, and gets all the test cases working again.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-10.patch, 
 PIG-3015-11.patch, PIG-3015-12.patch, PIG-3015-2.patch, PIG-3015-3.patch, 
 PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, 
 PIG-3015-9.patch, PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, 
 Test.java, with_dates.pig





[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-04-30 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645823#comment-13645823
 ] 

Joseph Adler commented on PIG-3015:
---

[~rohini]: Great question. I definitely implemented that interface in an 
earlier version; I'm not sure what happened to the code. Let me go through the 
patches to figure that one out.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-10.patch, 
 PIG-3015-11.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, 
 PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, 
 with_dates.pig





[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-04-30 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645826#comment-13645826
 ] 

Joseph Adler commented on PIG-3015:
---

[~rohini] OK, looks like I implemented the helper functions, and implemented 
the functionality for Trevni, but didn't implement it for AvroStorage. Will 
follow up with a patch.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-10.patch, 
 PIG-3015-11.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, 
 PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, 
 with_dates.pig





[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-04-15 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632347#comment-13632347
 ] 

Joseph Adler commented on PIG-3015:
---

Sorry to have taken so long to reply. 

I map any Pig type to a union of an Avro Type and Null. Here are the type 
mappings that I implemented:

Bag - Array
Big Chararray - String
Byte Array - Bytes
Chararray - String
Datetime - Long
Double - Double
Float - Float
Integer - Int
Map - Map
Null - Null
Tuple - Record

Byte, Error, Generic Writable, Internal Map, and Unknown aren't mapped to 
anything yet. Do we need to store these as well?
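
The mapping listed above can be written down as a lookup table. This is a minimal sketch, not the code from the patch: the `PigType` names and `nullableUnion` are hypothetical, the order of the branches inside the union is an assumption, and only type names are produced (a real array, map, or record schema also needs its items, values, or fields).

```java
import java.util.EnumMap;
import java.util.Map;

/** Illustrative only: the Pig-to-Avro type list as a table of type names. */
final class PigToAvroTypeSketch {
    /** Hypothetical names for the Pig types in the list above. */
    enum PigType {
        BAG, BIGCHARARRAY, BYTEARRAY, CHARARRAY, DATETIME,
        DOUBLE, FLOAT, INTEGER, MAP, NULL, TUPLE
    }

    private static final Map<PigType, String> AVRO_TYPE =
            new EnumMap<PigType, String>(PigType.class);

    static {
        AVRO_TYPE.put(PigType.BAG, "array");
        AVRO_TYPE.put(PigType.BIGCHARARRAY, "string");
        AVRO_TYPE.put(PigType.BYTEARRAY, "bytes");
        AVRO_TYPE.put(PigType.CHARARRAY, "string");
        AVRO_TYPE.put(PigType.DATETIME, "long");
        AVRO_TYPE.put(PigType.DOUBLE, "double");
        AVRO_TYPE.put(PigType.FLOAT, "float");
        AVRO_TYPE.put(PigType.INTEGER, "int");
        AVRO_TYPE.put(PigType.MAP, "map");
        AVRO_TYPE.put(PigType.NULL, "null");
        AVRO_TYPE.put(PigType.TUPLE, "record");
    }

    /** Type names of the union for a Pig type, written as JSON text. */
    static String nullableUnion(PigType type) {
        String avro = AVRO_TYPE.get(type);
        // Avro unions may not repeat a type, so Pig's null stays a plain "null".
        if ("null".equals(avro)) {
            return "\"null\"";
        }
        return "[\"null\",\"" + avro + "\"]";
    }

    public static void main(String[] args) {
        for (PigType type : PigType.values()) {
            System.out.println(type + " -> " + nullableUnion(type));
        }
    }
}
```

Wrapping every type in a union with null reflects that any Pig field may be null, at the cost of a slightly more verbose output schema.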

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-10.patch, 
 PIG-3015-11.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, 
 PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, 
 with_dates.pig





[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-03-18 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605579#comment-13605579
 ] 

Joseph Adler commented on PIG-3015:
---

I like the -tagsource option idea. Should we allow the user to provide a name 
for the tag source field? (If we picked a name like tagSource, and there was 
already a field in the Avro schema called tagSource, I'm concerned that we'd 
have to deal with that conflict. I think it would be cleaner to let the end 
user resolve the naming issue.)
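
The naming concern can be made concrete with a small sketch. Everything here is hypothetical (the `withTagField` helper, the error messages, and the idea of appending the tag field last); it only shows how a user-supplied name lets a conflict be reported instead of silently resolved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: reject a tag field name that collides with the schema. */
final class TagSourceSketch {
    /** Returns the schema's field names with the user-chosen tag field appended. */
    static List<String> withTagField(List<String> schemaFields, String tagFieldName) {
        if (tagFieldName == null || tagFieldName.isEmpty()) {
            throw new IllegalArgumentException("a tag field name is required");
        }
        if (schemaFields.contains(tagFieldName)) {
            throw new IllegalArgumentException(
                    "field '" + tagFieldName + "' already exists in the schema; choose another name");
        }
        List<String> result = new ArrayList<String>(schemaFields);
        result.add(tagFieldName);
        return result;
    }

    public static void main(String[] args) {
        // The schema already has a field called tagSource, so the user picks another name.
        List<String> fields = Arrays.asList("id", "tagSource");
        System.out.println(withTagField(fields, "inputFile"));
    }
}
```

With a fixed built-in name the loader would have to rename or overwrite one of the two fields; with a user-supplied name the check above is the only logic needed.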

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-10.patch, 
 PIG-3015-11.patch, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, 
 PIG-3015-doc-2.patch, PIG-3015-doc.patch, TestInput.java, Test.java, 
 with_dates.pig





[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-03-05 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: with_dates.pig

Missing test file (not a patch)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-10.patch, 
 PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, 
 PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-9.patch, PIG-3015-doc.patch, 
 TestInput.java, Test.java, with_dates.pig





[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-02-19 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581460#comment-13581460
 ] 

Joseph Adler commented on PIG-3015:
---

[~russell.jurney]: Reading through the stack trace that you posted, it does 
not look like the null pointer exception was occurring in TrevniStorage. (It 
looks like it was occurring in the Tokenizer). Does your script work correctly 
if you use it with another format, like PigStorage?



 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, 
 PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, 
 PIG-3015-doc.patch, TestInput.java, Test.java





[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-02-19 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015-9.patch

Added support for Pig dates to AvroStorage and TrevniStorage (they're 
translated to longs when storing values). Also added a new test case.
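
A minimal sketch of storing a date as a long follows. The unit is an assumption (milliseconds since the Unix epoch), and the sketch uses `java.time.Instant` from the JDK (Java 8 and later) only to stay self-contained; Pig itself represents datetimes with Joda-Time. `toAvroLong` and `fromAvroLong` are hypothetical names.

```java
import java.time.Instant;

/** Illustrative only: round-trip a timestamp through a long value. */
final class DateAsLongSketch {
    /** Encodes a timestamp as a long; milliseconds since the epoch is an assumption. */
    static long toAvroLong(Instant timestamp) {
        return timestamp.toEpochMilli();
    }

    /** Decodes a stored long back into a timestamp (always in UTC). */
    static Instant fromAvroLong(long stored) {
        return Instant.ofEpochMilli(stored);
    }

    public static void main(String[] args) {
        Instant original = Instant.parse("2013-02-19T12:30:00Z");
        long stored = toAvroLong(original);
        System.out.println(stored + " -> " + fromAvroLong(stored));
    }
}
```

A plain long carries no time zone or type marker, so a reader of the stored file sees an ordinary long field unless the schema or the load statement says otherwise.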

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, 
 PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, 
 PIG-3015-9.patch, PIG-3015-doc.patch, TestInput.java, Test.java





[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-02-12 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577185#comment-13577185
 ] 

Joseph Adler commented on PIG-3015:
---

I think the method setLocation for AvroStorage is marked as final. Does anyone 
object to removing the final modifier? 

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, 
 PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, 
 PIG-3015-8.patch, TestInput.java, Test.java





[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-02-11 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015-8.patch

Added description of AvroStorage and TrevniStorage to documentation. (Not 
finished editing yet, but wanted to share what I'd written so far.)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, 
 PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, 
 PIG-3015-8.patch, TestInput.java, Test.java


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-02-01 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13569190#comment-13569190
 ] 

Joseph Adler commented on PIG-3015:
---

Let me know what help you need. I can work on the documentation as well. Is 
early next week enough time? (Also, check out AVRO-1241. I couldn't get 
adequate performance without it.)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, 
 PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, 
 TestInput.java, Test.java


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Commented] (PIG-2266) bug with input file joining optimization in Pig

2013-01-28 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564924#comment-13564924
 ] 

Joseph Adler commented on PIG-2266:
---

Thanks for adding this fix!

 bug with input file joining optimization in Pig
 ---

 Key: PIG-2266
 URL: https://issues.apache.org/jira/browse/PIG-2266
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.9.0, 0.10.0
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-2266.patch


 In 
 src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
  the function hasTooManyInputFiles instantiates a LoadFunc instance, then 
 calls setLocation before calling setUDFContextSignature. This is inconsistent 
 with the documentation for the LoadFunc interface (see 
 http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/LoadFunc.html#setUDFContextSignature(java.lang.String)).
  (We've written UDFs that assume that setUDFContextSignature is called first.)
 I think you can fix this by adding 
loader.setUDFContextSignature(ld.getSignature());
 before
loader.setLocation(location, job);
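
For illustration, the ordering problem can be sketched with a made-up stand-in loader. SignatureAwareLoader and CallOrderDemo are hypothetical names, not Pig classes; only the two method names come from the ticket. The sketch assumes a loader that needs its signature before setLocation runs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a loader that relies on the documented call order.
// It is NOT org.apache.pig.LoadFunc; only the two method names are borrowed.
class SignatureAwareLoader {
    private String signature;                       // set by setUDFContextSignature
    final List<String> calls = new ArrayList<>();   // records the order of calls

    void setUDFContextSignature(String signature) {
        this.signature = signature;
        calls.add("setUDFContextSignature");
    }

    void setLocation(String location) {
        // A loader written against the documented contract may look up
        // per-instance state keyed by the signature, so it must already be set.
        if (signature == null) {
            throw new IllegalStateException(
                "setLocation called before setUDFContextSignature");
        }
        calls.add("setLocation");
    }
}

class CallOrderDemo {
    // Mirrors the proposed fix: set the signature first, then the location.
    static SignatureAwareLoader configure(String signature, String location) {
        SignatureAwareLoader loader = new SignatureAwareLoader();
        loader.setUDFContextSignature(signature);
        loader.setLocation(location);
        return loader;
    }

    // Mirrors the order reported in the ticket; true if the loader rejects it.
    static boolean reportedOrderFails(String location) {
        try {
            new SignatureAwareLoader().setLocation(location);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```

With this stand-in, CallOrderDemo.configure succeeds, while calling setLocation on a fresh loader fails, which is roughly the situation our UDFs ran into.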



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-01-28 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564926#comment-13564926
 ] 

Joseph Adler commented on PIG-3015:
---

Sorry, didn't mean to submit a patch with Avro 1.7.4-SNAPSHOT. I added a couple 
optimizations to Trevni so that the performance was comparable with Avro. (I'll 
submit that patch to Avro.)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, 
 PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, 
 TestInput.java, Test.java


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-01-24 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015-6.patch

Some additional bug fixes:

- Now correctly identifies recursive schema definitions
- TrevniStorage was not correctly flushing output buffers before closing, 
causing files to be corrupted
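
As a general illustration of the second item (not the actual TrevniStorage code), the sketch below shows how bytes written through a buffer stay invisible to the underlying sink until the buffer is flushed. FlushDemo is a made-up name, and the in-memory sink stands in for an output file:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class FlushDemo {
    // Writes a small payload through a buffer and reports how many bytes have
    // reached the underlying sink, with or without an explicit flush.
    static int bytesReachingSink(byte[] payload, boolean flush) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BufferedOutputStream buffered = new BufferedOutputStream(sink);
        try {
            buffered.write(payload);
            if (flush) {
                buffered.flush();   // push buffered bytes down to the sink
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // If the sink were closed or read at this point without a flush,
        // the payload would be missing from it.
        return sink.size();
    }
}
```

For a payload smaller than the buffer, nothing reaches the sink without the flush, so a writer that closes the underlying file first leaves a truncated file behind.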

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, 
 PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, TestInput.java, 
 Test.java


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Created] (PIG-3120) setStoreFuncUDFContextSignature called with null signature

2013-01-14 Thread Joseph Adler (JIRA)
Joseph Adler created PIG-3120:
-

 Summary: setStoreFuncUDFContextSignature called with null signature
 Key: PIG-3120
 URL: https://issues.apache.org/jira/browse/PIG-3120
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.12
Reporter: Joseph Adler
Priority: Critical
 Fix For: 0.12


I'm currently working on PIG-3015 and am having trouble passing the 
UDFContextSignature to the store func. It looks like the store func on the head 
end is being set to a non-null value, but a null value is being passed to 
setStoreFuncUDFContextSignature on the back end. I'm opening this ticket to 
track this issue; I'll follow up with a reproducible test case when I have a 
clean one.

I suspect this problem occurs when running on a real cluster, but may not occur 
in the standard unit tests.



[jira] [Commented] (PIG-3120) setStoreFuncUDFContextSignature called with null signature

2013-01-14 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553352#comment-13553352
 ] 

Joseph Adler commented on PIG-3120:
---

OK, tracked down the issue. It looks like the UDFContextSignature is not 
getting propagated if there is a LIMIT statement in the pig code.

Very specifically, in 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.LimitAdjuster.adjust,
 it looks like Pig was creating a new POStore object but not copying the 
signature. Here is the offending code:

{code}
// this is line 132...
POStore st = new POStore(new OperatorKey(scope, nig.getNextNodeId(scope)));
st.setSFile(oldSpec);
st.setIsTmpStore(oldIsTmpStore);
st.setSchema(((POStore)mpLeaf).getSchema());

limitAdjustMROp.reducePlan.addAsLeaf(st);
{code}

This is easily fixable by inserting this statement at line 137:

{code}
st.setSignature(((POStore)mpLeaf).getSignature());
{code}
I'll follow up with a patch for this issue.
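
To make the failure mode concrete, here is a simplified sketch using made-up classes. StoreOp and StoreRebuildDemo are hypothetical and are not Pig's POStore or LimitAdjuster; they only show how rebuilding an operator field by field can silently drop the signature:

```java
// Hypothetical, simplified stand-in for a store operator. It is not Pig's
// POStore; it only carries the fields needed to show the lost signature.
class StoreOp {
    final String file;
    final boolean tmpStore;
    final String signature;   // used by the store func to find its context entry

    StoreOp(String file, boolean tmpStore, String signature) {
        this.file = file;
        this.tmpStore = tmpStore;
        this.signature = signature;
    }
}

class StoreRebuildDemo {
    // Mirrors the reported behavior: the replacement store copies the file
    // and tmp-store flag, but the signature is forgotten.
    static StoreOp rebuildWithoutSignature(StoreOp old) {
        return new StoreOp(old.file, old.tmpStore, null);
    }

    // Mirrors the proposed one-line fix: carry the signature across as well.
    static StoreOp rebuildWithSignature(StoreOp old) {
        return new StoreOp(old.file, old.tmpStore, old.signature);
    }
}
```

In the first variant the back end would see a null signature even though the front end set one, which matches the symptom described in this ticket.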

 setStoreFuncUDFContextSignature called with null signature
 --

 Key: PIG-3120
 URL: https://issues.apache.org/jira/browse/PIG-3120
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.12
Reporter: Joseph Adler
Priority: Critical
 Fix For: 0.12


 I'm currently working on PIG-3015 and am having trouble passing the 
 UDFContextSignature to the store func. It looks like the store func on the 
 head end is being set to a non-null value, but a null value is being passed 
 to setStoreFuncUDFContextSignature on the back end. I'm opening this ticket 
 to track this issue; I'll follow up with a reproducible test case when I have 
 a clean one.
 I suspect this problem occurs when running on a real cluster, but may not 
 occur in the standard unit tests.



[jira] [Updated] (PIG-3120) setStoreFuncUDFContextSignature called with null signature

2013-01-14 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3120:
--

Status: Patch Available  (was: Open)

 setStoreFuncUDFContextSignature called with null signature
 --

 Key: PIG-3120
 URL: https://issues.apache.org/jira/browse/PIG-3120
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.12
Reporter: Joseph Adler
Priority: Critical
 Fix For: 0.12


 I'm currently working on PIG-3015 and am having trouble passing the 
 UDFContextSignature to the store func. It looks like the store func on the 
 head end is being set to a non-null value, but a null value is being passed 
 to setStoreFuncUDFContextSignature on the back end. I'm opening this ticket 
 to track this issue; I'll follow up with a reproducible test case when I have 
 a clean one.
 I suspect this problem occurs when running on a real cluster, but may not 
 occur in the standard unit tests.



[jira] [Updated] (PIG-3120) setStoreFuncUDFContextSignature called with null signature

2013-01-14 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3120:
--

Status: Open  (was: Patch Available)

 setStoreFuncUDFContextSignature called with null signature
 --

 Key: PIG-3120
 URL: https://issues.apache.org/jira/browse/PIG-3120
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.12
Reporter: Joseph Adler
Priority: Critical
 Fix For: 0.12


 I'm currently working on PIG-3015 and am having trouble passing the 
 UDFContextSignature to the store func. It looks like the store func on the 
 head end is being set to a non-null value, but a null value is being passed 
 to setStoreFuncUDFContextSignature on the back end. I'm opening this ticket 
 to track this issue; I'll follow up with a reproducible test case when I have 
 a clean one.
 I suspect this problem occurs when running on a real cluster, but may not 
 occur in the standard unit tests.



[jira] [Updated] (PIG-3120) setStoreFuncUDFContextSignature called with null signature

2013-01-14 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3120:
--

Attachment: PIG-3120.patch

This patch resolves an issue with UDF StoreFunc signatures when using LIMIT 
statements

 setStoreFuncUDFContextSignature called with null signature
 --

 Key: PIG-3120
 URL: https://issues.apache.org/jira/browse/PIG-3120
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.12
Reporter: Joseph Adler
Priority: Critical
 Fix For: 0.12

 Attachments: PIG-3120.patch


 I'm currently working on PIG-3015 and am having trouble passing the 
 UDFContextSignature to the store func. It looks like the store func on the 
 head end is being set to a non-null value, but a null value is being passed 
 to setStoreFuncUDFContextSignature on the back end. I'm opening this ticket 
 to track this issue; I'll follow up with a reproducible test case when I have 
 a clean one.
 I suspect this problem occurs when running on a real cluster, but may not 
 occur in the standard unit tests.



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-01-11 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551502#comment-13551502
 ] 

Joseph Adler commented on PIG-3015:
---

Just got bitten by PIG-2266 while doing some performance testing with this 
ticket. I'm going to add that fix to this patch so that AvroStorage and 
TrevniStorage actually work.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, 
 PIG-3015-4.patch, PIG-3015-5.patch, TestInput.java, Test.java


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-01-07 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13546164#comment-13546164
 ] 

Joseph Adler commented on PIG-3015:
---

Hi Cheolsoo:

What size file are you using? You can configure the sync interval with the 
parameter avro.mapred.sync.interval (defined in 
org.apache.avro.mapred.AvroOutputFormat), and implemented in my latest patch 
(the one from last week).
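
As a rough sketch of how a writer might pick up that setting, the snippet below reads the property from a plain java.util.Properties object. SyncIntervalConfig and the caller-supplied fallback are hypothetical; the patch itself reads the value from the job configuration, and only the property name is taken from above:

```java
import java.util.Properties;

class SyncIntervalConfig {
    // Property name as mentioned in this comment.
    static final String SYNC_INTERVAL_KEY = "avro.mapred.sync.interval";

    // Returns the configured sync interval, or the caller's fallback when the
    // property is absent. The fallback is an assumption of this sketch.
    static int syncInterval(Properties jobProps, int fallback) {
        String raw = jobProps.getProperty(SYNC_INTERVAL_KEY);
        if (raw == null) {
            return fallback;
        }
        int value = Integer.parseInt(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException(
                SYNC_INTERVAL_KEY + " must be positive, got " + value);
        }
        return value;
    }
}
```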

-- Joe

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch, Test.tar.gz


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015-5.patch

Added fixes for compression (and other metadata)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch, PIG-3015.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: (was: PIG-3015.patch)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: (was: PIG-3015-5.patch)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015-5.patch

Oops, this one contains the changes.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



Re: Review Request: PIG-3015 Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler
/recordsWithDoubleUnderscores.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/simpleRecordsTrevni.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/testDirectory.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/testDirectoryCounts.avsc PRE-CREATION 
  test/unit-tests 7cede06 

Diff: https://reviews.apache.org/r/8104/diff/


Testing
---


Thanks,

Joseph Adler



Re: Review Request: PIG-3015 Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler
/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/simpleRecordsTrevni.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/testDirectory.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/testDirectoryCounts.avsc PRE-CREATION 
  test/unit-tests 7cede06 

Diff: https://reviews.apache.org/r/8104/diff/


Testing
---


Thanks,

Joseph Adler



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544248#comment-13544248
 ] 

Joseph Adler commented on PIG-3015:
---

Hi Russ,

I think you're right... it looks like you could do something like this in 
AvroRecordReader.nextKeyValue:

{code}
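  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {

    // Stop once the reader has moved past the end of this split.
    if (reader.pastSync(end)) {
      return false;
    }

    try {
      currentRecord = reader.next(new GenericData.Record(schema));
    } catch (NoSuchElementException e) {
      return false;
    } catch (IOException ioe) {
      // Skip ahead to the next sync marker so a later call can resume
      // after the corrupted block, then report the error to the caller.
      reader.sync(reader.tell()+1);
      throw ioe;
    }

    return true;
  }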
{code}

Let me test this out to make sure it runs correctly on uncorrupted files. Would 
you mind creating a corrupted test file that I can use for testing?

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-01-03 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543455#comment-13543455
 ] 

Joseph Adler commented on PIG-3015:
---

Hi Cheolsoo,

You're totally right; I don't check the compression properties. I know that the 
avro mapred library does check those parameters 
(org.apache.avro.mapred.AvroOutputFormat), but I don't use that output format. 
Fixing and testing, will follow up with a patch.

-- Joe

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Commented] (PIG-3059) Global configurable minimum 'bad record' thresholds

2013-01-02 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13542718#comment-13542718
 ] 

Joseph Adler commented on PIG-3059:
---

Sorry to take so long to get back to this. It was a long break from work...

Thanks so much for taking this over. I like the way you've implemented this.

 Global configurable minimum 'bad record' thresholds
 ---

 Key: PIG-3059
 URL: https://issues.apache.org/jira/browse/PIG-3059
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.11
Reporter: Russell Jurney
Assignee: Cheolsoo Park
 Fix For: 0.12

 Attachments: avro_test_files-2.tar.gz, PIG-3059-2.patch, 
 PIG-3059.patch


 See PIG-2614. 
 Pig dies when one record in a LOAD of a billion records fails to parse. This 
 is almost certainly not the desired behavior. elephant-bird and some other 
 storage UDFs have minimum thresholds in terms of percent and count that must 
 be exceeded before a job will fail outright.
 We need these limits to be configurable for Pig, globally. I've come to 
 realize what a major problem Pig's crashing on bad records is for new Pig 
 users. I believe this feature can greatly improve Pig.
 An example of a config would look like:
 pig.storage.bad.record.threshold=0.01
 pig.storage.bad.record.min=100
 A thorough discussion of this issue is available here: 
 http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
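A rough sketch of how the two settings above might interact. It assumes (the ticket does not confirm this) that a job fails only when the number of bad records exceeds pig.storage.bad.record.min and the bad fraction also exceeds pig.storage.bad.record.threshold. The class BadRecordTracker and its methods are hypothetical names, not part of Pig or elephant-bird.

```python
class BadRecordTracker:
    """Hypothetical helper; not part of Pig or elephant-bird."""

    def __init__(self, threshold=0.01, min_bad=100):
        # Assumed meaning: largest tolerated fraction of bad records.
        self.threshold = threshold
        # Assumed meaning: number of bad records tolerated unconditionally.
        self.min_bad = min_bad
        self.total = 0
        self.bad = 0

    def record(self, ok):
        """Count one record; ok is False when the record failed to parse."""
        self.total += 1
        if not ok:
            self.bad += 1

    def should_fail(self):
        """Fail only when both the count limit and the fraction limit are exceeded."""
        if self.bad <= self.min_bad:
            return False
        return self.bad / self.total > self.threshold
```

Under these assumptions, a single bad record in a load of a billion never trips either limit, which is the behavior the ticket asks for.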



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-20 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: (was: PIG-3015.patch)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




Re: Review Request: PIG-3015 Rewrite of AvroStorage

2012-12-20 Thread Joseph Adler
/recordsOfArrays.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithDoubleUnderscores.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/simpleRecordsTrevni.avsc PRE-CREATION 
  test/unit-tests 7cede06 

Diff: https://reviews.apache.org/r/8104/diff/


Testing
---


Thanks,

Joseph Adler



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-17 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: (was: PIG-3015.patch)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-17 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015.patch

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-17 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13534212#comment-13534212
 ] 

Joseph Adler commented on PIG-3015:
---

My apologies; forgot to add those to the patch. Replaced the patch version.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




Re: Review Request: PIG-3015 Rewrite of AvroStorage

2012-12-17 Thread Joseph Adler
 
  test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithDoubleUnderscores.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/simpleRecordsTrevni.avsc PRE-CREATION 
  test/unit-tests 7cede06 

Diff: https://reviews.apache.org/r/8104/diff/


Testing
---


Thanks,

Joseph Adler



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-11 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13529293#comment-13529293
 ] 

Joseph Adler commented on PIG-3015:
---

Ivy should be able to pull the jar from a Maven repo. Do you need to build your 
own Avro jar from source?

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-06 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: (was: PIG-3015.patch)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-06 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015.patch

Added test cases for TrevniStorage (and made sure the test cases all pass)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-06 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Description: 
The current AvroStorage implementation has a lot of issues: it requires old 
versions of Avro, it copies data much more than needed, and it's verbose and 
complicated. (One pet peeve of mine is that old versions of Avro don't support 
Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new 
implementation is significantly faster, and the code is a lot simpler. 
Rewriting AvroStorage also enabled me to implement support for Trevni (as 
TrevniStorage).

I'm opening this ticket to facilitate discussion while I figure out the best 
way to contribute the changes back to Apache.

  was:
The current AvroStorage implementation has a lot of issues: it requires old 
versions of Avro, it copies data much more than needed, and it's verbose and 
complicated. (One pet peeve of mine is that old versions of Avro don't support 
Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new 
implementation is significantly faster, and the code is a lot simpler. 
Rewriting AvroStorage also enabled me to implement support for Trevni.

I'm opening this ticket to facilitate discussion while I figure out the best 
way to contribute the changes back to Apache.


 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-05 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13510629#comment-13510629
 ] 

Joseph Adler commented on PIG-3015:
---

Hi Johannes,

As you probably know, the Avro specification limits the set of valid characters 
in names (see http://avro.apache.org/docs/current/spec.html#Names). Names must

- start with [A-Za-z_]
- subsequently contain only [A-Za-z0-9_]

So double colons aren't allowed. PIG-2684 proposes using namespaces as the 
solution. I think that's a poor choice; namespaces are often used for other 
purposes. Specifically, namespaces are essential if you are writing 
complicated data processing software that processes multiple types of 
Avro-serialized objects. In my experience, the Avro schema and protocol 
compilers produce much better, more usable code if you use namespaces.
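For illustration, the two naming rules quoted above can be checked with a single regular expression. This is only a sketch; the function name is_valid_avro_name is made up for this example and is not part of Avro or Pig.

```python
import re

# First character must be in [A-Za-z_]; every later character in [A-Za-z0-9_].
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_avro_name(name):
    """Return True if name satisfies the Avro naming rules quoted above."""
    return AVRO_NAME.fullmatch(name) is not None
```

A flattened Pig field such as group::one fails this check because of the colons, which is the mismatch discussed here.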

There are two good workarounds:

- The Pig user can rename variables in a bag before storing the bag using 
AvroStorage
- The Pig user can manually specify the output schema before storing the bag 
with AvroStorage

So, here's a specific suggestion:

- By default, throw an exception if the Pig schema contains a name with a 
double colon and the user does not specify an output schema
- Add an option to AvroStorage to transform double colons to something else. 
(Maybe double underscores? Maybe storing them in the namespace?)

What do you think?



 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




[jira] [Commented] (PIG-2684) :: in field name causes AvroStorage to fail

2012-12-05 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13510645#comment-13510645
 ] 

Joseph Adler commented on PIG-2684:
---

I'm addressing this right now in PIG-3015. This isn't a bug; it's just a 
mismatch between the set of names that Avro allows and the names that Pig 
allows. (As a side note, there are good reasons why only some variable names 
are allowed in Avro: limiting the characters in names allows Avro to generate 
code to process Avro objects in a number of different languages. Colons in 
variable names would make it difficult to do this.)

First, there are two workarounds for this problem right now:

- The user can rename variables before storing the bag
- The user can manually specify the output schema 

Second, I don't like the idea of using namespaces for this. Namespaces are 
important for specific record types in Avro; they are translated by the 
protocol and schema compilers into package names for Java classes.

To make AvroStorage easier to use, I think it would make sense to add an 
option to AvroStorage to translate names with colons in some reasonable way: 
maybe translating the double colons to double underscores.
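A minimal sketch of that translation, assuming the replacement is a double underscore. The function names are hypothetical and this is not the code in the patch. The duplicate check is an extra precaution of mine, since two different Pig names could collapse to the same Avro name.

```python
def translate_field_name(name, replacement="__"):
    """Replace each '::' in a Pig field name with the replacement string."""
    return name.replace("::", replacement)


def translate_field_names(names, replacement="__"):
    """Translate a list of field names, refusing to produce duplicates."""
    translated = [translate_field_name(n, replacement) for n in names]
    if len(set(translated)) != len(translated):
        raise ValueError("translation produced duplicate field names")
    return translated
```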

 :: in field name causes AvroStorage to fail
 ---

 Key: PIG-2684
 URL: https://issues.apache.org/jira/browse/PIG-2684
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Reporter: Fabian Alenius

 There appears to be a bug in AvroStorage which causes it to fail when there 
 are field names that contain ::
 For example, the following will fail:
   data = load 'test.txt' as (one, two);
   grp = GROUP data by (one, two);
   result = foreach grp generate FLATTEN(group);
   store result into 'test.avro' using 
     org.apache.pig.piggybank.storage.avro.AvroStorage();
 ERROR 2999: Unexpected internal error. Illegal character in: group::one
 While the following will succeed:
   data = load 'test.txt' as (one, two);
   grp = GROUP data by (one, two);
   result = foreach grp generate FLATTEN(group) as (one,two);
   store result into 'test.avro' using 
     org.apache.pig.piggybank.storage.avro.AvroStorage();
 Here is a minimal test case:
   data = load 'test.txt' as (one::two, three);
   store data into 'test.avro' using 
     org.apache.pig.piggybank.storage.avro.AvroStorage();



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-05 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015.patch

I added support for files that don't have records, and added an option for 
dealing with double colons in variable names.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




Re: Review Request: PIG-3015 Rewrite of AvroStorage

2012-12-05 Thread Joseph Adler


 On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
  Overall looks great! I haven't gone through the test cases yet, but here 
  are my comments so far.
  
  
  1) I noticed that I cannot load .avro files that are not record types. For 
  example, I tried to load a .avro file whose schema is int as follows:
  
  [cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar getschema 
  foo2/test_int.avro 
  int
  
  [cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar tojson 
  foo2/test_int.avro 
  1
  
  in = LOAD 'foo2/test_int.avro' USING AvroStorage('int');
  DUMP in;
  
  This gives me the following error:
  
  Caused by: java.io.IOException: avroSchemaToResourceSchema only processes 
  records
  
  Can only Avro record types be loaded? Or am I doing something wrong?
  
  
  2) TestAvroStorage needs to be more automated. To run it, I had to run the 
  following commands:
  
  ant clean compile-test
  cd ./test/org/apache/pig/builtin/avro
  python createests.py
  cd -
  ant clean test -Dtestcase=TestAvroStorage
  
  Ideally, I should be able to run a single command: ant clean 
  -Dtestcase=TestAvroStorage. Please let me know if you need help for this.
  
  
  3) python createests.py fails with the following errors. I suppose that 
  some files are missing:
  
  creating data/avro/uncompressed/testDirectoryCounts.avro
  Exception in thread main java.io.FileNotFoundException: 
  data/json/testDirectoryCounts.json (No such file or directory)
  ...
  creating evenFileNameTestDirectoryCounts.avro
  Exception in thread main java.io.FileNotFoundException: 
  data/json/evenFileNameTestDirectoryCounts.json (No such file or directory)
  ...
  
  
  4) ant test -Dtestcase=TestAvroStorage fails with the following errors. I 
  suppose that this is due to the missing files:
  
  Testcase: testLoadDirectory took 0.005 sec
  FAILED
  Testcase: testLoadGlob took 0.004 sec
  FAILED
  Testcase: testPartialLoadGlob took 0.005 sec
  FAILED
  
  
  5) Typo in the name of createests.py. It should be createtests.py.
  
  
  6) Is createTests.bash needed at all? If not, can you remove it?
  
  
  I have more comments inline:

Sounds like the python script isn't working completely correctly. I'll debug 
that script and make sure it generates all the required files.

Can I take you up on your offer to help automate that build process? I'm not 
exactly sure what to modify to automatically run the python script to create 
the test files.


 On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
  src/org/apache/pig/builtin/AvroStorage.java, lines 296-305
  https://reviews.apache.org/r/8104/diff/1/?file=191564#file191564line296
 
  This won't work in the following case. Let's say p matches two dirs, 
  and one dir is empty.
  
  p = foo*
  
  foo1
  foo2/bar.avro
  
  I would expect the schema of bar.avro is returned, but I get an 
  IOException instead.

Added a proper depth-first search to find the first file. (I decided to sort by 
modification date, most recent first.)
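Roughly, the search works like the sketch below. It is written against the local filesystem with Python's os module purely for illustration; the patch itself works with Hadoop paths, and the function name find_first_file is made up here.

```python
import os


def find_first_file(path):
    """Depth-first search for the first regular file under path.

    Entries in each directory are visited newest first (by modification
    time). Returns None if the tree contains no files.
    """
    if os.path.isfile(path):
        return path
    if not os.path.isdir(path):
        return None
    entries = [os.path.join(path, name) for name in os.listdir(path)]
    entries.sort(key=os.path.getmtime, reverse=True)
    for entry in entries:
        found = find_first_file(entry)
        if found is not None:
            return found
    return None
```

With the layout from the review comment (an empty foo1 next to foo2/bar.avro), the search skips the empty directory and returns foo2/bar.avro instead of raising an error.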


- Joseph


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/#review13962
---


On Nov. 17, 2012, 5:28 a.m., Joseph Adler wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/8104/
 ---
 
 (Updated Nov. 17, 2012, 5:28 a.m.)
 
 
 Review request for pig and Cheolsoo Park.
 
 
 Description
 ---
 
 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni.
 
 This is the latest version of the patch, complete with test cases and 
 TrevniStorage. (Test cases for TrevniStorage are still missing).
 
 
 This addresses bug PIG-3015.
 https://issues.apache.org/jira/browse/PIG-3015
 
 
 Diffs
 -
 
   build.xml 7d468a0 
   ivy.xml 70e8d50 
   ivy/libraries.properties 317564f 
   src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION 
   src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION 
   src/org/apache/pig/impl/util/AvroBagWrapper.java PRE-CREATION 
   src/org/apache/pig/impl/util/AvroMapWrapper.java PRE-CREATION 
   src/org/apache/pig/impl/util/AvroRecordReader.java PRE-CREATION 
   src/org/apache/pig/impl/util/AvroRecordWriter.java PRE-CREATION 
   src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java 
 PRE-CREATION 
   src/org/apache/pig

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-04 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13509992#comment-13509992
 ] 

Joseph Adler commented on PIG-3015:
---

I think that approach makes sense; each object in a file should be wrapped in a 
Tuple. Suppose that a file example.avro contained the data:

  {[1, 2, 3, 4, 5]}
  {[6, 7, 8, 9, 10]}

and had this schema: {"name": "IntArray", "type": "array", "items": "int"}, 
and we loaded this as

  A = LOAD 'example.avro' USING AvroStorage;

The bag A would have the Pig schema A:{(IntArray:{(int)})}; it would contain 
two tuples, which would in turn each contain one bag of integers. Does that 
sound correct? If so, I'll go implement that.
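To make the proposed shape concrete, here is a small sketch that models a Pig bag as a Python list and a Pig tuple as a Python tuple. The function names are hypothetical, and this only illustrates the wrapping; it is not the loader code.

```python
def to_pig_value(datum):
    """Convert an Avro array (a Python list here) into a bag of one-field tuples."""
    if isinstance(datum, list):
        return [(to_pig_value(item),) for item in datum]
    return datum


def load_as_tuples(data):
    """Wrap every top-level object from the file in its own tuple."""
    return [(to_pig_value(datum),) for datum in data]
```

For the two arrays in example.avro this yields two tuples, each holding one bag of single-integer tuples, which matches the schema A:{(IntArray:{(int)})}.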


 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch




Re: Review Request: PIG-3015 Rewrite of AvroStorage

2012-12-03 Thread Joseph Adler


 On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
  src/org/apache/pig/builtin/AvroStorage.java, lines 171-172
  https://reviews.apache.org/r/8104/diff/1/?file=191564#file191564line171
 
  Same problem as above.

Fixing this one within getAvroSchema


 On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
  src/org/apache/pig/builtin/AvroStorage.java, lines 382-388
  https://reviews.apache.org/r/8104/diff/1/?file=191564#file191564line382
 
  Is this needed?
  
  In the constructor, schema is supposed to be set. If not, there must be 
  an error. Shouldn't we throw an exception instead of re-trying to set 
  schema?
  
  Please correct me if I am wrong.

Pretty sure you're right about this one (and that this code is redundant).


 On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
  src/org/apache/pig/builtin/TrevniStorage.java, line 160
  https://reviews.apache.org/r/8104/diff/1/?file=191565#file191565line160
 
  AvroStorage accepts files that do not end .avro. Shouldn't 
  TrevniStorage do the same?

Good point... though I realize that the methods I've defined for listing 
visible Avro files and visible Trevni files are probably not useful. I should 
probably just drop them.


 On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
  src/org/apache/pig/impl/util/AvroRecordReader.java, lines 110-118
  https://reviews.apache.org/r/8104/diff/1/?file=191568#file191568line110
 
  I can't find where -ignoreErrors is used. I guess that error handling 
  for bad files is not implemented yet?

No, I haven't implemented it yet. I suspect that the error-ignoring 
functionality is best implemented within Pig itself, and that it should apply 
to all file types (not just Avro)... I'll add that discussion to the right 
JIRA thread.


 On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
  src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java, 
  lines 85-91
  https://reviews.apache.org/r/8104/diff/1/?file=191571#file191571line85
 
  How about a union type that contains a single data type (e.g. 
  [string])? They're currently supported.

Good point; that's a trivial change. Adding that
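A sketch of the idea, treating a parsed Avro schema as plain Python data (a union is a list of branch schemas). The function name unwrap_union is hypothetical and this is not the code in AvroStorageSchemaConversionUtilities; it only shows how a single-type union can be handled alongside the nullable case.

```python
def unwrap_union(schema):
    """Return (inner_schema, nullable) for simple unions, else None.

    A non-union schema is returned unchanged with nullable=False.
    """
    if not isinstance(schema, list):
        return schema, False
    non_null = [s for s in schema if s != "null"]
    if len(non_null) != 1:
        # More than one non-null branch (or only null): not handled in this sketch.
        return None
    return non_null[0], len(non_null) != len(schema)
```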


 On Dec. 3, 2012, 7:22 p.m., Cheolsoo Park wrote:
  src/org/apache/pig/impl/util/AvroTupleWrapper.java, line 163
  https://reviews.apache.org/r/8104/diff/1/?file=191572#file191572line163
 
  Can you instead use log.debug(..., e)?

Just added the exception to the logging line above.


- Joseph


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/#review13962
---


On Nov. 17, 2012, 5:28 a.m., Joseph Adler wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/8104/
 ---
 
 (Updated Nov. 17, 2012, 5:28 a.m.)
 
 
 Review request for pig and Cheolsoo Park.
 
 
 Description
 ---
 
 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni.
 
 This is the latest version of the patch, complete with test cases and 
 TrevniStorage. (Test cases for TrevniStorage are still missing).
 
 
 This addresses bug PIG-3015.
 https://issues.apache.org/jira/browse/PIG-3015
 
 
 Diffs
 -
 
   build.xml 7d468a0 
   ivy.xml 70e8d50 
   ivy/libraries.properties 317564f 
   src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION 
   src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION 
   src/org/apache/pig/impl/util/AvroBagWrapper.java PRE-CREATION 
   src/org/apache/pig/impl/util/AvroMapWrapper.java PRE-CREATION 
   src/org/apache/pig/impl/util/AvroRecordReader.java PRE-CREATION 
   src/org/apache/pig/impl/util/AvroRecordWriter.java PRE-CREATION 
   src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java 
 PRE-CREATION 
   src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java 
 PRE-CREATION 
   src/org/apache/pig/impl/util/AvroTupleWrapper.java PRE-CREATION 
   test/commit-tests 5081fbc 
   test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION 
   test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION 
   test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION 
   test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION 
   test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION 
   test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION 
   test

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-03 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509296#comment-13509296
 ] 

Joseph Adler commented on PIG-3015:
---

I made most of the recommended changes (thanks for looking this over), and have 
a follow up question:

I have always assumed that AvroStorage was designed to be used with Hadoop 
sequence files that contained a series of records, so I implemented AvroStorage 
to only work with a file in this format. Are there cases where the highest 
level schema for a file will be another type? If so... what does that mean for 
pig? Is there one record per file?

Here's a specific example: suppose that we have this schema:

{"name": "IntArray", "type": "array", "items": "int"}

Suppose that we have 3 files to load, each with this schema, each containing an 
array of 10 integers. Should we load this into pig as a single bag with 30 
integers? A bag containing three bags (each, in turn, containing 10 integers)? 
Or reject this file entirely?
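
To make the first two options concrete, here is a small Python sketch of the two shapes. It is only an illustration of the question, not AvroStorage behavior: the function names load_flat and load_nested are made up for this example, and plain lists stand in for Pig bags.

```python
# Three "files", each holding one array of 10 integers (stand-in data).
files = [list(range(start, start + 10)) for start in (0, 10, 20)]


def load_flat(arrays):
    """Option 1: a single bag containing every integer from every file."""
    return [value for array in arrays for value in array]


def load_nested(arrays):
    """Option 2: a bag of bags, one inner bag per file."""
    return [list(array) for array in arrays]


flat = load_flat(files)      # 30 integers in one bag
nested = load_nested(files)  # 3 inner bags of 10 integers each
```

The flat form loses the file boundaries; the nested form keeps them at the cost of one extra level of nesting in the Pig schema.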

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni.
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

2012-12-03 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509325#comment-13509325
 ] 

Joseph Adler commented on PIG-2614:
---

Could I propose an alternative? 

I like this functionality, but I don't think that it should be specific to 
Avro records. I think that it should be straightforward to modify 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader to 
implement this functionality for ALL LoadFunc types. Specifically, it should be 
possible to count the number of Exceptions thrown by the getNext method in the 
underlying load function (inside PigRecordReader.nextKeyValue).
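
A rough sketch of the idea, written in Python rather than Java to keep it short. This is not Pig code: ErrorCountingReader, make_source, and the max_errors threshold are hypothetical names for this example, and get_next merely stands in for a load function's getNext, returning None at end of input.

```python
class ErrorCountingReader:
    """Wrap a record source, skipping and counting records that fail to load."""

    def __init__(self, get_next, max_errors):
        self._get_next = get_next      # stand-in for LoadFunc.getNext
        self._max_errors = max_errors  # assumed tolerance, not a Pig option
        self.errors = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            try:
                record = self._get_next()
            except Exception:
                self.errors += 1
                if self.errors > self._max_errors:
                    raise  # too many bad records: fail the task
                continue   # otherwise skip the bad record
            if record is None:
                raise StopIteration  # end of input
            return record


def make_source(items):
    """Build a fake get_next that raises whenever an item is an Exception."""
    iterator = iter(items)

    def get_next():
        item = next(iterator, None)
        if isinstance(item, Exception):
            raise item
        return item

    return get_next


reader = ErrorCountingReader(make_source([1, ValueError("bad record"), 2]), max_errors=1)
good_records = list(reader)
```

Because the counting sits in the wrapper, the load function itself needs no changes, which is the point of doing this once for all storage types.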



 AvroStorage crashes on LOADING a single bad error
 -

 Key: PIG-2614
 URL: https://issues.apache.org/jira/browse/PIG-2614
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.10.0, 0.11
Reporter: Russell Jurney
Assignee: Jonathan Coveney
  Labels: avro, avrostorage, bad, book, cutting, doug, for, my, 
 pig, sadism
 Fix For: 0.11, 0.10.1

 Attachments: PIG-2614_0.patch, PIG-2614_1.patch, PIG-2614_2.patch, 
 test_avro_files.tar.gz


 AvroStorage dies when a single bad record exists, such as one with missing 
 fields.  This is very bad on 'big data,' where bad records are inevitable.  
 See discussion at 
 http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
  for more theory.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-03 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Status: Open  (was: Patch Available)

replacing with revised patch



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-03 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: (was: PIG-3015.patch)



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-03 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Status: Patch Available  (was: Open)

Revised patch; reflects comments and suggestions from review board



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-03 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015.patch

Revised patch (compiles together all changes)



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-11-28 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506099#comment-13506099
 ] 

Joseph Adler commented on PIG-3015:
---

Hi Timothy:

I have not tried the patch with Pig 0.10, but I don't know of any reason why it 
would not work. Give it a spin and let us know what happens.

-- Joe



[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

2012-11-28 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506101#comment-13506101
 ] 

Joseph Adler commented on PIG-2614:
---

Repeating an old question: is there any reason that this patch is only for 
Avro? I think this could work for all storage types.



Re: LOAD multiple files with glob

2012-11-26 Thread Joseph Adler
It's a total rewrite, so it hasn't exactly made it in.

But yes, file globs should work correctly. That's one of the unit tests.
(All of the unit tests pass, incidentally.)


On Mon, Nov 26, 2012 at 10:23 AM, Russell Jurney
russell.jur...@gmail.com wrote:

 Is the globbing feature making it into the AvroStorage rewrite?

 Russell Jurney twitter.com/rjurney


 On Nov 26, 2012, at 7:50 AM, Bart Verwilst li...@verwilst.be wrote:

  To answer myself again, I compiled Pig 0.11 and Piggybank, and it's
 working very well now, globbing seems to be fully supported!
 
  Bart Verwilst schreef op 26.11.2012 15:33:
  To answer myself, could this be part of the solution? :
 
  https://issues.apache.org/jira/browse/PIG-2492
 
  Guess I'll have to wait for 0.11 then?
 
  Bart Verwilst schreef op 26.11.2012 14:19:
  14:16:08  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
  REGISTER 'hdfs:///lib/avro-1.7.2.jar';
  REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
  REGISTER 'hdfs:///lib/piggybank.jar';
 
  DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
  avro = load '/test/*' USING AvroStorage();
  describe avro;
 
  14:16:09  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
  Schema for avro unknown.
 
  14:16:17  centos6-hadoop-hishiru  ~ $ vim avro-test.pig
 
  14:16:25  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
  REGISTER 'hdfs:///lib/avro-1.7.2.jar';
  REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
  REGISTER 'hdfs:///lib/piggybank.jar';
 
  DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
  avro = load '/test/2012-11-25.avro' USING AvroStorage();
  describe avro;
 
  14:16:30  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
  avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
  int,heading: int,terminalid: int,customerid: chararray,mileage:
  int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
  (id: long,value: chararray,pkey: chararray)}}
 
  14:16:55  centos6-hadoop-hishiru  ~ $ hadoop fs -ls /test/
  Found 1 items
  -rw-r--r--   3 hdfs supergroup   63140500 2012-11-26 14:13
 /test/2012-11-25.avro
 
  Cheolsoo Park schreef op 26.11.2012 10:45:
  Hi,
 
  Invalid field projection. Projected field [tracetype] does not exist.
 
  The error indicates that tracetype doesn't exist in the Pig schema of
  the relation avro. What AvroStorage does is automatically convert the Avro
  schema to a Pig schema during the load. Although you have tracetype in your
  Avro schema, tracetype doesn't exist in the generated Pig schema for
  whatever reason.
 
  Can you please try to describe avro? You can replace the group and dump
  commands with describe in your Pig script. This will show you what the Pig
  schema of avro is. If tracetype indeed doesn't exist, you have to find
  out why it doesn't. It could be because the schemas of the .avro files are
  not the same, or because there is a bug in AvroStorage, etc.
 
  Maybe globbing with [] doesn't work, but wildcard works?
 
  You're right. AvroStorage internally uses Hadoop path globbing, and Hadoop
  path globbing doesn't support '[ ]'. But the above error (Projected field
  [tracetype] does not exist) is not because of this. URISyntaxException is
  what you will get because of '[ ]'.
 
  Thanks,
  Cheolsoo
 
 
 
  On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst li...@verwilst.be
 wrote:
 
  Just tried this:
 
 
  ----
  REGISTER 'hdfs:///lib/avro-1.7.2.jar';
  REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
  REGISTER 'hdfs:///lib/piggybank.jar';
 
  DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
 
  avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();
 
  groups = group avro by tracetype;
 
  dump groups;
  ----
 
  gave me:
 
  file avro-test.pig, line 10, column 23 Invalid field projection.
  Projected field [tracetype] does not exist.
 
  Pig Stack Trace
  ---
  ERROR 1025:
  file avro-test.pig, line 10, column 23 Invalid field projection.
  Projected field [tracetype] does not exist.
 
  org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066:
  Unable to open iterator for alias groups
 at org.apache.pig.PigServer.openIterator(PigServer.java:862)
 at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
 at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
 at org.apache.pig.Main.run(Main.java:555)
 at org.apache.pig.Main.main(Main.java:111)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

2012-11-20 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501339#comment-13501339
 ] 

Joseph Adler commented on PIG-2614:
---

Just taking a look at this patch now.

If I'm reading the code correctly, it should not be necessary for the author of 
a UDF (specifically a LoadFunc) to do anything special to take advantage of 
this functionality. Is that correct?



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-11-20 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501340#comment-13501340
 ] 

Joseph Adler commented on PIG-3015:
---

I just took a look at PIG-2614. It looks like the PIG-2614 patch will be 
compatible with this patch; PIG-2614 simply counts errors as values are read 
from a LoadFunc. Am I missing something? I'd be happy to drop the option to 
ignore bad records; I think that would make the options for this function 
cleaner and easier to understand.



Review Request: PIG-3015 Rewrite of AvroStorage

2012-11-16 Thread Joseph Adler
-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recursiveRecord.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordWithRepeatedSubRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/records.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsAsOutputByPig.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsOfArrays.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION 
  test/unit-tests 0f18a0e 

Diff: https://reviews.apache.org/r/8104/diff/


Testing
---


Thanks,

Joseph Adler



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-11-16 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499348#comment-13499348
 ] 

Joseph Adler commented on PIG-3015:
---

I have made all the changes that you suggested (including rewriting the script 
that builds test cases in Python) and have uploaded the new version to the RB: 
https://reviews.apache.org/r/8104/



[jira] [Work started] (PIG-3015) Rewrite of AvroStorage

2012-11-15 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-3015 started by Joseph Adler.



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-11-15 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Status: Patch Available  (was: In Progress)

Here is a patch with a working implementation (plus new unit tests and a bash 
script to generate the test data files; just run the bash script in the 
test/org/apache/pig/builtin/avro directory to generate all the avro files 
needed for testing)



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-11-15 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Patch Info: Patch Available



[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-11-15 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015.patch

Here's the generated patch file.



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-11-13 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496730#comment-13496730
 ] 

Joseph Adler commented on PIG-3015:
---

Just TestAvroStorage, yes. I'm not trying to rewrite the whole test system, 
just clean up the AvroStorage tests. And yes, I'd want to either make an 
exception for corrupted Avro files or have a job that corrupts the files. 


[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-11-08 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493550#comment-13493550
 ] 

Joseph Adler commented on PIG-3015:
---

I hate breaking backwards compatibility. (One of the reasons for doing the 
rewrite is that Avro broke backwards compatibility.) But I think we have some 
good reasons to do so here:

- Options for AvroStorage are very different from options for other storage 
functions in Pig. In moving AvroStorage to builtin, it makes sense for 
AvroStorage to behave as closely as possible to PigStorage, etc.
- The huge number of crazy options makes the code slow and complicated.
- There are good workarounds for many changes in the options. For example, all 
the weird stuff about selecting a schema using an index could be easily changed 
to explicit schema definitions.
- It gets harder to make changes with time. This is probably the best 
opportunity to make the options simpler and clearer.



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-11-06 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491741#comment-13491741
 ] 

Joseph Adler commented on PIG-3015:
---

I put the code in o.a.impl.util. Not a big deal to move it later if that's the 
preferred style.



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-10-31 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487954#comment-13487954
 ] 

Joseph Adler commented on PIG-3015:
---

Before addressing the questions, I wanted to propose a naming scheme for the 
load and store functions. To be consistent with other Pig UDFs, I think it 
makes more sense to use different function names rather than passing different 
types of arguments to the UDF. Can I propose something like this:

LoadFuncs:

- AvroStorage. May be instantiated with zero, one, or two arguments. If called 
with no arguments, the function will load the schema from the most recent data 
file found in the specified path and use that schema. If called with one 
argument, the argument will be a String that specifies the input schema. The 
String may either contain the schema definition, may be a URI that refers to 
the location of the input schema in a file, or may be an example data file from 
which to read the schema. If two arguments are specified, the first argument 
refers to the type of the output records (the name of the type) and the second 
argument may be either a JSON string, a URI for a schema definition file, or a 
URI for an example file that contains the definition of that type.

 This function does not check schema compatibility of input files or allow 
recursive schema definitions. Fails when corrupted files are encountered.
- AvroStorage.AllowRecursive. Same as above, except this function does not 
check schema compatibility of input files but does allow recursive schema 
definitions. Recursively defined records are just defined as schemaless tuples 
in the Pig Schema.
- AvroStorage.IgnoreCorrupted. Same as above, except this function will not 
allow recursive schema definitions, but will not fail on corrupted input files.
- AvroStorage.AllowRecursiveAndIgnoreCorrupted. Same as above, except this 
function allows recursive definitions and does not fail on corrupted input 
files.
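
A minimal sketch of how the one-argument case above might tell the three kinds of String apart. Everything here is an assumption for illustration: the class name SchemaArgSketch, the rule that inline schemas start with a JSON delimiter, the ".avro" suffix test, and the example paths are hypothetical and are not taken from the AvroStorage code.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Pig or Avro.
final class SchemaArgSketch {

    /** The three things the single String argument might be, per the proposal. */
    enum Kind { INLINE_SCHEMA, SCHEMA_FILE_URI, EXAMPLE_DATA_FILE_URI }

    static Kind classify(String arg) {
        String s = arg.trim();
        // Avro schemas are JSON: an object, an array (a union), or a quoted type name.
        if (s.startsWith("{") || s.startsWith("[") || s.startsWith("\"")) {
            return Kind.INLINE_SCHEMA;
        }
        // Assumption: example data files are recognized by an ".avro" suffix.
        if (s.toLowerCase(Locale.ROOT).endsWith(".avro")) {
            return Kind.EXAMPLE_DATA_FILE_URI;
        }
        // Assumption: anything else is the location of a schema definition file.
        return Kind.SCHEMA_FILE_URI;
    }

    public static void main(String[] args) {
        // The paths below are made up for the example.
        System.out.println(classify("{\"type\": \"string\"}"));
        System.out.println(classify("hdfs:///schemas/user.avsc"));
        System.out.println(classify("hdfs:///data/part-00000.avro"));
    }
}
```

A suffix test is the cheapest way to separate a data file from a schema file, but a real implementation would more likely open the file and check for the Avro container header.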


StoreFunc:

- AvroStorage. May be instantiated with zero, one, or two arguments; the 
meaning of the arguments can be inferred from how they are specified. If called 
with no arguments, the function will translate the pig schema to an Avro 
schema, use a default name for the record types, and not assign a namespace to 
the records. If called with one argument, the argument will be a String that 
may specify the output schema, or may specify the record name for the output 
records. If the string specifies the schema, it may contain the schema 
definition, may be a URI that refers to the location of the schema in a file, 
or may be an example data file from which to reuse the schema. If two 
arguments are specified, they may 
refer to the name and namespace for the output records. Alternately, the first 
argument may refer to the type of the output records (the name of the schema), 
and the second argument may be either a JSON string, a URI for a schema 
definition file, or a URI for an example file that contains the definition of 
that type.
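
A minimal sketch of the zero-argument StoreFunc case described above: translate a flat Pig schema to an Avro record schema with a default record name and no namespace. The class name PigToAvroSketch, the default name "pig_output", and the choice to wrap every field in a union with null are assumptions for illustration, not the behavior of the actual implementation. Only simple Pig types are handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of Pig or Avro.
final class PigToAvroSketch {

    /** Maps a simple Pig type name to the corresponding Avro primitive type name. */
    static String avroType(String pigType) {
        switch (pigType) {
            case "int":       return "int";
            case "long":      return "long";
            case "float":     return "float";
            case "double":    return "double";
            case "boolean":   return "boolean";
            case "chararray": return "string";
            case "bytearray": return "bytes";
            default:
                throw new IllegalArgumentException("unsupported Pig type: " + pigType);
        }
    }

    /** Builds an Avro record schema (as JSON text) with no namespace. */
    static String recordSchema(String recordName, Map<String, String> pigFields) {
        StringJoiner fields = new StringJoiner(",");
        for (Map.Entry<String, String> f : pigFields.entrySet()) {
            // Assumption: Pig fields may be null, so each becomes a union with "null".
            fields.add("{\"name\":\"" + f.getKey() + "\",\"type\":[\"null\",\""
                    + avroType(f.getValue()) + "\"]}");
        }
        return "{\"type\":\"record\",\"name\":\"" + recordName
                + "\",\"fields\":[" + fields + "]}";
    }

    public static void main(String[] args) {
        Map<String, String> pigFields = new LinkedHashMap<>();
        pigFields.put("id", "long");
        pigFields.put("name", "chararray");
        // "pig_output" is a made-up default record name.
        System.out.println(recordSchema("pig_output", pigFields));
    }
}
```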


Answers to questions:

LoadFunc 1a: Yes, the storage function will convert avro schemas to pig 
schemas, and vice versa. 

I haven't tried to convert multiple compatible but different schemas to one 
pig schema. I believe that if you manually supply a schema to the function that 
is a superset of all the schemas in the input data, the underlying Avro 
libraries will take care of this for you... though this brings up another 
question: what does compatible mean in this case? Personally, I do not think 
that the core Pig library should attempt to resolve this problem for users; I 
think it is best for users to load files with different load functions, cast 
and rename fields as appropriate in pig code, then take a union of the values. 
It's possible to miss real (and important) errors if Pig does a lot of type 
conversions and manipulations under the covers.

LoadFunc 2: I think this is necessary for a few reasons: It's faster to supply 
a schema manually (the Pig run time doesn't have to read files from HDFS at 
planning time to detect the schema). By specifying the schema, you can also 
specify a subset of fields to de-serialize, reducing the size of the input 
data. Finally, by specifying a schema manually, you can read a set of files 
with compatible but different schemas.

I think PIG-2875 is a design mistake. If I had been involved in the project, I 
would have argued hard against this. You can't specify a recursive schema in 
Pig, so why allow users to load files with recursive schemas in Pig? It is 
possible to load recursively defined records into pig, but that seems like a 
recipe for confusion and errors. By default, recursive schema definitions 
should result in an error, or at least a warning message. I'd propose that this 
be allowed only as an option.
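
A minimal sketch of the check that would be needed to reject recursive schemas by default. It assumes the schema has already been reduced to a map from each record name to the record names its fields refer to; that representation and the class name RecursiveSchemaSketch are hypothetical and do not use the Avro API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Pig or Avro.
final class RecursiveSchemaSketch {

    /** Returns true if any record reachable from root refers back to a record on the same path. */
    static boolean isRecursive(String root, Map<String, List<String>> refs) {
        return onCycle(root, refs, new HashSet<String>());
    }

    private static boolean onCycle(String name, Map<String, List<String>> refs,
                                   Set<String> path) {
        if (!path.add(name)) {
            return true; // name is already on the current path: the definition is recursive
        }
        List<String> children = refs.containsKey(name)
                ? refs.get(name) : Collections.<String>emptyList();
        for (String child : children) {
            if (onCycle(child, refs, path)) {
                return true;
            }
        }
        path.remove(name); // backtrack so shared, non-recursive types are not flagged
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> refs = new HashMap<String, List<String>>();
        refs.put("Node", Arrays.asList("Node"));    // linked-list style record
        refs.put("User", Arrays.asList("Address")); // plain nesting
        System.out.println(isRecursive("Node", refs)); // true
        System.out.println(isRecursive("User", refs)); // false
    }
}
```

A loader could run a check like this at planning time and raise an error unless the user has explicitly opted in to recursive schemas.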

Storefunc 2a:

I don't think it's hard to specify those three options. It's probably OK for 
the StoreFunc

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-10-30 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487179#comment-13487179
 ] 

Joseph Adler commented on PIG-3015:
---

Just reading through the discussion on the user list.

I'll check out trunk, refactor/rename as needed, make sure it passes existing 
tests, fix bugs, then submit the patches. That will probably take me a few days 
to do.

Additionally, I'd like to get a few things correct the first time. 
Specifically, I'm trying to figure out how to deal with the plethora of 
possible options for load/store functions. I want to make sure that I cover all 
the important use cases regarding schemas. Here's the list that I came up with:

LoadFunc:
(1) Read the schema from the input file(s)
  (a) Just pick the schema from the most recent file
  (b) Check all the files to make sure the schemas are compatible
(2) Use a schema manually provided by the user

StoreFunc:
(1) Automatically translate the Pig schema to an Avro Schema
(2) Use a schema manually provided by the user
  (a) Allow the user to name the records and name space
  (b) Automatically pick a record and namespace name




[jira] [Created] (PIG-3015) Rewrite of AvroStorage

2012-10-29 Thread Joseph Adler (JIRA)
Joseph Adler created PIG-3015:
-

 Summary: Rewrite of AvroStorage
 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler


The current AvroStorage implementation has a lot of issues: it requires old 
versions of Avro, it copies data much more than needed, and it's verbose and 
complicated. (One pet peeve of mine is that old versions of Avro don't support 
Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new 
implementation is significantly faster, and the code is a lot simpler. 
Rewriting AvroStorage also enabled me to implement support for Trevni.

I'm opening this ticket to facilitate discussion while I figure out the best 
way to contribute the changes back to Apache.



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-10-29 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486489#comment-13486489
 ] 

Joseph Adler commented on PIG-3015:
---

Here's the working version: https://github.com/josephadler/fast-avro-storage

I can break that up into multiple Jira tickets, though that feels like a lot of 
extra work; I threw away all the existing code and started from scratch. I do 
think it's reasonable to separate AvroStorage and TrevniStorage for now (though 
they are very closely related).



Re: [VOTE] Release Pig 0.10.0 (candidate 0)

2012-04-24 Thread Joseph Adler
Can you guys please fix https://issues.apache.org/jira/browse/PIG-2266

Without that, I can guarantee that AvroStorage will fail on large files.

On Mon, Apr 23, 2012 at 8:07 PM, Joseph Adler joseph.ad...@me.com wrote:
 I will do it tomorrow on one of my workflows. Could take some trial and error 
 to get it working.

 -- Joe

 On Apr 23, 2012, at 6:53 PM, Russell Jurney russell.jur...@gmail.com wrote:

 Can someone from LinkedIn try this release candidate? It may break
 your AvroStorage, so that would be good to know.

 Russell Jurney http://datasyndrome.com

 On Apr 23, 2012, at 6:36 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 +1


 Verified several jobs using Elephant-Bird loaders.
 Tested correctness with pig.exec.mapPartAgg both true and false.
 Verified license.
 Verified release notes.
 Ran test-commit

 D

 On Sat, Apr 21, 2012 at 12:27 PM, Daniel Dai da...@hortonworks.com wrote:
 We should do sanity check of the package, such as unit tests, e2e
 tests, piggybank tests, package integrity, package signature, license,
 etc. However, if we find a new bug, usually we will push it to the
 next release at this stage unless it is a critical one.

 Thanks,
 Daniel

 On Sat, Apr 21, 2012 at 12:48 AM, Prashant Kommireddi
 prash1...@gmail.com wrote:
 Hi Daniel,

 What is required other than running the regular tests for testing release
 candidate? I can think of running a few existing scripts against candidate
 build and making sure outputs look fine.

 Thanks,
 Prashant

 On Fri, Apr 20, 2012 at 12:39 AM, Daniel Dai da...@hortonworks.com 
 wrote:

 Hi,

 I have created a candidate build for Pig 0.10.0.

 Keys used to sign the release are available at
 http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup.

 Please download, test, and try it out:

 http://people.apache.org/~daijy/pig-0.10.0-candidate-0/

 Should we release this? Vote closes on next Tuesday, Apr 24th.

 Daniel



Re: [VOTE] Release Pig 0.10.0 (candidate 0)

2012-04-23 Thread Joseph Adler
I will do it tomorrow on one of my workflows. Could take some trial and error 
to get it working. 

-- Joe

On Apr 23, 2012, at 6:53 PM, Russell Jurney russell.jur...@gmail.com wrote:

 Can someone from LinkedIn try this release candidate? It may break
 your AvroStorage, so that would be good to know.
 
 Russell Jurney http://datasyndrome.com
 
 On Apr 23, 2012, at 6:36 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
 
 +1
 
 
 Verified several jobs using Elephant-Bird loaders.
 Tested correctness with pig.exec.mapPartAgg both true and false.
 Verified license.
 Verified release notes.
 Ran test-commit
 
 D
 
 On Sat, Apr 21, 2012 at 12:27 PM, Daniel Dai da...@hortonworks.com wrote:
 We should do sanity check of the package, such as unit tests, e2e
 tests, piggybank tests, package integrity, package signature, license,
 etc. However, if we find a new bug, usually we will push it to the
 next release at this stage unless it is a critical one.
 
 Thanks,
 Daniel
 
 On Sat, Apr 21, 2012 at 12:48 AM, Prashant Kommireddi
 prash1...@gmail.com wrote:
 Hi Daniel,
 
 What is required other than running the regular tests for testing release
 candidate? I can think of running a few existing scripts against candidate
 build and making sure outputs look fine.
 
 Thanks,
 Prashant
 
 On Fri, Apr 20, 2012 at 12:39 AM, Daniel Dai da...@hortonworks.com wrote:
 
 Hi,
 
 I have created a candidate build for Pig 0.10.0.
 
 Keys used to sign the release are available at
 http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup.
 
 Please download, test, and try it out:
 
 http://people.apache.org/~daijy/pig-0.10.0-candidate-0/
 
 Should we release this? Vote closes on next Tuesday, Apr 24th.
 
 Daniel
 


[jira] [Created] (PIG-2378) macros don't accept references to items within tuples as arguments

2011-11-16 Thread Joseph Adler (Created) (JIRA)
macros don't accept references to items within tuples as arguments
--

 Key: PIG-2378
 URL: https://issues.apache.org/jira/browse/PIG-2378
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.9.1
Reporter: Joseph Adler


I'd like to be able to pass a reference to an item within a parameter to a Pig 
Macro.

For example, suppose that I had a relation A with the schema A:{id:long, 
header:(time:long, type:chararray)}. I'd like to call a macro by typing:

   B = MY_MACRO(A, header.time);

but this does not currently work. Obviously, I could define a new relation as a 
workaround; for example, I could use some Pig code like 

  AA = FOREACH A GENERATE *, header.time AS time;
  B = MY_MACRO(AA, time);

But that's ugly and clunky.





[jira] [Commented] (PIG-2266) bug with input file joining optimization in Pig

2011-09-06 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098408#comment-13098408
 ] 

Joseph Adler commented on PIG-2266:
---

Index: MRCompiler.java
===
--- MRCompiler.java (revision 1165764)
+++ MRCompiler.java (working copy)
@@ -1353,7 +1353,8 @@
         .instantiateFuncFromSpec(ld.getLFile()
         .getFuncSpec());
     Job job = new Job(conf);
-    loader.setLocation(location, job);
+    loader.setUDFContextSignature(ld.getSignature());
+    loader.setLocation(location, job);
     InputFormat inf = loader.getInputFormat();
     List<InputSplit> splits = inf.getSplits(HadoopShims.cloneJobContext(job));
     List<List<InputSplit>> results = MapRedUtil


 bug with input file joining optimization in Pig
 ---

 Key: PIG-2266
 URL: https://issues.apache.org/jira/browse/PIG-2266
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.9.0
Reporter: Joseph Adler

 In 
 src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java,
  the function hasTooManyInputFiles instantiates a LoadFunc instance, then 
 calls setLocation before calling setUDFContextSignature. This is inconsistent 
 with the documentation for the LoadFunc interface (see 
 http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/LoadFunc.html#setUDFContextSignature(java.lang.String)).
  (We've written UDFs that assume that setUDFContextSignature is called first.)
 I think you can fix this by adding 
loader.setUDFContextSignature(ld.getSignature());
 before
loader.setLocation(location, job);
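
 A minimal sketch of why the call order matters for a loader that keys its state on the UDF context signature. SignatureKeyedLoader is a hypothetical stand-in, not the real LoadFunc class: its setLocation omits the Job parameter and it stores state in a plain map, both simplifications made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not Pig classes.
final class CallOrderSketch {

    /** A loader that stores per-instance state under its UDF context signature. */
    static final class SignatureKeyedLoader {
        private final Map<String, String> udfContext = new HashMap<>();
        private String signature; // null until setUDFContextSignature is called

        void setUDFContextSignature(String signature) {
            this.signature = signature;
        }

        // Simplified: the real LoadFunc.setLocation also takes a Job argument.
        void setLocation(String location) {
            if (signature == null) {
                throw new IllegalStateException(
                        "setLocation called before setUDFContextSignature");
            }
            udfContext.put(signature, location);
        }

        String locationFor(String sig) {
            return udfContext.get(sig);
        }
    }

    /** Reproduces the order reported in the bug: location first, no signature yet. */
    static boolean failsWhenLocationSetFirst() {
        try {
            new SignatureKeyedLoader().setLocation("/data/in");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Order used in the proposed fix: signature first, then location.
        SignatureKeyedLoader loader = new SignatureKeyedLoader();
        loader.setUDFContextSignature("sig-1");
        loader.setLocation("/data/in");
        System.out.println(loader.locationFor("sig-1"));  // /data/in
        System.out.println(failsWhenLocationSetFirst());  // true
    }
}
```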
