from:"justin coffey"

Justin Coffey created HIVE-6994:
---

 Summary: parquet-hive createArray strips null elements
 Key: HIVE-6994
 URL: https://issues.apache.org/jira/browse/HIVE-6994
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0, 0.14.0
Reporter: Justin Coffey
Assignee: Justin Coffey
 Fix For: 0.14.0


The createArray method in ParquetHiveSerDe strips null values from resultant 
ArrayWritables.
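
A minimal sketch of the failure mode, assuming the bug has the shape of a copy loop that skips null elements. Plain `Object[]` stands in for `ArrayWritable`, and the class and method names are hypothetical; none of this is taken from the HIVE-6994 patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; Object[] stands in for ArrayWritable.
class NullPreservingArraySketch {

  // Buggy shape: null elements are dropped, so later positions shift left.
  static Object[] copyDroppingNulls(List<?> source) {
    List<Object> out = new ArrayList<>();
    for (Object element : source) {
      if (element != null) {
        out.add(element);
      }
    }
    return out.toArray();
  }

  // Fixed shape: every position is kept, including the null ones.
  static Object[] copyKeepingNulls(List<?> source) {
    Object[] out = new Object[source.size()];
    for (int i = 0; i < out.length; i++) {
      out[i] = source.get(i);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a", null, "c");
    System.out.println(copyDroppingNulls(input).length); // 2
    System.out.println(copyKeepingNulls(input).length);  // 3
  }
}
```

Dropping nulls changes both the length and the index of every later element, which is why a reader of the array can no longer line values up with their original positions.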



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (HIVE-6994) parquet-hive createArray strips null elements


 [ 
https://issues.apache.org/jira/browse/HIVE-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6994:


Status: Patch Available  (was: Open)

This patch fixes the issue in ParquetHiveSerDe, but there may be an underlying 
issue in parquet (this is still under investigation).

 parquet-hive createArray strips null elements
 -

 Key: HIVE-6994
 URL: https://issues.apache.org/jira/browse/HIVE-6994
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0, 0.14.0
Reporter: Justin Coffey
Assignee: Justin Coffey
 Fix For: 0.14.0

 Attachments: HIVE-6994.patch


 The createArray method in ParquetHiveSerDe strips null values from resultant 
 ArrayWritables.
 tracked here as well: https://github.com/Parquet/parquet-mr/issues/377




[jira] [Updated] (HIVE-6994) parquet-hive createArray strips null elements


 [ 
https://issues.apache.org/jira/browse/HIVE-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6994:


Attachment: HIVE-6994.patch


[jira] [Updated] (HIVE-6994) parquet-hive createArray strips null elements


 [ 
https://issues.apache.org/jira/browse/HIVE-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6994:


Description: 
The createArray method in ParquetHiveSerDe strips null values from resultant 
ArrayWritables.

tracked here as well: https://github.com/Parquet/parquet-mr/issues/377

  was:The createArray method in ParquetHiveSerDe strips null values from 
resultant ArrayWritables.



Review Request 20899: HIVE-6994 - parquet-hive createArray strips null elements

2014-04-30 Thread justin coffey


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20899/
---

Review request for hive.


Repository: hive-git


Description
---

- Fix for bug in createArray() that strips null elements.
- In the process refactored serde for simplification purposes.
- Refactored tests for better regression testing.


Diffs
-

  data/files/parquet_create.txt ccd48ee 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 
b689336 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 
be518b9 
  
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetHiveSerDe.java
 PRE-CREATION 
  ql/src/test/queries/clientpositive/parquet_create.q 0b976bd 
  ql/src/test/results/clientpositive/parquet_create.q.out 3220be5 

Diff: https://reviews.apache.org/r/20899/diff/


Testing
---


Thanks,

justin coffey

[jira] [Updated] (HIVE-6920) Parquet Serde Simplification


 [ 
https://issues.apache.org/jira/browse/HIVE-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6920:


Status: Open  (was: Patch Available)

Please see the superseding issue here: #HIVE-6994

 Parquet Serde Simplification
 

 Key: HIVE-6920
 URL: https://issues.apache.org/jira/browse/HIVE-6920
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Fix For: 0.14.0

 Attachments: HIVE-6920.patch


 Various fixes and code simplification in the ParquetHiveSerde (with minor 
 optimizations)




[jira] [Commented] (HIVE-6994) parquet-hive createArray strips null elements


[ 
https://issues.apache.org/jira/browse/HIVE-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985772#comment-13985772
 ] 

Justin Coffey commented on HIVE-6994:
-

review board link: https://reviews.apache.org/r/20899/


[jira] [Commented] (HIVE-6920) Parquet Serde Simplification


[ 
https://issues.apache.org/jira/browse/HIVE-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985777#comment-13985777
 ] 

Justin Coffey commented on HIVE-6920:
-

btw, in the superseding patch, I killed the pom bump to 1.4.1, but I neglected 
to remove my spurious comments about serde stats :).  I'm late out the door 
right now, but if you have a chance to check the new patch and review board 
request, I can clean that up before a final commit :D.


[jira] [Commented] (HIVE-6920) Parquet Serde Simplification

2014-04-29 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984183#comment-13984183
 ] 

Justin Coffey commented on HIVE-6920:
-

bump?

I'd like to build off of this for a bug fix that I need to submit.


Review Request 20710: HIVE-6920 - Parquet Serde Simplification

2014-04-25 Thread justin coffey


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20710/
---

Review request for hive.


Repository: hive-git


Description
---

Refactoring for simplification of the parquet-hive serde.


Diffs
-

  pom.xml 426dca8 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 
b689336 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 
be518b9 
  
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetHiveSerDe.java
 PRE-CREATION 

Diff: https://reviews.apache.org/r/20710/diff/


Testing
---


Thanks,

justin coffey

[jira] [Commented] (HIVE-6920) Parquet Serde Simplification

2014-04-25 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981072#comment-13981072
 ] 

Justin Coffey commented on HIVE-6920:
-

It's actually mostly just code reduction. Here's the RB link: 
https://reviews.apache.org/r/20710/

thanks :)


[jira] [Commented] (HIVE-6920) Parquet Serde Simplification

2014-04-17 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972790#comment-13972790
 ] 

Justin Coffey commented on HIVE-6920:
-

cc: [~brocknoland] [~xuefuz]


[jira] [Created] (HIVE-6920) Parquet Serde Simplification

Justin Coffey created HIVE-6920:
---

 Summary: Parquet Serde Simplification
 Key: HIVE-6920
 URL: https://issues.apache.org/jira/browse/HIVE-6920
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Fix For: 0.14.0


Various fixes and code simplification in the ParquetHiveSerde (with minor 
optimizations)




[jira] [Updated] (HIVE-6920) Parquet Serde Simplification


 [ 
https://issues.apache.org/jira/browse/HIVE-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6920:


Attachment: HIVE-6920.patch


[jira] [Updated] (HIVE-6920) Parquet Serde Simplification


 [ 
https://issues.apache.org/jira/browse/HIVE-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6920:


Release Note: 
- Removed unused serde stats
- Simplified initialize code
- Renamed test class to match serde class name
- Separated serialize and deserialize tests
  Status: Patch Available  (was: Open)


[jira] [Updated] (HIVE-6920) Parquet Serde Simplification


 [ 
https://issues.apache.org/jira/browse/HIVE-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6920:


Release Note: 
- Removed unused serde stats
- Simplified initialize code
- Renamed test class to match serde class name
- Separated serialize and deserialize tests
- Bumped Parquet version to 1.4.1

  was:
- Removed unused serde stats
- Simplified initialize code
- Renamed test class to match serde class name
- Separated serialize and deserialize tests



[jira] [Commented] (HIVE-6784) parquet-hive should allow column type change

2014-04-11 Thread Justin Coffey (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13966341#comment-13966341
]

Justin Coffey commented on HIVE-6784:
-

You've cited a lazy serde. Parquet is not lazy. It is similar to ORC.

Have a look at ORC's deserialize() method
(org.apache.hadoop.hive.ql.io.orc.OrcSerde):
{code}
@Override
public Object deserialize(Writable writable) throws SerDeException {
  return writable;
}
{code}

A quick look through ORC code indicates to me that they don't do any reparsing
(though I might have missed something).

Looking through other serdes, not a single one (that I checked) reparses
values. Value parsing is handled in ObjectInspectors (poke around
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils).

In my opinion, the *substantial* performance penalty that you are introducing
with this patch is going to be a much bigger negative to adopting parquet than
obliging people to rebuild their data set in the rare event that you have to
change a type.

And if you do need to change a type, insert overwrite table is a good
workaround.

-1

parquet-hive should allow column type change

Key: HIVE-6784
URL: https://issues.apache.org/jira/browse/HIVE-6784
Project: Hive
Issue Type: Bug
Components: File Formats, Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Tongjie Chen
Fix For: 0.14.0

Attachments: HIVE-6784.1.patch.txt, HIVE-6784.2.patch.txt

see also in the following parquet issue:
https://github.com/Parquet/parquet-mr/issues/323
Currently, if we change parquet format hive table using alter table
parquet_table change c1 c1 bigint ( assuming original type of c1 is int),
it will result in exception thrown from SerDe:
org.apache.hadoop.io.IntWritable cannot be cast to
org.apache.hadoop.io.LongWritable in query runtime.
This is different behavior from hive (using other file format), where it will
try to perform cast (null value in case of incompatible type).
Parquet Hive's RecordReader returns an ArrayWritable (based on schema stored
in footers of parquet files); ParquetHiveSerDe also creates an corresponding
ArrayWritableObjectInspector (but using column type info from metastore).
Whenever there is column type change, the objector inspector will throw
exception, since WritableLongObjectInspector cannot inspect an IntWritable
etc...
Conversion has to happen somewhere if we want to allow type change. SerDe's
deserialize method seems a natural place for it.
Currently, serialize method calls createStruct (then createPrimitive) for
every record, but it creates a new object regardless, which seems expensive.
I think that could be optimized a bit by just returning the object passed if
already of the right type. deserialize also reuse this method, if there is a
type change, there will be new object to be created, which I think is
inevitable.


[jira] [Commented] (HIVE-6784) parquet-hive should allow column type change

2014-04-10 Thread Justin Coffey (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13965279#comment-13965279
]

Justin Coffey commented on HIVE-6784:
-

-1 on this patch.

Looping over the ArrayWritable in deserialize() will cause a performance penalty
at read time, and running parseXxx(obj.toString()) in the event of a type
mismatch is also painful.

Changing the type of a column is a rare event; we shouldn't write code that
causes performance penalties to handle it. Users should recreate the table with
the new type and load it from the old table, casting and converting as
appropriate in their query.
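
To make the cost argument concrete, here is a rough sketch contrasting a pass-through deserialize with one that loops over every field and reparses mismatched values. `Object[]` stands in for the row's `ArrayWritable`, and the class and method names are hypothetical; this is not the code from the HIVE-6784 patch.

```java
// Hypothetical illustration; Object[] stands in for a row's ArrayWritable.
class DeserializeCostSketch {

  // Pass-through: constant work per row and no allocation.
  static Object[] passThrough(Object[] row) {
    return row;
  }

  // Converting: allocates a new row, visits every field, and reparses
  // through toString() whenever a value is not already a Long.
  static Object[] convertToLong(Object[] row) {
    Object[] out = new Object[row.length];
    for (int i = 0; i < row.length; i++) {
      Object value = row[i];
      if (value == null || value instanceof Long) {
        out[i] = value;
      } else {
        out[i] = Long.parseLong(value.toString());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Object[] row = {1, 2L, null};
    System.out.println(passThrough(row) == row);               // true
    System.out.println(convertToLong(row)[0] instanceof Long); // true
  }
}
```

The second method does work proportional to the number of fields on every row, even when no column type has changed, which is the read-time penalty being objected to.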

parquet-hive should allow column type change

Attachments: HIVE-6784.1.patch.txt, HIVE-6784.2.patch.txt


[jira] [Commented] (HIVE-6757) Remove deprecated parquet classes from outside of org.apache package

2014-04-07 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13961892#comment-13961892
 ] 

Justin Coffey commented on HIVE-6757:
-

much appreciated Harish!

 Remove deprecated parquet classes from outside of org.apache package
 

 Key: HIVE-6757
 URL: https://issues.apache.org/jira/browse/HIVE-6757
 Project: Hive
  Issue Type: Bug
Reporter: Owen O'Malley
Assignee: Owen O'Malley
Priority: Blocker
 Fix For: 0.13.0

 Attachments: HIVE-6757.2.patch, HIVE-6757.patch, parquet-hive.patch


 Apache shouldn't release projects with files outside of the org.apache 
 namespace.




[jira] [Commented] (HIVE-6757) Remove deprecated parquet classes from outside of org.apache package

2014-04-03 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13959178#comment-13959178
 ] 

Justin Coffey commented on HIVE-6757:
-

I find that to be an acceptable compromise.

consensus :).


[jira] [Commented] (HIVE-6757) Remove deprecated parquet classes from outside of org.apache package

2014-03-31 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13955121#comment-13955121
 ] 

Justin Coffey commented on HIVE-6757:
-

I can +1 [~brocknoland]'s solution if that flies for everyone else.  Actually, 
we joked about this in one of our review sessions here thinking that it was a 
bit of a brute force solution, but if this works for everyone it works for us 
(FYI, for one table we expect to have 47K partitions to update).


[jira] [Commented] (HIVE-6757) Remove deprecated parquet classes from outside of org.apache package

2014-03-28 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950531#comment-13950531
 ] 

Justin Coffey commented on HIVE-6757:
-

Owen, the solution you're proposing means that there is no seamless upgrade path 
for existing parquet-hive users and that somewhere on the hive wiki there will 
have to be a callout: "Attention existing parquet users: you must include the 
parquet-hive.jar when upgrading to hive 13.  We're sorry, but this is the price 
you have to pay for being an early adopter and driving functionality."

One of the goals of the #HIVE-5783 patch was to make the lives of parquet users 
easier (there were of course many other reasons, but ease of use is a good goal 
in and of itself).  The classes as they are do no harm and it's hard to see how 
they pollute the code base of Hive in any significant way.  This patch kinda 
sorta seems a tiny bit punitive if you ask me.

Please don't take any of this the wrong way, but I believe this is what a fair 
chunk of the parquet-hive community might think if this patch is committed.


[jira] [Commented] (HIVE-6757) Remove deprecated parquet classes from outside of org.apache package

2014-03-28 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950996#comment-13950996
 ] 

Justin Coffey commented on HIVE-6757:
-

I guess my point is simply that early adopters are penalized for life whereas 
new users get the full benefit of the patch.  I agree that the penalty is 
pretty small, but the two classes kicking around in the parquet package are 
even less of a penalty to the hive code base.  Thus I remain against pulling 
them out.


Re: Review Request 18925: HIVE-6575 select * fails on parquet table with map datatype

2014-03-10 Thread justin coffey


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18925/#review36644
---

Ship it!


go for r3 with the getClass (and no instanceof) check and {} formatting.

- justin coffey


On March 8, 2014, 12:01 a.m., Szehon Ho wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/18925/
 ---
 
 (Updated March 8, 2014, 12:01 a.m.)
 
 
 Review request for hive, Brock Noland, justin coffey, and Xuefu Zhang.
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 The issue is, as part of select * query, a DeepParquetHiveMapInspector is 
 used for one column of an overall parquet-table struct object inspector.  
 
 The problem lies in the ObjectInspectorFactory's cache for struct object 
 inspectors.  For performance, there is a cache keyed on an array list of all 
 object inspectors of columns.  The second time the query is run, it attempts 
 to look up the cached struct inspector.  But when the hashmap looks up the part 
 of the key consisting of the DeepParquetHiveMapInspector, java calls .equals 
 against the existing DeepParquetHiveMapInspector.  This fails, as the .equals 
 method cast the other to a StandardParquetHiveInspector.
 
 Regenerating the .equals and .hashcode from eclipse.  
 
 Also adding one more check in .equals before casting, to handle the case if 
 another class of object inspector gets hashed to the same hashcode in the 
 cache.  Then java would call .equals against the other, which in this case is 
 not of the same class.
 
 
 Diffs
 -
 
   
 ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/AbstractParquetMapInspector.java
  1d72747 
 
 Diff: https://reviews.apache.org/r/18925/diff/
 
 
 Testing
 ---
 
 Manual testing.
 
 
 Thanks,
 
 Szehon Ho
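
The getClass (rather than instanceof) check discussed in this review can be sketched as follows. The classes here are hypothetical stand-ins with made-up fields, not the Hive inspector classes, and the code is not taken from the patch under review.

```java
import java.util.Objects;

// Hypothetical stand-in for a shared abstract inspector base class.
abstract class BaseInspectorSketch {
  private final String keyType;
  private final String valueType;

  BaseInspectorSketch(String keyType, String valueType) {
    this.keyType = keyType;
    this.valueType = valueType;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    // getClass() instead of instanceof: sibling subclasses never compare
    // equal, and the cast below cannot fail.
    if (obj == null || getClass() != obj.getClass()) {
      return false;
    }
    BaseInspectorSketch other = (BaseInspectorSketch) obj;
    return keyType.equals(other.keyType) && valueType.equals(other.valueType);
  }

  @Override
  public int hashCode() {
    return Objects.hash(keyType, valueType);
  }
}

class DeepInspectorSketch extends BaseInspectorSketch {
  DeepInspectorSketch(String keyType, String valueType) {
    super(keyType, valueType);
  }
}

class StandardInspectorSketch extends BaseInspectorSketch {
  StandardInspectorSketch(String keyType, String valueType) {
    super(keyType, valueType);
  }
}

class InspectorEqualsDemo {
  public static void main(String[] args) {
    Object deep = new DeepInspectorSketch("string", "int");
    Object standard = new StandardInspectorSketch("string", "int");
    System.out.println(deep.equals(new DeepInspectorSketch("string", "int"))); // true
    System.out.println(deep.equals(standard));                                 // false
  }
}
```

The two subclasses deliberately share a hash code here, so a hash-based cache will call equals across classes; the class check is what keeps that call from throwing or returning a false match.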

[jira] [Commented] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-03-07 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924095#comment-13924095
 ] 

Justin Coffey commented on HIVE-6414:
-

hello, I don't think these are related to the patch, so resubmitting for 
retesting.

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Remus Rusanu
Assignee: Justin Coffey
  Labels: Parquet
 Fix For: 0.13.0

 Attachments: HIVE-6414.2.patch, HIVE-6414.3.patch, HIVE-6414.3.patch, 
 HIVE-6414.patch


 While working on HIVE-5998 I noticed that the ParquetRecordReader returns 
 IntWritable for all 'int like' types, in disaccord with the row object 
 inspectors. I thought fine, and I worked my way around it. But I see now that 
 the issue triggers failures in other places, e.g. in aggregates:
 {noformat}
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
 Error while processing row 
 {cint:528534767,ctinyint:31,csmallint:4963,cfloat:31.0,cdouble:4963.0,cstring1:cvLH6Eat2yFsyy7p}
 at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
 at 
 org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
 ... 8 more
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
 java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast 
 to java.lang.Short
 at 
 org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:808)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
 at 
 org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:87)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
 at 
 org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:92)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
 at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:524)
 ... 9 more
 Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable 
 cannot be cast to java.lang.Short
 at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaShortObjectInspector.get(JavaShortObjectInspector.java:41)
 at 
 org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:671)
 at 
 org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:631)
 at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMin$GenericUDAFMinEvaluator.merge(GenericUDAFMin.java:109)
 at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMin$GenericUDAFMinEvaluator.iterate(GenericUDAFMin.java:96)
 at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:183)
 at 
 org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:641)
 at 
 org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:838)
 at 
 org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:735)
 at 
 org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:803)
 ... 15 more
 {noformat}
 My test is (I'm writing a test .q from HIVE-5998, but the repro does not 
 involve vectorization):
 {noformat}
 create table if not exists alltypes_parquet (
   cint int,
   ctinyint tinyint,
   csmallint smallint,
   cfloat float,
   cdouble double,
   cstring1 string) stored as parquet;
 insert overwrite table alltypes_parquet
   select cint,
     ctinyint,
     csmallint,
     cfloat,
     cdouble,
     cstring1
   from alltypesorc;
 explain select * from alltypes_parquet limit 10;
 select * from alltypes_parquet limit 10;
 explain select ctinyint,
   max(cint),
   min(csmallint),
   count(cstring1),
   avg(cfloat),
   stddev_pop(cdouble)
   from alltypes_parquet
   group by ctinyint;
 select ctinyint,
   max(cint),
   min(csmallint),
   count(cstring1),
   avg(cfloat),
   stddev_pop(cdouble)
   from alltypes_parquet
   group by ctinyint;
 {noformat}
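
The ClassCastException in the trace above follows a general Java pattern that can be reproduced with boxed types. In this sketch `Integer` plays the role of the `IntWritable` handed back by the reader and `Short` the type the inspector expects; the class and method names are hypothetical, and the narrowing variant is only one possible way to reconcile the two, not a description of the eventual fix.

```java
// Hypothetical illustration using boxed Java types in place of Writables.
class IntLikeCastSketch {

  // Mirrors the failing pattern: a blind cast to the declared type.
  static short blindCast(Object value) {
    return (Short) value;
  }

  // Narrowing through Number works for any boxed numeric type.
  static short narrow(Object value) {
    return ((Number) value).shortValue();
  }

  public static void main(String[] args) {
    Object stored = Integer.valueOf(4963); // the reader handed back an int
    System.out.println(narrow(stored));    // 4963
    try {
      blindCast(stored);
    } catch (ClassCastException e) {
      System.out.println("cast failed");
    }
  }
}
```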




[jira] [Updated] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-03-07 Thread Justin Coffey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6414:


Attachment: HIVE-6414.3.patch

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Remus Rusanu
Assignee: Justin Coffey
  Labels: Parquet
 Fix For: 0.13.0

 Attachments: HIVE-6414.2.patch, HIVE-6414.3.patch, HIVE-6414.3.patch, 
 HIVE-6414.3.patch, HIVE-6414.patch


 While working on HIVE-5998 I noticed that the ParquetRecordReader returns 
 IntWritable for all 'int like' types, in disaccord with the row object 
 inspectors. I though fine, and I worked my way around it. But I see now that 
 the issue trigger failuers in other places, eg. in aggregates:
 {noformat}
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {cint:528534767,ctinyint:31,csmallint:4963,cfloat:31.0,cdouble:4963.0,cstring1:cvLH6Eat2yFsyy7p}
     at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
     at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
     ... 8 more
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to java.lang.Short
     at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:808)
     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
     at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:87)
     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
     at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:92)
     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
     at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:524)
     ... 9 more
 Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to java.lang.Short
     at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaShortObjectInspector.get(JavaShortObjectInspector.java:41)
     at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:671)
     at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:631)
     at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMin$GenericUDAFMinEvaluator.merge(GenericUDAFMin.java:109)
     at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMin$GenericUDAFMinEvaluator.iterate(GenericUDAFMin.java:96)
     at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:183)
     at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:641)
     at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:838)
     at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:735)
     at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:803)
     ... 15 more
 {noformat}
 My test is (I'm writing a test .q from HIVE-5998, but the repro does not 
 involve vectorization):
 {noformat}
 create table if not exists alltypes_parquet (
   cint int,
   ctinyint tinyint,
   csmallint smallint,
   cfloat float,
   cdouble double,
   cstring1 string) stored as parquet;
 insert overwrite table alltypes_parquet
   select cint,
     ctinyint,
     csmallint,
     cfloat,
     cdouble,
     cstring1
   from alltypesorc;
 explain select * from alltypes_parquet limit 10;
 select * from alltypes_parquet limit 10;
 explain select ctinyint,
   max(cint),
   min(csmallint),
   count(cstring1),
   avg(cfloat),
   stddev_pop(cdouble)
   from alltypes_parquet
   group by ctinyint;
 select ctinyint,
   max(cint),
   min(csmallint),
   count(cstring1),
   avg(cfloat),
   stddev_pop(cdouble)
   from alltypes_parquet
   group by ctinyint;
 {noformat}
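
 The failure mode in the trace above can be reproduced in miniature without 
 Hive on the classpath. The sketch below is only an illustration: the class 
 InspectorMismatchSketch and its methods are hypothetical stand-ins, not Hive 
 code. getShort plays the role of an inspector that expects a java.lang.Short, 
 and the Integer plays the role of the int-typed value handed back for a 
 smallint column.

```java
// Hypothetical stand-in classes; none of these names exist in Hive.
class InspectorMismatchSketch {

    /** Mimics an inspector that assumes the row value is a java.lang.Short. */
    static short getShort(Object o) {
        return ((Short) o).shortValue(); // throws ClassCastException for any other type
    }

    /** Returns true when handing the value to the short inspector fails with a ClassCastException. */
    static boolean failsWithClassCast(Object value) {
        try {
            getShort(value);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The reader returns an int-typed object for a smallint column.
        Object fromReader = Integer.valueOf(4963);
        System.out.println(failsWithClassCast(fromReader));                  // prints true
        // A value whose type matches the inspector is read without error.
        System.out.println(failsWithClassCast(Short.valueOf((short) 4963))); // prints false
    }
}
```

 The point of the sketch is that the cast only fails when the value is 
 actually inspected, which is consistent with the error surfacing in an 
 aggregate rather than in a plain select.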



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-02-27 Thread Justin Coffey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6414:


Attachment: HIVE-6414.3.patch

Update patch based on comments from Xuefu.

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Remus Rusanu
Assignee: Justin Coffey
  Labels: Parquet
 Fix For: 0.13.0

 Attachments: HIVE-6414.2.patch, HIVE-6414.3.patch, HIVE-6414.patch





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-02-26 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13912643#comment-13912643
 ] 

Justin Coffey commented on HIVE-6414:
-

[~xuefuz] OK, I will recheck the qtest and resubmit with nulls instead of 
exceptions.  I wasn't sure what the behavior should be in the case of an overflow.
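
For reference, a minimal sketch of the "nulls and not exceptions" behavior 
discussed here, assuming the stored value arrives as a 32-bit int and has to be 
narrowed to the declared smallint or tinyint type. The class NarrowingSketch 
and its method names are hypothetical and are not taken from the patch.

```java
// Hypothetical helper, not part of the HIVE-6414 patch.
class NarrowingSketch {

    /** Narrows an int to Short, returning null instead of throwing when the value overflows. */
    static Short toShortOrNull(int v) {
        return (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) ? Short.valueOf((short) v) : null;
    }

    /** Narrows an int to Byte, returning null instead of throwing when the value overflows. */
    static Byte toByteOrNull(int v) {
        return (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) ? Byte.valueOf((byte) v) : null;
    }

    public static void main(String[] args) {
        System.out.println(toShortOrNull(4963));      // prints 4963
        System.out.println(toShortOrNull(528534767)); // prints null (out of smallint range)
        System.out.println(toByteOrNull(31));         // prints 31
        System.out.println(toByteOrNull(128));        // prints null (out of tinyint range)
    }
}
```

Returning null keeps a single out-of-range value from failing the whole 
query, at the cost of hiding the overflow from the caller.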

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Remus Rusanu
Assignee: Justin Coffey
  Labels: Parquet
 Fix For: 0.13.0

 Attachments: HIVE-6414.2.patch, HIVE-6414.patch





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-02-25 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911585#comment-13911585
 ] 

Justin Coffey commented on HIVE-6414:
-

Hi Szehon, I worked off of the trunk on this.  The patch applies cleanly to the 
latest commit and unit tests pass, but our qtest fails after the commit for 
#HIVE-5958.  The qtests for parquet_create.q work just fine, though.

We're digging into it.

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Remus Rusanu
Assignee: Justin Coffey
  Labels: Parquet
 Fix For: 0.13.0

 Attachments: HIVE-6414.patch





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-02-25 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911587#comment-13911587
 ] 

Justin Coffey commented on HIVE-6414:
-

Oh, and we don't appear to need the order by for deterministic tests, but I 
have added it and will submit an updated patch with it (once we have gotten to 
the bottom of these failures).

btw are your qtests passing in #HIVE-6477?

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Remus Rusanu
Assignee: Justin Coffey
  Labels: Parquet
 Fix For: 0.13.0

 Attachments: HIVE-6414.patch





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-02-25 Thread Justin Coffey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6414:


Attachment: HIVE-6414.2.patch

Updated patch with working unit and qtests applicable to trunk commit: 
6010e22bd24d5004990c63f0aeb232d75693dd94 (#HIVE-5954)

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Remus Rusanu
Assignee: Justin Coffey
  Labels: Parquet
 Fix For: 0.13.0

 Attachments: HIVE-6414.2.patch, HIVE-6414.patch





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-02-24 Thread Justin Coffey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6414:


Fix Version/s: 0.13.0
Affects Version/s: 0.13.0
   Status: Patch Available  (was: Open)

The patch was developed against this commit: 
b05004a863b09cbe5f4b734c5474092f328f0c41

Unit tests and qtests run fine against this commit.

The latest commit (as of today), 1a3608d8b1f8cf41e9ba2fc7e9bacdecf271bb92, 
appears to have broken qtests (none will run), so I can't verify the 
patch-specific qtest.  Unit tests, however, execute without error.

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Remus Rusanu
Assignee: Justin Coffey
  Labels: Parquet
 Fix For: 0.13.0





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-02-24 Thread Justin Coffey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-6414:


Attachment: HIVE-6414.patch

Credit should be given to Remy Pecqueur r.pecqu...@criteo.com

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.13.0
Reporter: Remus Rusanu
Assignee: Justin Coffey
  Labels: Parquet
 Fix For: 0.13.0

 Attachments: HIVE-6414.patch





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (HIVE-6456) Implement Parquet schema evolution

2014-02-19 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905450#comment-13905450
 ] 

Justin Coffey commented on HIVE-6456:
-

Brock and I had the same thought offline. I'm not sure what the protocol is here:
should I open a separate ticket?

 Implement Parquet schema evolution
 --

 Key: HIVE-6456
 URL: https://issues.apache.org/jira/browse/HIVE-6456
 Project: Hive
  Issue Type: Improvement
Reporter: Brock Noland
Assignee: Brock Noland
Priority: Trivial
 Attachments: HIVE-6456.patch


 In HIVE-5783 we removed schema evolution:
 https://github.com/Parquet/parquet-mr/pull/297/files#r9824155




[jira] [Created] (HIVE-6463) unit test for evolving schema in parquet files

2014-02-19 Thread Justin Coffey (JIRA)

Justin Coffey created HIVE-6463:
---

 Summary: unit test for evolving schema in parquet files
 Key: HIVE-6463
 URL: https://issues.apache.org/jira/browse/HIVE-6463
 Project: Hive
  Issue Type: Test
Reporter: Justin Coffey
Assignee: Justin Coffey


Unit test(s) for patch found in #HIVE-6456




[jira] [Commented] (HIVE-6456) Implement Parquet schema evolution

2014-02-19 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905459#comment-13905459
 ] 

Justin Coffey commented on HIVE-6456:
-

Done and linked.

 Implement Parquet schema evolution
 --

 Key: HIVE-6456
 URL: https://issues.apache.org/jira/browse/HIVE-6456
 Project: Hive
  Issue Type: Improvement
Reporter: Brock Noland
Assignee: Brock Noland
Priority: Trivial
 Attachments: HIVE-6456.patch


 In HIVE-5783 we removed schema evolution:
 https://github.com/Parquet/parquet-mr/pull/297/files#r9824155




[jira] [Commented] (HIVE-6456) Improve Parquet schema evolution

2014-02-18 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904423#comment-13904423
 ] 

Justin Coffey commented on HIVE-6456:
-

Good to go. Thanks for the fast work!

 Improve Parquet schema evolution
 

 Key: HIVE-6456
 URL: https://issues.apache.org/jira/browse/HIVE-6456
 Project: Hive
  Issue Type: Improvement
Reporter: Brock Noland
Assignee: Brock Noland
Priority: Trivial
 Attachments: HIVE-6456.patch


 In HIVE-5783 we removed schema evolution:
 https://github.com/Parquet/parquet-mr/pull/297/files#r9824155




[jira] [Commented] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-02-12 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899284#comment-13899284
 ] 

Justin Coffey commented on HIVE-6414:
-

I'll investigate.

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
Reporter: Remus Rusanu
Assignee: Justin Coffey

 While working on HIVE-5998 I noticed that the ParquetRecordReader returns 
 IntWritable for all 'int like' types, which disagrees with the row object 
 inspectors. I thought that was fine and worked my way around it, but I see now 
 that the issue triggers failures in other places, e.g. in aggregates:
 {noformat}
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {cint:528534767,ctinyint:31,csmallint:4963,cfloat:31.0,cdouble:4963.0,cstring1:cvLH6Eat2yFsyy7p}
 at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
 at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
 ... 8 more
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to java.lang.Short
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:808)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
 at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:87)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
 at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:92)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
 at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:524)
 ... 9 more
 Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to java.lang.Short
 at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaShortObjectInspector.get(JavaShortObjectInspector.java:41)
 at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:671)
 at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:631)
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMin$GenericUDAFMinEvaluator.merge(GenericUDAFMin.java:109)
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMin$GenericUDAFMinEvaluator.iterate(GenericUDAFMin.java:96)
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:183)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:641)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:838)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:735)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:803)
 ... 15 more
 {noformat}
 My test is (I'm writing a test .q from HIVE-5998, but the repro does not 
 involve vectorization):
 {noformat}
 create table if not exists alltypes_parquet (
   cint int,
   ctinyint tinyint,
   csmallint smallint,
   cfloat float,
   cdouble double,
   cstring1 string) stored as parquet;
 insert overwrite table alltypes_parquet
   select cint,
     ctinyint,
     csmallint,
     cfloat,
     cdouble,
     cstring1
   from alltypesorc;
 explain select * from alltypes_parquet limit 10;
 select * from alltypes_parquet limit 10;
 explain select ctinyint,
   max(cint),
   min(csmallint),
   count(cstring1),
   avg(cfloat),
   stddev_pop(cdouble)
   from alltypes_parquet
   group by ctinyint;
 select ctinyint,
   max(cint),
   min(csmallint),
   count(cstring1),
   avg(cfloat),
   stddev_pop(cdouble)
   from alltypes_parquet
   group by ctinyint;
 {noformat}
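
 The failure mode in the trace above can be sketched in plain Java. This is
 only an illustration under stated assumptions, not Hive's code: IntBox,
 inspectAsShort and narrowToShort are hypothetical stand-ins for
 IntWritable, the short object inspector and a possible conversion step.
 The reader hands back a 32-bit box for a smallint column, the inspector
 downcasts it to Short and fails; narrowing the value first avoids the cast
 error.

```java
// Hypothetical stand-ins only; none of these are the real Hadoop/Hive classes.
final class IntLikeMismatchSketch {

    /** Stand-in for org.apache.hadoop.io.IntWritable: a boxed 32-bit value. */
    static final class IntBox {
        final int value;

        IntBox(int value) {
            this.value = value;
        }
    }

    /** Stand-in for an inspector that trusts the declared column type (smallint). */
    static short inspectAsShort(Object o) {
        // Same pattern as the failing frame: an unchecked downcast to Short.
        return (Short) o;
    }

    /** One possible remedy (an assumption): narrow the value to the declared type first. */
    static Object narrowToShort(Object o) {
        if (o instanceof IntBox) {
            int v = ((IntBox) o).value;
            if (v < Short.MIN_VALUE || v > Short.MAX_VALUE) {
                throw new IllegalArgumentException("value out of smallint range: " + v);
            }
            return Short.valueOf((short) v);
        }
        return o;
    }

    /** Returns true when inspecting the raw reader value throws ClassCastException. */
    static boolean rawValueFails(Object readerValue) {
        try {
            inspectAsShort(readerValue);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object fromReader = new IntBox(4963); // csmallint value from the failing row
        System.out.println("raw value fails: " + rawValueFails(fromReader));
        System.out.println("after narrowing: " + inspectAsShort(narrowToShort(fromReader)));
    }
}
```

 Whether the conversion belongs in the record reader or in the object
 inspectors is a design choice this sketch does not settle.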




[jira] [Assigned] (HIVE-6414) ParquetInputFormat provides data values that do not match the object inspectors

2014-02-12 Thread Justin Coffey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey reassigned HIVE-6414:
---

Assignee: Justin Coffey

 ParquetInputFormat provides data values that do not match the object 
 inspectors
 ---

 Key: HIVE-6414
 URL: https://issues.apache.org/jira/browse/HIVE-6414
 Project: Hive
  Issue Type: Bug
Reporter: Remus Rusanu
Assignee: Justin Coffey

 While working on HIVE-5998 I noticed that the ParquetRecordReader returns 
 IntWritable for all 'int like' types, which disagrees with the row object 
 inspectors. I thought that was fine and worked my way around it, but I see now 
 that the issue triggers failures in other places, e.g. in aggregates:
 {noformat}
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {cint:528534767,ctinyint:31,csmallint:4963,cfloat:31.0,cdouble:4963.0,cstring1:cvLH6Eat2yFsyy7p}
 at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
 at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
 ... 8 more
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to java.lang.Short
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:808)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
 at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:87)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
 at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:92)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:790)
 at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:524)
 ... 9 more
 Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to java.lang.Short
 at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaShortObjectInspector.get(JavaShortObjectInspector.java:41)
 at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:671)
 at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:631)
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMin$GenericUDAFMinEvaluator.merge(GenericUDAFMin.java:109)
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMin$GenericUDAFMinEvaluator.iterate(GenericUDAFMin.java:96)
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:183)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:641)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:838)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:735)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:803)
 ... 15 more
 {noformat}
 My test is (I'm writing a test .q from HIVE-5998, but the repro does not 
 involve vectorization):
 {noformat}
 create table if not exists alltypes_parquet (
   cint int,
   ctinyint tinyint,
   csmallint smallint,
   cfloat float,
   cdouble double,
   cstring1 string) stored as parquet;
 insert overwrite table alltypes_parquet
   select cint,
     ctinyint,
     csmallint,
     cfloat,
     cdouble,
     cstring1
   from alltypesorc;
 explain select * from alltypes_parquet limit 10;
 select * from alltypes_parquet limit 10;
 explain select ctinyint,
   max(cint),
   min(csmallint),
   count(cstring1),
   avg(cfloat),
   stddev_pop(cdouble)
   from alltypes_parquet
   group by ctinyint;
 select ctinyint,
   max(cint),
   min(csmallint),
   count(cstring1),
   avg(cfloat),
   stddev_pop(cdouble)
   from alltypes_parquet
   group by ctinyint;
 {noformat}




[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

2014-02-10 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896303#comment-13896303
 ] 

Justin Coffey commented on HIVE-5783:
-

Thanks to all, and especially [~brocknoland] for all his help!

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Fix For: 0.13.0

 Attachments: HIVE-5783.noprefix.patch, HIVE-5783.noprefix.patch, 
 HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, 
 HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, 
 HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, 
 HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, 
 HIVE-5783.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Updated] (HIVE-5783) Native Parquet Support in Hive

2014-01-24 Thread Justin Coffey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-5783:


Attachment: HIVE-5783.patch

The updated patch.  This fixes incorrect behavior when using HiveInputSplits.  
Regression tests have been added as a qtest (parquet_partitioned.q).

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Fix For: 0.13.0

 Attachments: HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, 
 HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

2014-01-23 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879960#comment-13879960
 ] 

Justin Coffey commented on HIVE-5783:
-

We have unfortunately found a bug in MapredParquetInputFormat.  We are working 
on a fix and will resubmit a patch once tested.

Sorry :(

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Fix For: 0.13.0

 Attachments: HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, 
 HIVE-5783.patch, HIVE-5783.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

2014-01-21 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877509#comment-13877509
 ] 

Justin Coffey commented on HIVE-5783:
-

[~leftylev], if you'd like I can give this a review and propose changes.

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Fix For: 0.13.0

 Attachments: HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch, 
 HIVE-5783.patch, HIVE-5783.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Updated] (HIVE-5783) Native Parquet Support in Hive


 [ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-5783:


Attachment: (was: parquet-hive.patch)

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Attachments: HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Updated] (HIVE-5783) Native Parquet Support in Hive


 [ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-5783:


Attachment: (was: hive-0.11-parquet.patch)

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Attachments: HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Updated] (HIVE-5783) Native Parquet Support in Hive


 [ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-5783:


Attachment: HIVE-5783.patch

This version is without license or author tags.

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Attachments: HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Updated] (HIVE-5783) Native Parquet Support in Hive


 [ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-5783:


Attachment: (was: HIVE-5783.patch)

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Attachments: HIVE-5783.patch, HIVE-5783.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Updated] (HIVE-5783) Native Parquet Support in Hive


 [ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-5783:


Attachment: HIVE-5783.patch

This is the good one; I had a final dependency to clean up.

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Attachments: HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

[
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876712#comment-13876712
]

Justin Coffey commented on HIVE-5783:
-

Sorry for the spam in posts. Latest patch is good:
- no author tags
- no Criteo copyright
- builds against latest version of parquet (1.3.2)

I attempted to create a review.apache.org review, but am unable to publish it
because I can't assign any reviewers.

Native Parquet Support in Hive
--

Key: HIVE-5783
URL: https://issues.apache.org/jira/browse/HIVE-5783
Project: Hive
Issue Type: New Feature
Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
Attachments: HIVE-5783.patch, HIVE-5783.patch, HIVE-5783.patch

Problem Statement:
Hive would be easier to use if it had native Parquet support. Our
organization, Criteo, uses Hive extensively. Therefore we built the Parquet
Hive integration and would like to now contribute that integration to Hive.
About Parquet:
Parquet is a columnar storage format for Hadoop and integrates with many
Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading,
Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native
Parquet integration.
Changes Details:
Parquet was built with dependency management in mind and therefore only a
single Parquet jar will be added as a dependency.


[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

2014-01-19 Thread Justin Coffey (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875930#comment-13875930
]

Justin Coffey commented on HIVE-5783:
-

Hi [~cwsteinbach]. Actually, that looks like just a boilerplate auto insertion
in the affected class files. The ASF license is on our short list of approved
OSS licenses, so I don't think it will be an issue for me to strip that out and
resubmit. I'll just double check all is well and resubmit Monday.

Native Parquet Support in Hive
--


[jira] [Updated] (HIVE-5783) Native Parquet Support in Hive

2014-01-17 Thread Justin Coffey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-5783:


Attachment: parquet-hive.patch

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Attachments: HIVE-5783.patch, hive-0.11-parquet.patch, 
 parquet-hive.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

2014-01-17 Thread Justin Coffey (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874626#comment-13874626
]

Justin Coffey commented on HIVE-5783:
-

After much delay, here is the patch. This integrates the former parquet-hive
project directly into ql.io.parquet.

There is a qtest file (modeled on that of ORC) and unit tests for much of the
code.

This applies cleanly to commit 3a7cea58ababfbbbdb6eac97fefa4298337b7c06 on
branch-0.11.

Comments welcome :).

Native Parquet Support in Hive
--


[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

2014-01-17 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874666#comment-13874666
 ] 

Justin Coffey commented on HIVE-5783:
-

[~rusanu]: like so?
https://reviews.facebook.net/differential/diff/47487/

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Attachments: HIVE-5783.patch, hive-0.11-parquet.patch, 
 parquet-hive.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.




[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

2013-12-18 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851661#comment-13851661
 ] 

Justin Coffey commented on HIVE-5783:
-

Yes, this is true. We are refactoring to merge the whole parquet-hive project 
into Hive. There are a couple of folks involved at this point, so it's 
taking a smidgen of extra time, what with the holidays and all.

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Attachments: HIVE-5783.patch, hive-0.11-parquet.patch


 Problem Statement:
 Hive would be easier to use if it had native Parquet support. Our 
 organization, Criteo, uses Hive extensively. Therefore we built the Parquet 
 Hive integration and would like to now contribute that integration to Hive.
 About Parquet:
 Parquet is a columnar storage format for Hadoop and integrates with many 
 Hadoop ecosystem tools such as Thrift, Avro, Hadoop MapReduce, Cascading, 
 Pig, Drill, Crunch, and Hive. Pig, Crunch, and Drill all contain native 
 Parquet integration.
 Changes Details:
 Parquet was built with dependency management in mind and therefore only a 
 single Parquet jar will be added as a dependency.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

2013-12-10 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844116#comment-13844116
 ] 

Justin Coffey commented on HIVE-5783:
-

[~cwsteinbach] all sounds good. Regarding test cases, I had some QTests 
prepared, but they were excluded from the initial patch to keep it as minimal 
as possible. We'll be sure to have full test coverage with the follow-up patch.

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Fix For: 0.11.0

 Attachments: HIVE-5783.patch, hive-0.11-parquet.patch




[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

2013-12-09 Thread Justin Coffey (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843425#comment-13843425
 ] 

Justin Coffey commented on HIVE-5783:
-

Hi [~cwsteinbach], so on the parquet-hive side, we're good to submit a new 
patch with direct serde integration.  We'll work on that presently.

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Fix For: 0.11.0

 Attachments: HIVE-5783.patch, hive-0.11-parquet.patch




[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

2013-12-09 Thread Justin Coffey (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843428#comment-13843428
]

Justin Coffey commented on HIVE-5783:
-

(sorry, errant trackpad submit on the last comment)

I wanted to add that I think the registry/format factory refactoring of the
BaseSemanticAnalyzer still seems out of scope for this request. There is
willingness to work on that on a different ticket, but I humbly submit that the
two are not linked and one should not impede the other.

Good?

Native Parquet Support in Hive
--

Key: HIVE-5783
URL: https://issues.apache.org/jira/browse/HIVE-5783
Project: Hive
Issue Type: New Feature
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
Fix For: 0.11.0

Attachments: HIVE-5783.patch, hive-0.11-parquet.patch


[jira] [Updated] (HIVE-5783) Native Parquet Support in Hive


 [ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-5783:


Fix Version/s: 0.11.0
 Release Note: Adds STORED AS PARQUET and the option to set Parquet as the 
default storage engine.
       Status: Patch Available  (was: Open)

Built and tested against Hive 0.11; a rebase will be necessary to work against 
trunk.

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Fix For: 0.11.0

 Attachments: hive-0.11-parquet.patch



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Updated] (HIVE-5783) Native Parquet Support in Hive


 [ 
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Coffey updated HIVE-5783:


Attachment: hive-0.11-parquet.patch

 Native Parquet Support in Hive
 --

 Key: HIVE-5783
 URL: https://issues.apache.org/jira/browse/HIVE-5783
 Project: Hive
  Issue Type: New Feature
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
 Fix For: 0.11.0

 Attachments: hive-0.11-parquet.patch




[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive

[
https://issues.apache.org/jira/browse/HIVE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841666#comment-13841666
]

Justin Coffey commented on HIVE-5783:
-

[~appodictic], regarding the support being built into the semantic analyzer, I
mimicked what was done for ORC support. I agree that a hard-coded switch
statement is not the best approach, but I thought a larger refactoring was out
of scope for this request, and definitely not something to be done against the
0.11 branch :). Now with trunk support for parquet-hive I suppose we could
tackle this in a more generic/robust way.
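
For illustration only, the difference between a hard-coded switch and a more
generic lookup might be sketched as below. This is a hypothetical Python
sketch, not Hive code: StorageFormat, resolve_hard_coded, FormatRegistry and
the "example.*" class names are all assumptions made up for the example, and
the real semantic analyzer is Java and considerably more involved.

```python
# Hypothetical sketch: none of these names come from the Hive code base.
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageFormat:
    """The three class names a STORED AS keyword has to expand to."""
    input_format: str
    output_format: str
    serde: str


# Placeholder class names, not the real Hive or parquet-hive classes.
ORC = StorageFormat("example.OrcInputFormat",
                    "example.OrcOutputFormat",
                    "example.OrcSerde")
PARQUET = StorageFormat("example.ParquetInputFormat",
                        "example.ParquetOutputFormat",
                        "example.ParquetSerDe")


def resolve_hard_coded(keyword):
    """Switch-style lookup: every new format means editing this function."""
    keyword = keyword.upper()
    if keyword == "ORC":
        return ORC
    if keyword == "PARQUET":
        return PARQUET
    raise ValueError("unknown storage format: " + keyword)


class FormatRegistry:
    """Registry-style lookup: new formats are registered, not coded in."""

    def __init__(self):
        self._formats = {}

    def register(self, keyword, fmt):
        # Keywords are stored upper-cased so lookups are case-insensitive.
        self._formats[keyword.upper()] = fmt

    def resolve(self, keyword):
        try:
            return self._formats[keyword.upper()]
        except KeyError:
            raise ValueError("unknown storage format: " + keyword) from None


registry = FormatRegistry()
registry.register("orc", ORC)
registry.register("parquet", PARQUET)
```

Both lookups return the same descriptor for a given keyword; the difference is
that adding a format to the registry is a single register() call and leaves
the resolving code untouched.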

[~xuefuz], do you mean the actual parquet input/output formats and serde? If
so, these are in the parquet-hive project
(https://github.com/Parquet/parquet-mr/tree/master/parquet-hive).

Native Parquet Support in Hive
--

Key: HIVE-5783
URL: https://issues.apache.org/jira/browse/HIVE-5783
Project: Hive
Issue Type: New Feature
Reporter: Justin Coffey
Assignee: Justin Coffey
Priority: Minor
Fix For: 0.11.0

Attachments: hive-0.11-parquet.patch


[jira] [Commented] (HIVE-5783) Native Parquet Support in Hive