Re: Parquet support (HIVE-5783)
Hi,

Storage handlers muddy the waters a bit, IMO. That interface was written for storage that is not file-based, e.g. HBase, whereas Avro, Parquet, SequenceFile, etc. are all file-based.

I think we have to be practical about confusion. There are so many Hadoop newbies out there, almost all of them new to Apache as well, that some confusion is inevitable. For example, one person who had been using Hadoop and Hive for a few months told me that Hive had moved *from* Apache to Hortonworks. At the end of the day, regardless of what we do, some level of confusion is going to persist among those new to the ecosystem.

With that said, I do think that an overview of Hive storage would be a great addition to our documentation.

Brock

On Fri, Feb 21, 2014 at 1:27 AM, Lefty Leverenz leftylever...@gmail.com wrote:
> This is in the Terminology section (https://cwiki.apache.org/confluence/display/Hive/StorageHandlers#StorageHandlers-Terminology) of the Storage Handlers doc:
>
> "Storage handlers introduce a distinction between *native* and *non-native* tables. A native table is one which Hive knows how to manage and access without a storage handler; a non-native table is one which requires a storage handler."
>
> It goes on to say that non-native tables are created with a STORED BY clause (as opposed to a STORED AS clause). Does that clarify or muddy the waters?
>
> -- Lefty
>
> On Thu, Feb 20, 2014 at 7:37 PM, Lefty Leverenz leftylever...@gmail.com wrote:
>> Some of these issues can be addressed in the documentation. The File Formats section of the Language Manual needs an overview, and that might be a good place to explain the differences between Hive-owned formats and external formats. Or the SerDe doc could be beefed up: Built-In SerDes (https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inSerDes).
>>
>> In the meantime, I've added a link to the Avro doc in the File Formats list and mentioned Parquet in DDL's Row Format, Storage Format, and SerDe section (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe):
>>
>> "Use STORED AS PARQUET (without ROW FORMAT SERDE) for the Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) columnar storage format in Hive 0.13.0 and later (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.13andlater), or use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... in Hive 0.10, 0.11, or 0.12 (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12)."
>>
>> Does that work?

-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
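For reference, the two DDL variants described in the quoted wiki text can be sketched as follows. The table and column names are illustrative; the SerDe and input/output format class names for the 0.10-0.12 form are the ones documented on the Parquet wiki page for the external parquet-hive bundle, and should be checked against the jar actually deployed:

```sql
-- Hive 0.13.0 and later: PARQUET is a STORED AS keyword in the grammar.
CREATE TABLE page_views_parquet (
  user_id BIGINT,
  url     STRING
)
STORED AS PARQUET;

-- Hive 0.10, 0.11, or 0.12: spell out the SerDe and the input/output
-- formats provided by the external parquet-hive jars.
CREATE TABLE page_views_parquet_old (
  user_id BIGINT,
  url     STRING
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
```

Both statements produce a table backed by Parquet files; the first form only works once the grammar change discussed in this thread is in place.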
Re: Parquet support (HIVE-5783)
Some of these issues can be addressed in the documentation. The File Formats section of the Language Manual needs an overview, and that might be a good place to explain the differences between Hive-owned formats and external formats. Or the SerDe doc could be beefed up: Built-In SerDes (https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inSerDes).

In the meantime, I've added a link to the Avro doc in the File Formats list and mentioned Parquet in DDL's Row Format, Storage Format, and SerDe section (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe):

"Use STORED AS PARQUET (without ROW FORMAT SERDE) for the Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) columnar storage format in Hive 0.13.0 and later (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.13andlater), or use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... in Hive 0.10, 0.11, or 0.12 (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12)."

Does that work?

-- Lefty

On Tue, Feb 18, 2014 at 1:31 PM, Brock Noland br...@cloudera.com wrote:
> Hi Alan, Response is inline, below:
>
> On Tue, Feb 18, 2014 at 11:49 AM, Alan Gates ga...@hortonworks.com wrote:
>> Gunther, is it the case that there is anything extra that needs to be done to ship Parquet code with Hive right now? If I read the patch correctly, the Parquet jars were added to the pom and thus will be shipped as part of Hive. As long as it works out of the box when a user says "create table ... stored as parquet", why do we care whether the parquet jar is owned by Hive or another project? The concern about feature mismatch in Parquet versus Hive is valid, but I'm not sure what to do about it other than ensure that there are good error messages. Users will often want to use non-Hive storage formats (Parquet, Avro, etc.). This means we need a good way to detect at SQL compile time that the underlying storage doesn't support the indicated data type and throw a good error.
>
> Agreed, the error messages should absolutely be good. I will ensure this is the case via https://issues.apache.org/jira/browse/HIVE-6457
>
>> Also, it's important to be clear going forward about what Hive as a project is signing up for. If tomorrow someone decides to add a new datatype or feature, we need to be clear that we expect the contributor to make this work for Hive-owned formats (text, RC, sequence, ORC) but not necessarily for external formats.
>
> This makes sense to me. I'd just like to add that I have a patch available to improve the hive-exec uber jar and general query speed: https://issues.apache.org/jira/browse/HIVE-860. Additionally I have a patch available to finish the generic STORED AS functionality: https://issues.apache.org/jira/browse/HIVE-5976
>
> Brock
Re: Parquet support (HIVE-5783)
This is in the Terminology section (https://cwiki.apache.org/confluence/display/Hive/StorageHandlers#StorageHandlers-Terminology) of the Storage Handlers doc:

"Storage handlers introduce a distinction between *native* and *non-native* tables. A native table is one which Hive knows how to manage and access without a storage handler; a non-native table is one which requires a storage handler."

It goes on to say that non-native tables are created with a STORED BY clause (as opposed to a STORED AS clause). Does that clarify or muddy the waters?

-- Lefty

On Thu, Feb 20, 2014 at 7:37 PM, Lefty Leverenz leftylever...@gmail.com wrote:
> Some of these issues can be addressed in the documentation. The File Formats section of the Language Manual needs an overview, and that might be a good place to explain the differences between Hive-owned formats and external formats. Or the SerDe doc could be beefed up: Built-In SerDes (https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inSerDes).
>
> In the meantime, I've added a link to the Avro doc in the File Formats list and mentioned Parquet in DDL's Row Format, Storage Format, and SerDe section (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe):
>
> "Use STORED AS PARQUET (without ROW FORMAT SERDE) for the Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) columnar storage format in Hive 0.13.0 and later (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.13andlater), or use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... in Hive 0.10, 0.11, or 0.12 (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12)."
>
> Does that work?
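The native/non-native distinction quoted above shows up directly in the DDL. A minimal sketch, with illustrative table and column names (the handler class and property key are the ones from the Hive HBase integration docs):

```sql
-- Native table: file-based storage, no storage handler needed (STORED AS).
CREATE TABLE native_orders (
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS ORC;

-- Non-native table: a storage handler is required (STORED BY, not STORED AS).
CREATE TABLE hbase_orders (
  order_id BIGINT,
  amount   DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:amount");
```

The first column of the HBase-backed table maps to the HBase row key; the rest map to column-family/qualifier pairs.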
Re: Parquet support (HIVE-5783)
Gunther, is it the case that there is anything extra that needs to be done to ship Parquet code with Hive right now? If I read the patch correctly, the Parquet jars were added to the pom and thus will be shipped as part of Hive. As long as it works out of the box when a user says "create table ... stored as parquet", why do we care whether the parquet jar is owned by Hive or another project?

The concern about feature mismatch in Parquet versus Hive is valid, but I'm not sure what to do about it other than ensure that there are good error messages. Users will often want to use non-Hive storage formats (Parquet, Avro, etc.). This means we need a good way to detect at SQL compile time that the underlying storage doesn't support the indicated data type and throw a good error.

Also, it's important to be clear going forward about what Hive as a project is signing up for. If tomorrow someone decides to add a new datatype or feature, we need to be clear that we expect the contributor to make this work for Hive-owned formats (text, RC, sequence, ORC) but not necessarily for external formats (Parquet, Avro).

Alan.

On Feb 17, 2014, at 7:03 PM, Gunther Hagleitner ghagleit...@hortonworks.com wrote:
> Brock,
>
> I'm not trying to pick winners, I'm merely trying to say that the documentation/code should match what's actually there, so folks can make informed decisions.
>
> The issue I have with the word "native" is that people have expectations when they hear it, and I think those are not met. I've had folks ask me why we're switching Hive's default to Parquet. That obviously isn't the case, but "native" to most people means just that: Hive's primary format. That's why I was asking for a jira title of "Add Parquet SerDe"; that's exactly what was done for Avro under the same circumstances: https://issues.apache.org/jira/browse/HIVE-895. "Native" also has other associations: a) it supports the full data model/feature set, and b) it's part of Hive. Neither is the case, and I don't think that's just a superficial difference. Support and usability will be different. That's why I think the documentation should delineate between RC/ORC/etc. on one side and Parquet/Avro/etc. on the other.
>
> As mentioned in the jira, STORED AS was reserved for what's actually part of Hive (or Hadoop core, in the case of SequenceFile, as you point out). I think there are reasons for that: a) being part of the grammar implies "native" as above, b) you need to ship the code bundled in hive-exec for this to work (which is *broken* right now), and c) like you said, we shouldn't pick winners by letting some formats become a keyword and others not. For these reasons I think Parquet should use the old syntax at this point. If we had a pluggable/configurable way, great, but right now we don't have that.
>
> Finally, yes, I am late to this party and I apologize for that. I'm happy to make the suggested changes myself, if that's the concern.
>
> Thanks,
> Gunther.
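To make Alan's point concrete: the kind of compile-time check he is asking for would turn a statement like the following into an immediate, descriptive failure rather than a confusing runtime error. The DATE example and the error wording are hypothetical; which types a given Parquet SerDe version actually rejects depends on the Hive and Parquet releases involved:

```sql
-- Suppose the underlying format cannot represent one of the declared types.
-- The DDL (or a later INSERT) should then fail during compilation with a
-- message naming the offending column and type, something like:
--   FAILED: SemanticException column 'event_time': type DATE is not
--   supported by the Parquet storage format   (wording illustrative)
CREATE TABLE events (
  event_time DATE,
  payload    STRING
)
STORED AS PARQUET;
```

Without such a check, the table creation succeeds and the user only hits an obscure serialization error at query or insert time.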
Re: Parquet support (HIVE-5783)
Hi Alan, Response is inline, below:

On Tue, Feb 18, 2014 at 11:49 AM, Alan Gates ga...@hortonworks.com wrote:
> Gunther, is it the case that there is anything extra that needs to be done to ship Parquet code with Hive right now? If I read the patch correctly, the Parquet jars were added to the pom and thus will be shipped as part of Hive. As long as it works out of the box when a user says "create table ... stored as parquet", why do we care whether the parquet jar is owned by Hive or another project? The concern about feature mismatch in Parquet versus Hive is valid, but I'm not sure what to do about it other than ensure that there are good error messages. Users will often want to use non-Hive storage formats (Parquet, Avro, etc.). This means we need a good way to detect at SQL compile time that the underlying storage doesn't support the indicated data type and throw a good error.

Agreed, the error messages should absolutely be good. I will ensure this is the case via https://issues.apache.org/jira/browse/HIVE-6457

> Also, it's important to be clear going forward about what Hive as a project is signing up for. If tomorrow someone decides to add a new datatype or feature, we need to be clear that we expect the contributor to make this work for Hive-owned formats (text, RC, sequence, ORC) but not necessarily for external formats.

This makes sense to me. I'd just like to add that I have a patch available to improve the hive-exec uber jar and general query speed: https://issues.apache.org/jira/browse/HIVE-860. Additionally I have a patch available to finish the generic STORED AS functionality: https://issues.apache.org/jira/browse/HIVE-5976

Brock
Re: Parquet support (HIVE-5783)
Brock,

I'm not trying to pick winners, I'm merely trying to say that the documentation/code should match what's actually there, so folks can make informed decisions.

The issue I have with the word "native" is that people have expectations when they hear it, and I think those are not met. I've had folks ask me why we're switching Hive's default to Parquet. That obviously isn't the case, but "native" to most people means just that: Hive's primary format. That's why I was asking for a jira title of "Add Parquet SerDe"; that's exactly what was done for Avro under the same circumstances: https://issues.apache.org/jira/browse/HIVE-895. "Native" also has other associations: a) it supports the full data model/feature set, and b) it's part of Hive. Neither is the case, and I don't think that's just a superficial difference. Support and usability will be different. That's why I think the documentation should delineate between RC/ORC/etc. on one side and Parquet/Avro/etc. on the other.

As mentioned in the jira, STORED AS was reserved for what's actually part of Hive (or Hadoop core, in the case of SequenceFile, as you point out). I think there are reasons for that: a) being part of the grammar implies "native" as above, b) you need to ship the code bundled in hive-exec for this to work (which is *broken* right now), and c) like you said, we shouldn't pick winners by letting some formats become a keyword and others not. For these reasons I think Parquet should use the old syntax at this point. If we had a pluggable/configurable way, great, but right now we don't have that.

Finally, yes, I am late to this party and I apologize for that. I'm happy to make the suggested changes myself, if that's the concern.

Thanks,
Gunther.

On Sun, Feb 16, 2014 at 7:40 PM, Brock Noland br...@cloudera.com wrote:
> Hi Gunther, Please find my response inline.
>
> On Sat, Feb 15, 2014 at 5:52 PM, Gunther Hagleitner gunt...@apache.org wrote:
>> I read through the ticket, patch and documentation
>
> Thank you very much for reading through these items!
>
>> and would like to suggest some changes.
>
> There was ample time to suggest these changes prior to commit. The JIRA was created three months ago, and the title you object to and the patch were up there over two months ago.
>
>> As far as I can tell this basically adds Parquet SerDes to Hive, but the file format remains external to Hive. There is no way for Hive devs to make changes, fix bugs, add or change datatypes, or add features to Parquet itself.
>
> As stated in many locations, including the JIRA discussed here, we shouldn't be picking winner/loser file formats. We use many external libraries, none of which all Hive developers have the ability to modify. For example, most Hive developers do not have the ability to modify SequenceFile. Tez is also an external library which few Hive developers can change.
>
>> So:
>> - I suggest we document it as one of the built-in SerDes and not as a native format like here: https://cwiki.apache.org/confluence/display/Hive/Parquet (and here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
>> - I vote for the jira to say "Add Parquet SerDes to Hive" and not "Native support"
>
> The change provides the ability to create a Parquet table with Hive, natively. Therefore I don't see the issue you have with the word "native".
>
>> - I think we should revert the change to the grammar to allow STORED AS PARQUET until we have a mechanism to do that for all SerDes, i.e. someone picks up HIVE-5976. (I also don't think this actually works properly unless we bundle Parquet in hive-exec, which I don't think we want.)
>
> Again, you could have provided this feedback many moons ago. I am personally interested in HIVE-5976, but it's orthogonal to this issue. That change just makes it easier and cleaner to add STORED AS keywords. The contributors of the Parquet integration are not required to fix Hive. That is our job.
>
>> - We should revert the deprecated classes. (At least I don't understand how a first drop needs to add deprecated stuff.)
>
> The deprecated classes are shells (no actual code) to support existing users of Parquet, of which there are many. I see no justification for impacting existing users when the workaround is trivial and non-impacting to any other user.
>
>> In general though, I'm also confused about why adding this SerDe to the Hive code base is beneficial. It seems to me that just makes upgrading Parquet, bug fixing, etc. more difficult by tying a SerDe release to a Hive release. To me that outweighs the benefit of a slightly more involved setup of Hive + SerDe in the cluster.
>
> The Hive APIs, which are not clearly defined, have changed often in the past few releases, making it extremely difficult to maintain a file format externally. For example, 0.12 and 0.13 break most if not all external code bases. Beyond that, the community felt it was beneficial to make Parquet easier to use. If you are not interested in Parquet, then ignore it, as this change does not impact you.
Re: Parquet support (HIVE-5783)
Hi Gunther, Please find my response inline.

On Sat, Feb 15, 2014 at 5:52 PM, Gunther Hagleitner gunt...@apache.org wrote:
> I read through the ticket, patch and documentation

Thank you very much for reading through these items!

> and would like to suggest some changes.

There was ample time to suggest these changes prior to commit. The JIRA was created three months ago, and the title you object to and the patch were up there over two months ago.

> As far as I can tell this basically adds Parquet SerDes to Hive, but the file format remains external to Hive. There is no way for Hive devs to make changes, fix bugs, add or change datatypes, or add features to Parquet itself.

As stated in many locations, including the JIRA discussed here, we shouldn't be picking winner/loser file formats. We use many external libraries, none of which all Hive developers have the ability to modify. For example, most Hive developers do not have the ability to modify SequenceFile. Tez is also an external library which few Hive developers can change.

> So:
> - I suggest we document it as one of the built-in SerDes and not as a native format like here: https://cwiki.apache.org/confluence/display/Hive/Parquet (and here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
> - I vote for the jira to say "Add Parquet SerDes to Hive" and not "Native support"

The change provides the ability to create a Parquet table with Hive, natively. Therefore I don't see the issue you have with the word "native".

> - I think we should revert the change to the grammar to allow STORED AS PARQUET until we have a mechanism to do that for all SerDes, i.e. someone picks up HIVE-5976. (I also don't think this actually works properly unless we bundle Parquet in hive-exec, which I don't think we want.)

Again, you could have provided this feedback many moons ago. I am personally interested in HIVE-5976, but it's orthogonal to this issue. That change just makes it easier and cleaner to add STORED AS keywords. The contributors of the Parquet integration are not required to fix Hive. That is our job.

> - We should revert the deprecated classes. (At least I don't understand how a first drop needs to add deprecated stuff.)

The deprecated classes are shells (no actual code) to support existing users of Parquet, of which there are many. I see no justification for impacting existing users when the workaround is trivial and non-impacting to any other user.

> In general though, I'm also confused about why adding this SerDe to the Hive code base is beneficial. It seems to me that just makes upgrading Parquet, bug fixing, etc. more difficult by tying a SerDe release to a Hive release. To me that outweighs the benefit of a slightly more involved setup of Hive + SerDe in the cluster.

The Hive APIs, which are not clearly defined, have changed often in the past few releases, making it extremely difficult to maintain a file format externally. For example, 0.12 and 0.13 break most if not all external code bases. Beyond that, the community felt it was beneficial to make Parquet easier to use. If you are not interested in Parquet, then ignore it, as this change does not impact you. Tez integration is something which does not interest me and many other Hive developers. Indeed, other than a few cursory reviews and a few times where I championed the refactoring you were doing in order to support Tez, I have ignored the Tez work.

Sincerely,
Brock
Parquet support (HIVE-5783)
I read through the ticket, patch, and documentation and would like to suggest some changes. As far as I can tell this basically adds Parquet SerDes to Hive, but the file format remains external to Hive. There is no way for Hive devs to make changes, fix bugs, add or change datatypes, or add features to Parquet itself. So:

- I suggest we document it as one of the built-in SerDes and not as a native format like here: https://cwiki.apache.org/confluence/display/Hive/Parquet (and here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
- I vote for the jira to say "Add Parquet SerDes to Hive" and not "Native support"
- I think we should revert the change to the grammar to allow STORED AS PARQUET until we have a mechanism to do that for all SerDes, i.e. someone picks up HIVE-5976. (I also don't think this actually works properly unless we bundle Parquet in hive-exec, which I don't think we want.)
- We should revert the deprecated classes. (At least I don't understand how a first drop needs to add deprecated stuff.)

In general though, I'm also confused about why adding this SerDe to the Hive code base is beneficial. It seems to me that just makes upgrading Parquet, bug fixing, etc. more difficult by tying a SerDe release to a Hive release. To me that outweighs the benefit of a slightly more involved setup of Hive + SerDe in the cluster.

Thanks,
Gunther.