Re: Parquet support (HIVE-5783)
Hi,

Storage handlers muddy the waters a bit, IMO. That interface was written for storage that is not file-based, e.g. HBase, whereas Avro, Parquet, SequenceFile, etc. are all file-based.

I think we have to be practical about confusion. There are so many Hadoop newbies out there, almost all of them new to Apache as well, that some confusion is inevitable. For example, one person who had been using Hadoop and Hive for a few months told me that Hive had moved *from* Apache to Hortonworks. At the end of the day, regardless of what we do, some level of confusion is going to persist among those new to the ecosystem.

With that said, I do think that an overview of Hive storage would be a great addition to our documentation.

Brock

On Fri, Feb 21, 2014 at 1:27 AM, Lefty Leverenz leftylever...@gmail.com wrote:
> This is in the Terminology section (https://cwiki.apache.org/confluence/display/Hive/StorageHandlers#StorageHandlers-Terminology) of the Storage Handlers doc:
>
> "Storage handlers introduce a distinction between *native* and *non-native* tables. A native table is one which Hive knows how to manage and access without a storage handler; a non-native table is one which requires a storage handler."
>
> It goes on to say that non-native tables are created with a STORED BY clause (as opposed to a STORED AS clause). Does that clarify or muddy the waters?
>
> -- Lefty
>
> On Thu, Feb 20, 2014 at 7:37 PM, Lefty Leverenz leftylever...@gmail.com wrote:
>> Some of these issues can be addressed in the documentation. The File Formats section of the Language Manual needs an overview, and that might be a good place to explain the differences between Hive-owned formats and external formats. Or the SerDe doc could be beefed up: Built-In SerDes (https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inSerDes).
>>
>> In the meantime, I've added a link to the Avro doc in the File Formats list and mentioned Parquet in DDL's Row Format, Storage Format, and SerDe section (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe):
>>
>> "Use STORED AS PARQUET (without ROW FORMAT SERDE) for the Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) columnar storage format in Hive 0.13.0 and later (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.13andlater), or use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... in Hive 0.10, 0.11, or 0.12 (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12)."
>>
>> Does that work?

-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
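For reference, the two DDL variants described in the quoted wiki text can be sketched as follows. The table and column names are illustrative; the SerDe and input/output format class names for the 0.10-0.12 form are the ones documented on the Parquet wiki page for the external parquet-hive bundle, and should be checked against the jar actually deployed:

```sql
-- Hive 0.13.0 and later: PARQUET is a STORED AS keyword in the grammar.
CREATE TABLE page_views_parquet (
  user_id BIGINT,
  url     STRING
)
STORED AS PARQUET;

-- Hive 0.10, 0.11, or 0.12: spell out the SerDe and the input/output
-- formats provided by the external parquet-hive jars.
CREATE TABLE page_views_parquet_old (
  user_id BIGINT,
  url     STRING
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
```

Both statements produce a table backed by Parquet files; the first form only works once the grammar change discussed in this thread is in place.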
Re: Parquet support (HIVE-5783)
Some of these issues can be addressed in the documentation. The File Formats section of the Language Manual needs an overview, and that might be a good place to explain the differences between Hive-owned formats and external formats. Or the SerDe doc could be beefed up: Built-In SerDes (https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inSerDes).

In the meantime, I've added a link to the Avro doc in the File Formats list and mentioned Parquet in DDL's Row Format, Storage Format, and SerDe section (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe):

"Use STORED AS PARQUET (without ROW FORMAT SERDE) for the Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) columnar storage format in Hive 0.13.0 and later (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.13andlater), or use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... in Hive 0.10, 0.11, or 0.12 (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12)."

Does that work?

-- Lefty

On Tue, Feb 18, 2014 at 1:31 PM, Brock Noland br...@cloudera.com wrote:
> Hi Alan, Response is inline, below:
>
> On Tue, Feb 18, 2014 at 11:49 AM, Alan Gates ga...@hortonworks.com wrote:
>> Gunther, is it the case that there is anything extra that needs to be done to ship Parquet code with Hive right now? If I read the patch correctly, the Parquet jars were added to the pom and thus will be shipped as part of Hive. As long as it works out of the box when a user says "create table ... stored as parquet", why do we care whether the parquet jar is owned by Hive or another project? The concern about feature mismatch in Parquet versus Hive is valid, but I'm not sure what to do about it other than ensure that there are good error messages. Users will often want to use non-Hive storage formats (Parquet, Avro, etc.). This means we need a good way to detect at SQL compile time that the underlying storage doesn't support the indicated data type and throw a good error.
>
> Agreed, the error messages should absolutely be good. I will ensure this is the case via https://issues.apache.org/jira/browse/HIVE-6457
>
>> Also, it's important to be clear going forward about what Hive as a project is signing up for. If tomorrow someone decides to add a new datatype or feature, we need to be clear that we expect the contributor to make this work for Hive-owned formats (text, RC, sequence, ORC) but not necessarily for external formats.
>
> This makes sense to me. I'd just like to add that I have a patch available to improve the hive-exec uber jar and general query speed: https://issues.apache.org/jira/browse/HIVE-860. Additionally I have a patch available to finish the generic STORED AS functionality: https://issues.apache.org/jira/browse/HIVE-5976
>
> Brock
Re: Parquet support (HIVE-5783)
This is in the Terminology section (https://cwiki.apache.org/confluence/display/Hive/StorageHandlers#StorageHandlers-Terminology) of the Storage Handlers doc:

"Storage handlers introduce a distinction between *native* and *non-native* tables. A native table is one which Hive knows how to manage and access without a storage handler; a non-native table is one which requires a storage handler."

It goes on to say that non-native tables are created with a STORED BY clause (as opposed to a STORED AS clause). Does that clarify or muddy the waters?

-- Lefty

On Thu, Feb 20, 2014 at 7:37 PM, Lefty Leverenz leftylever...@gmail.com wrote:
> Some of these issues can be addressed in the documentation. The File Formats section of the Language Manual needs an overview, and that might be a good place to explain the differences between Hive-owned formats and external formats. Or the SerDe doc could be beefed up: Built-In SerDes (https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inSerDes).
>
> In the meantime, I've added a link to the Avro doc in the File Formats list and mentioned Parquet in DDL's Row Format, Storage Format, and SerDe section (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,StorageFormat,andSerDe):
>
> "Use STORED AS PARQUET (without ROW FORMAT SERDE) for the Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) columnar storage format in Hive 0.13.0 and later (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.13andlater), or use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... in Hive 0.10, 0.11, or 0.12 (https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12)."
>
> Does that work?
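The native/non-native distinction quoted above shows up directly in the DDL. A minimal sketch, with illustrative table and column names (the handler class and property key are the ones from the Hive HBase integration docs):

```sql
-- Native table: file-based storage, no storage handler needed (STORED AS).
CREATE TABLE native_orders (
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS ORC;

-- Non-native table: a storage handler is required (STORED BY, not STORED AS).
CREATE TABLE hbase_orders (
  order_id BIGINT,
  amount   DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:amount");
```

The first column of the HBase-backed table maps to the HBase row key; the rest map to column-family/qualifier pairs.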
Re: Parquet support (HIVE-5783)
Gunther, is it the case that there is anything extra that needs to be done to ship Parquet code with Hive right now? If I read the patch correctly, the Parquet jars were added to the pom and thus will be shipped as part of Hive. As long as it works out of the box when a user says "create table ... stored as parquet", why do we care whether the parquet jar is owned by Hive or another project?

The concern about feature mismatch in Parquet versus Hive is valid, but I'm not sure what to do about it other than ensure that there are good error messages. Users will often want to use non-Hive storage formats (Parquet, Avro, etc.). This means we need a good way to detect at SQL compile time that the underlying storage doesn't support the indicated data type and throw a good error.

Also, it's important to be clear going forward about what Hive as a project is signing up for. If tomorrow someone decides to add a new datatype or feature, we need to be clear that we expect the contributor to make this work for Hive-owned formats (text, RC, sequence, ORC) but not necessarily for external formats (Parquet, Avro).

Alan.

On Feb 17, 2014, at 7:03 PM, Gunther Hagleitner ghagleit...@hortonworks.com wrote:
> Brock,
>
> I'm not trying to pick winners, I'm merely trying to say that the documentation/code should match what's actually there, so folks can make informed decisions.
>
> The issue I have with the word "native" is that people have expectations when they hear it, and I think those are not met. I've had folks ask me why we're switching Hive's default to Parquet. That obviously isn't the case, but "native" to most people means just that: Hive's primary format. That's why I was asking for a jira title of "Add Parquet SerDe"; that's exactly what was done for Avro under the same circumstances: https://issues.apache.org/jira/browse/HIVE-895. "Native" also has other associations: a) it supports the full data model/feature set, and b) it's part of Hive. Neither is the case, and I don't think that's just a superficial difference. Support and usability will be different. That's why I think the documentation should delineate between RC/ORC/etc. on one side and Parquet/Avro/etc. on the other.
>
> As mentioned in the jira, STORED AS was reserved for what's actually part of Hive (or Hadoop core, in the case of SequenceFile, as you point out). I think there are reasons for that: a) being part of the grammar implies "native" as above, b) you need to ship the code bundled in hive-exec for this to work (which is *broken* right now), and c) like you said, we shouldn't pick winners by letting some formats become a keyword and others not. For these reasons I think Parquet should use the old syntax at this point. If we had a pluggable/configurable way, great, but right now we don't have that.
>
> Finally, yes, I am late to this party and I apologize for that. I'm happy to make the suggested changes myself, if that's the concern.
>
> Thanks,
> Gunther.
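To make Alan's point concrete: the kind of compile-time check he is asking for would turn a statement like the following into an immediate, descriptive failure rather than a confusing runtime error. The DATE example and the error wording are hypothetical; which types a given Parquet SerDe version actually rejects depends on the Hive and Parquet releases involved:

```sql
-- Suppose the underlying format cannot represent one of the declared types.
-- The DDL (or a later INSERT) should then fail during compilation with a
-- message naming the offending column and type, something like:
--   FAILED: SemanticException column 'event_time': type DATE is not
--   supported by the Parquet storage format   (wording illustrative)
CREATE TABLE events (
  event_time DATE,
  payload    STRING
)
STORED AS PARQUET;
```

Without such a check, the table creation succeeds and the user only hits an obscure serialization error at query or insert time.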
Re: Parquet support (HIVE-5783)
Hi Alan, Response is inline, below:

On Tue, Feb 18, 2014 at 11:49 AM, Alan Gates ga...@hortonworks.com wrote:
> Gunther, is it the case that there is anything extra that needs to be done to ship Parquet code with Hive right now? If I read the patch correctly, the Parquet jars were added to the pom and thus will be shipped as part of Hive. As long as it works out of the box when a user says "create table ... stored as parquet", why do we care whether the parquet jar is owned by Hive or another project? The concern about feature mismatch in Parquet versus Hive is valid, but I'm not sure what to do about it other than ensure that there are good error messages. Users will often want to use non-Hive storage formats (Parquet, Avro, etc.). This means we need a good way to detect at SQL compile time that the underlying storage doesn't support the indicated data type and throw a good error.

Agreed, the error messages should absolutely be good. I will ensure this is the case via https://issues.apache.org/jira/browse/HIVE-6457

> Also, it's important to be clear going forward about what Hive as a project is signing up for. If tomorrow someone decides to add a new datatype or feature, we need to be clear that we expect the contributor to make this work for Hive-owned formats (text, RC, sequence, ORC) but not necessarily for external formats.

This makes sense to me. I'd just like to add that I have a patch available to improve the hive-exec uber jar and general query speed: https://issues.apache.org/jira/browse/HIVE-860. Additionally I have a patch available to finish the generic STORED AS functionality: https://issues.apache.org/jira/browse/HIVE-5976

Brock
Re: Parquet support (HIVE-5783)
Brock,

I'm not trying to pick winners, I'm merely trying to say that the documentation/code should match what's actually there, so folks can make informed decisions.

The issue I have with the word "native" is that people have expectations when they hear it, and I think those are not met. I've had folks ask me why we're switching Hive's default to Parquet. That obviously isn't the case, but "native" to most people means just that: Hive's primary format. That's why I was asking for a jira title of "Add Parquet SerDe"; that's exactly what was done for Avro under the same circumstances: https://issues.apache.org/jira/browse/HIVE-895. "Native" also has other associations: a) it supports the full data model/feature set, and b) it's part of Hive. Neither is the case, and I don't think that's just a superficial difference. Support and usability will be different. That's why I think the documentation should delineate between RC/ORC/etc. on one side and Parquet/Avro/etc. on the other.

As mentioned in the jira, STORED AS was reserved for what's actually part of Hive (or Hadoop core, in the case of SequenceFile, as you point out). I think there are reasons for that: a) being part of the grammar implies "native" as above, b) you need to ship the code bundled in hive-exec for this to work (which is *broken* right now), and c) like you said, we shouldn't pick winners by letting some formats become a keyword and others not. For these reasons I think Parquet should use the old syntax at this point. If we had a pluggable/configurable way, great, but right now we don't have that.

Finally, yes, I am late to this party and I apologize for that. I'm happy to make the suggested changes myself, if that's the concern.

Thanks,
Gunther.

On Sun, Feb 16, 2014 at 7:40 PM, Brock Noland br...@cloudera.com wrote:
> Hi Gunther, Please find my response inline.
>
> On Sat, Feb 15, 2014 at 5:52 PM, Gunther Hagleitner gunt...@apache.org wrote:
>> I read through the ticket, patch and documentation
>
> Thank you very much for reading through these items!
>
>> and would like to suggest some changes.
>
> There was ample time to suggest these changes prior to commit. The JIRA was created three months ago, and the title you object to and the patch were up there over two months ago.
>
>> As far as I can tell this basically adds Parquet SerDes to Hive, but the file format remains external to Hive. There is no way for Hive devs to make changes, fix bugs, add or change datatypes, or add features to Parquet itself.
>
> As stated in many locations, including the JIRA discussed here, we shouldn't be picking winner/loser file formats. We use many external libraries, none of which all Hive developers have the ability to modify. For example, most Hive developers do not have the ability to modify SequenceFile. Tez is also an external library which few Hive developers can change.
>
>> So:
>> - I suggest we document it as one of the built-in SerDes and not as a native format like here: https://cwiki.apache.org/confluence/display/Hive/Parquet (and here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
>> - I vote for the jira to say "Add Parquet SerDes to Hive" and not "Native support"
>
> The change provides the ability to create a Parquet table with Hive, natively. Therefore I don't see the issue you have with the word "native".
>
>> - I think we should revert the change to the grammar to allow STORED AS PARQUET until we have a mechanism to do that for all SerDes, i.e. someone picks up HIVE-5976. (I also don't think this actually works properly unless we bundle Parquet in hive-exec, which I don't think we want.)
>
> Again, you could have provided this feedback many moons ago. I am personally interested in HIVE-5976, but it's orthogonal to this issue. That change just makes it easier and cleaner to add STORED AS keywords. The contributors of the Parquet integration are not required to fix Hive. That is our job.
>
>> - We should revert the deprecated classes. (At least I don't understand how a first drop needs to add deprecated stuff.)
>
> The deprecated classes are shells (no actual code) to support existing users of Parquet, of which there are many. I see no justification for impacting existing users when the workaround is trivial and non-impacting to any other user.
>
>> In general though, I'm also confused about why adding this SerDe to the Hive code base is beneficial. It seems to me that just makes upgrading Parquet, bug fixing, etc. more difficult by tying a SerDe release to a Hive release. To me that outweighs the benefit of a slightly more involved setup of Hive + SerDe in the cluster.
>
> The Hive APIs, which are not clearly defined, have changed often in the past few releases, making it extremely difficult to maintain a file format externally. For example, 0.12 and 0.13 break most if not all external code bases. Beyond that, the community felt it was beneficial to make Parquet easier to use. If you are not interested in Parquet, then ignore it, as this change does not impact you.
Re: Parquet support (HIVE-5783)
Hi Gunther, Please find my response inline.

On Sat, Feb 15, 2014 at 5:52 PM, Gunther Hagleitner gunt...@apache.org wrote:
> I read through the ticket, patch and documentation

Thank you very much for reading through these items!

> and would like to suggest some changes.

There was ample time to suggest these changes prior to commit. The JIRA was created three months ago, and the title you object to and the patch were up there over two months ago.

> As far as I can tell this basically adds Parquet SerDes to Hive, but the file format remains external to Hive. There is no way for Hive devs to make changes, fix bugs, add or change datatypes, or add features to Parquet itself.

As stated in many locations, including the JIRA discussed here, we shouldn't be picking winner/loser file formats. We use many external libraries, none of which all Hive developers have the ability to modify. For example, most Hive developers do not have the ability to modify SequenceFile. Tez is also an external library which few Hive developers can change.

> So:
> - I suggest we document it as one of the built-in SerDes and not as a native format like here: https://cwiki.apache.org/confluence/display/Hive/Parquet (and here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
> - I vote for the jira to say "Add Parquet SerDes to Hive" and not "Native support"

The change provides the ability to create a Parquet table with Hive, natively. Therefore I don't see the issue you have with the word "native".

> - I think we should revert the change to the grammar to allow STORED AS PARQUET until we have a mechanism to do that for all SerDes, i.e. someone picks up HIVE-5976. (I also don't think this actually works properly unless we bundle Parquet in hive-exec, which I don't think we want.)

Again, you could have provided this feedback many moons ago. I am personally interested in HIVE-5976, but it's orthogonal to this issue. That change just makes it easier and cleaner to add STORED AS keywords. The contributors of the Parquet integration are not required to fix Hive. That is our job.

> - We should revert the deprecated classes. (At least I don't understand how a first drop needs to add deprecated stuff.)

The deprecated classes are shells (no actual code) to support existing users of Parquet, of which there are many. I see no justification for impacting existing users when the workaround is trivial and non-impacting to any other user.

> In general though, I'm also confused about why adding this SerDe to the Hive code base is beneficial. It seems to me that just makes upgrading Parquet, bug fixing, etc. more difficult by tying a SerDe release to a Hive release. To me that outweighs the benefit of a slightly more involved setup of Hive + SerDe in the cluster.

The Hive APIs, which are not clearly defined, have changed often in the past few releases, making it extremely difficult to maintain a file format externally. For example, 0.12 and 0.13 break most if not all external code bases. Beyond that, the community felt it was beneficial to make Parquet easier to use. If you are not interested in Parquet, then ignore it, as this change does not impact you. Tez integration is something which does not interest me and many other Hive developers. Indeed, other than a few cursory reviews and a few times where I championed the refactoring you were doing in order to support Tez, I have ignored the Tez work.

Sincerely,
Brock
Parquet support (HIVE-5783)
I read through the ticket, patch, and documentation and would like to suggest some changes. As far as I can tell this basically adds Parquet SerDes to Hive, but the file format remains external to Hive. There is no way for Hive devs to make changes, fix bugs, add or change datatypes, or add features to Parquet itself. So:

- I suggest we document it as one of the built-in SerDes and not as a native format like here: https://cwiki.apache.org/confluence/display/Hive/Parquet (and here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
- I vote for the jira to say "Add Parquet SerDes to Hive" and not "Native support"
- I think we should revert the change to the grammar to allow STORED AS PARQUET until we have a mechanism to do that for all SerDes, i.e. someone picks up HIVE-5976. (I also don't think this actually works properly unless we bundle Parquet in hive-exec, which I don't think we want.)
- We should revert the deprecated classes. (At least I don't understand how a first drop needs to add deprecated stuff.)

In general though, I'm also confused about why adding this SerDe to the Hive code base is beneficial. It seems to me that just makes upgrading Parquet, bug fixing, etc. more difficult by tying a SerDe release to a Hive release. To me that outweighs the benefit of a slightly more involved setup of Hive + SerDe in the cluster.

Thanks,
Gunther.