On 12/07/2015 11:21 AM, Stephen Bly wrote:
Thank you all for your detailed responses. Let me make sure I have this right:
I can write the Parquet file in any way I want, including using our own custom
Thrift code. Hive does not care, because it will use the schema stored in the
Parquet file together with the schema I specified when creating the table
(stored in the Hive Metastore) to read the Parquet data into its own
in-memory Hive format (what is this Hive format by the way? Where are the
classes it uses?). This is why I can simply do `STORED AS PARQUET` and
everything should work.
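For concreteness, here is the kind of table definition I mean (the table,
column, and path names are just placeholders):

    -- The columns declared here are stored in the Hive Metastore; each Parquet
    -- file carries its own schema in its footer, and Hive reconciles the two
    -- when it reads the data.
    CREATE EXTERNAL TABLE events (
      id BIGINT,
      name STRING
    )
    STORED AS PARQUET
    LOCATION '/data/events';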
This would be great, except that I have two requirements I have to meet:
1) We need to be able to rename, delete, and add thrift fields, and continue to
be able to read from Parquet files/tables created with the old format, without
any alterations to the table.
2) The user should not have to specify the table columns when creating a Hive
table. This information should be figured out by looking at the Thrift class
from which the file was generated.
According to what Julien said, 1) is not possible: I have to pick two out of
three ways for the schema to evolve. I don’t quite understand why we cannot
remove fields if we use lookup by index (by setting the configuration property
you mentioned). What do you mean by “it would change subsequent indices”? For
example, let’s say we have a Thrift struct with field ids 1, 2, 3. What happens
if we remove field 2?
We can (and should) add support for this eventually, but it isn't done
today. Because columns are matched purely by position in that mode, removing
field 2 shifts field 3 down a position, so the new table definition no longer
lines up with the columns in files written with the old struct. Contributions
in this area are certainly welcome!
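To make the positional matching concrete, here is a sketch with placeholder
names; I'm assuming parquet.column.index.access is the configuration property
that was mentioned earlier in the thread:

    -- Assumption: this is the property that switches the Parquet SerDe from
    -- name-based to position-based column resolution.
    SET parquet.column.index.access=true;

    -- Old files were written from a Thrift struct with fields 1, 2, 3, so they
    -- contain three columns in that order. If field 2 is dropped and the table
    -- is redefined with two columns, the table's second column now maps to the
    -- old files' second column (the removed field) instead of the column that
    -- holds field 3's data.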
For 2), I think I could just use the normal `STORED AS PARQUET` but set a
metadata/serde property on the table that points to the Thrift class, which
would allow us to automatically generate the column names + types. Does that
seem reasonable?
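Concretely, I am picturing generated DDL along these lines (the property name
is made up and Hive would not act on it today; the column list is what the
tooling would derive from the Thrift class rather than something the user
writes by hand):

    -- Hypothetical: 'thrift.class' is not a property Hive understands; the
    -- idea is that tooling would read the named class, generate the columns,
    -- and record where the schema came from.
    CREATE EXTERNAL TABLE events (
      id BIGINT,
      name STRING
    )
    STORED AS PARQUET
    LOCATION '/data/events'
    TBLPROPERTIES ('thrift.class' = 'com.example.thrift.Event');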
I think it is easier and more general than that. Thrift already creates
the Parquet schema, so you should be able to create a table definition
from a Parquet file without even worrying about the original Thrift
class. There are a couple of ways to do this, but none that I know of
within Hive.
First, the Kite command-line tools can create a Hive table from Parquet
files by pointing at a directory of data. That will inspect all files'
schemas, union them, and convert the result to a Hive table definition.
It does this by using an intermediate Avro schema, but it should work
for you just fine.
Second, Impala has a SQL statement, CREATE TABLE ... LIKE PARQUET '<file>',
that will inspect a single file and give you a table definition that can
read it.
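Roughly like this (the paths are placeholders; check the Impala documentation
for the exact syntax in your version):

    -- Impala: derive the column names and types by inspecting one data file.
    CREATE EXTERNAL TABLE events_like_parquet
    LIKE PARQUET '/data/events/part-00000.parquet'
    STORED AS PARQUET
    LOCATION '/data/events';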
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.