On 12/07/2015 11:21 AM, Stephen Bly wrote:
Thank you all for your detailed responses. Let me make sure I have this right:

I can write the Parquet file in any way I want, including using our own custom 
Thrift code. Hive does not care, because it will use the schema stored in the 
Parquet file together with the schema I specified when creating the table 
(stored in the Hive Metastore) to read in the Parquet data into its own 
in-memory Hive format (what is this Hive format by the way? Where are the 
classes it uses?). This is why I can simply do `STORED AS PARQUET` and 
everything should work.
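
(For concreteness, this is the kind of table definition I mean; the table name, 
columns, and location below are just illustrative placeholders, not anything 
from this thread:

  CREATE EXTERNAL TABLE events (
    event_id BIGINT,
    event_type STRING,
    payload STRING
  )
  STORED AS PARQUET
  LOCATION '/data/events';
)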

This would be great, except that I have two requirements I have to meet:
1) We need to be able to rename, delete, and add Thrift fields, and continue to 
be able to read from Parquet files/tables created with the old format, without 
any alterations to the table.
2) The user should not have to specify the table columns when creating a Hive 
table. This information should be figured out by looking at the Thrift class 
from which the file was generated.

According to what Julien said, 1) is not possible: I have to pick two of the 
three ways the schema can evolve. I don’t quite understand why, if we use 
lookup by index by setting the mentioned configuration property, we cannot 
remove fields. What do you mean by “it would change subsequent indices”? For 
example, let’s say we have a Thrift struct with field ids 1, 2, 3. What happens 
if we remove field 2?

We can (and should) add support for this eventually, but it isn't done today. Contributions in this area are certainly welcome!

For 2), I think I could just use the normal `STORED AS PARQUET` but set a 
metadata/serde property on the table that points to the Thrift class, which 
would allow us to automatically generate the column names + types. Does that 
seem reasonable?
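
To make that concrete, I am picturing something like the sketch below. The 
property name 'thrift.class' and the class name are hypothetical placeholders, 
and stock Hive would still require the column list, so some tool would have to 
read the property and fill in the columns from the Thrift class:

  CREATE EXTERNAL TABLE events (
    event_id BIGINT,
    event_type STRING
  )
  STORED AS PARQUET
  LOCATION '/data/events'
  TBLPROPERTIES ('thrift.class' = 'com.example.thrift.Event');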

I think it is easier and more general than that. Thrift already creates the Parquet schema, so you should be able to create a table definition from a Parquet file without even worrying about the original Thrift class. There are a couple of ways to do this, but none that I know of within Hive.

First, the Kite command-line tools can create a Hive table from Parquet files by pointing at a directory of data. That will inspect all files' schemas, union them, and convert the result to a Hive table definition. It does this by using an intermediate Avro schema, but it should work for you just fine.

Second, Impala has a SQL statement, CREATE TABLE ... LIKE PARQUET '<f.parquet>', that will inspect a single file and give you a table definition that can read it.
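
Roughly like this, where the table name and HDFS paths are just illustrative 
placeholders:

  CREATE EXTERNAL TABLE events_like_file
  LIKE PARQUET '/data/events/part-00000.parquet'
  STORED AS PARQUET
  LOCATION '/data/events';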

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.
