Thank you all for your detailed responses. Let me make sure I have this right:

I can write the Parquet file any way I want, including using our own custom
Thrift code. Hive does not care, because it will use the schema stored in the
Parquet file, together with the schema I specified when creating the table
(stored in the Hive Metastore), to read the Parquet data into its own
in-memory Hive format (what is this Hive format, by the way? Where are the
classes it uses?). This is why I can simply do `STORED AS PARQUET` and
everything should work.
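
For concreteness, by "simply do `STORED AS PARQUET`" I mean a plain definition
like this (table and column names are made up):

```sql
CREATE TABLE events (
  user_id    BIGINT,
  event_name STRING
)
STORED AS PARQUET;
```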

This would be great, except that I have two requirements to meet:
1) We need to be able to rename, delete, and add Thrift fields, and continue to
be able to read from Parquet files/tables created with the old format, without
any alterations to the table.
2) The user should not have to specify the table columns when creating a Hive
table. This information should be figured out by looking at the Thrift class
from which the file was generated.

According to what Julien said, 1) is not possible: I have to pick two of the
three ways for the schema to evolve. I don’t quite understand why, if we use
lookup by index by setting the configuration property you mentioned, we cannot
remove fields. What do you mean by “it would change subsequent indices”? For
example, say we have a Thrift struct with field ids 1, 2, and 3 (see the
sketch below). What happens if we remove field 2?
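
Here is the kind of struct I mean (the field names are made up):

```thrift
// Old version of the struct, used to write the existing Parquet files.
struct Event {
  1: required string name,
  2: optional i64 timestamp,
  3: optional string payload,
}

// Later revision of the same IDL, with field 2 removed. My question is
// whether the positional index of field 3 shifts when reading old files.
struct Event {
  1: required string name,
  3: optional string payload,
}
```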

For 2), I think I could just use the normal `STORED AS PARQUET` but set a
metadata/serde property on the table that points to the Thrift class, which
would allow us to automatically generate the column names and types (see the
sketch below). Does that seem reasonable?
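
Something along these lines, where the property key `thrift.class` is just a
name I am making up, not an existing Hive or Parquet option:

```sql
-- Sketch only: as written, Hive would still demand an explicit column list;
-- the idea is that the columns would instead be derived from the Thrift class.
CREATE TABLE events
STORED AS PARQUET
TBLPROPERTIES ('thrift.class' = 'com.example.thrift.Event');
```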

— Stephen
