Thank you all for your detailed responses. Let me make sure I have this right:
I can write the Parquet file any way I want, including with our own custom Thrift code. Hive does not care, because it will use the schema stored in the Parquet file together with the schema I specified when creating the table (stored in the Hive Metastore) to read the Parquet data into its own in-memory Hive format (what is this Hive format, by the way? Where are the classes it uses?). This is why I can simply do `STORED AS PARQUET` and everything should work.

This would be great, except that I have two requirements to meet:

1) We need to be able to rename, delete, and add Thrift fields, and continue to be able to read from Parquet files/tables created with the old format, without any alterations to the table.

2) The user should not have to specify the table columns when creating a Hive table. This information should be figured out by looking at the Thrift class the file was generated from.

According to what Julien said, 1) is not possible: I have to pick two out of the three ways the schema can evolve. I don't quite understand why, if we use lookup by index by setting the mentioned configuration property, we cannot remove fields. What do you mean by "it would change subsequent indices"? For example, say we have a Thrift struct with field ids 1, 2, 3; what happens if we remove field 2? (Sketch below.)

For 2), I think I could just use the normal `STORED AS PARQUET` but set a metadata/serde property on the table that points to the Thrift class, which would let us automatically generate the column names + types (second sketch below). Does that seem reasonable?
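To make the question in 1) concrete, here is a hypothetical IDL (struct and field names are made up). In reality both versions would keep the name `Event`; I use `EventV1`/`EventV2` only so the two versions can sit side by side in one snippet:

```thrift
// Version 1: three fields with ids 1, 2, 3.
struct EventV1 {
  1: required string id,
  2: optional string hostname,
  3: optional i64 created_at,
}

// Version 2: field 2 (hostname) deleted. Field ids 1 and 3 are
// unchanged, but created_at is now the second field positionally.
struct EventV2 {
  1: required string id,
  3: optional i64 created_at,
}
```

Is that the problem, then: with id-based matching the removal looks harmless since ids 1 and 3 never change, but with index-based lookup `created_at` shifts from the third position to the second, so old files read with the new class would resolve the wrong column?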
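And for 2), roughly the DDL I have in mind. As far as I know Hive does not accept this today, and the `thrift.class` property name is made up for illustration; I am not sure what parquet-hive would actually read:

```sql
-- Hypothetical: no column list; a table property names the Thrift
-- class, and the column names + types are derived from it.
CREATE TABLE events
STORED AS PARQUET
TBLPROPERTIES (
  'thrift.class' = 'com.example.thrift.Event'
);
```

— Stephen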
