To follow up, after discussing with more senior engineers at my company:

I misread what Julien said about accessing columns by index. I thought 
this was equivalent to the Thrift ID, but now I understand what he actually meant, 
and that solution is unfortunately not viable for our use case.

If you read this issue (https://github.com/Parquet/parquet-format/issues/91), 
what we want is solution #3, but the issue is still open and it looks like 
that approach was never implemented. So I’m going to have to add code that 
does essentially that :D

> I think it is easier and more general than that. Thrift already creates the 
> Parquet schema, so you should be able to create a table definition from a
> Parquet file without even worrying about the original Thrift class. There are 
> a couple of ways to do this, but none that I know of within Hive.

We COULD create a Hive table schema from the Parquet metadata, but that data 
could be out of date. We want to always use the most up-to-date Thrift schema.

Thank you all for your help. I think I know what I need to do now. At some point 
maybe I can contribute to the Parquet project to allow Hive to access columns 
by looking at the stored ID field instead of the field name.
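
To make the idea concrete, here is a rough sketch of the matching logic I have 
in mind. It is plain Python, not Parquet or Hive code; the Field class, the 
resolve_by_id function, and the example field names are all made up for 
illustration. It assumes each column in the file carries the ID of the Thrift 
field it was written from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field: a stable ID plus a name."""
    field_id: int
    name: str


def resolve_by_id(table_fields, file_fields):
    """Map each current column name to the name stored in the file.

    Columns are matched on field ID, so a renamed field still resolves.
    A column whose ID is absent from the file maps to None.
    """
    file_by_id = {f.field_id: f for f in file_fields}
    resolved = {}
    for table_field in table_fields:
        file_field = file_by_id.get(table_field.field_id)
        resolved[table_field.name] = file_field.name if file_field else None
    return resolved


# Schema the file was written with (older Thrift definition).
file_fields = [Field(1, "user_name"), Field(2, "age")]

# Current Thrift definition: field 1 was renamed and field 3 was added.
table_fields = [Field(1, "username"), Field(2, "age"), Field(3, "email")]

mapping = resolve_by_id(table_fields, file_fields)
```

With name-based lookup the renamed column would come back empty for old files. 
Matching on ID still finds it, and only the newly added field has no data.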
