Greetings Parquet experts. I am in need of a little help. I am a (very) junior developer at my company, and I have been tasked with adding support for the Parquet file format to our Hadoop ecosystem. Our main use case is creating Hive tables on top of Parquet data and querying them.
As you know, Hive can create Parquet tables with the STORED AS PARQUET command. However, we use a custom Thrift generator to generate Scala code (similar to Twitter's Scrooge, if you're familiar with that), so I am not sure that command will work for us. I tested it and the results are hit and miss: I can create tables, but I often get errors when querying them that I am still investigating.

Hive also allows a more customized setup via ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... (a rough sketch of the DDL I mean is at the bottom of this post). We already have a custom Parquet input format and output format, so I am wondering whether I also need to write a custom SerDe. I am not really sure where to start with this.

Furthermore, I am worried about schema evolution, i.e. what happens if we change our Thrift definitions and then try to read data that was written with the old schema. My concern is that Hive selects a column by name, not by Thrift ID, and only the IDs never change. Do I need to point to the Thrift definition from inside the Parquet file to keep track of a changing schema? That does not feel right to me, since Parquet is self-describing and stores its own schema in the file metadata.

Any help would be appreciated. I have read a ton of documentation, but nothing seems to address my (very specific) question. Thank you in advance.
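
P.S. For reference, here is roughly the kind of statement I have been experimenting with. The table, columns, location, and the input/output format class names are made-up placeholders standing in for our own custom classes; the SerDe shown is Hive's stock ParquetHiveSerDe, which is the part I suspect I may need to replace:

    -- Rough sketch only; names below are illustrative placeholders.
    CREATE EXTERNAL TABLE events (
      user_id BIGINT,
      event_name STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS
      INPUTFORMAT 'com.mycompany.hadoop.CustomParquetInputFormat'
      OUTPUTFORMAT 'com.mycompany.hadoop.CustomParquetOutputFormat'
    LOCATION '/data/events';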
