Hi Stephen,
Good questions. I think there is a slight misunderstanding about some of
the components, so I'll go over how they relate to one another first.
There are several different object models -- ways of working with data
in memory -- including Thrift, Hive, and Avro (to name just a few).
These are interchangeable and you use just one at a time. Compatibility
is handled at the API where the object models "plug in". So when you use
Hive to read Parquet data, it always uses the in-memory objects that
Hive expects, and that doesn't change whether you wrote the data with Thrift
or Avro or some other model. For you, that means that you can create
data with your thrift-based system and have Hive read it without knowing
anything about the writer.
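For example (the table name, columns, and location here are made up --
substitute your own), a Hive table over Parquet files written by your
Thrift-based system can be as simple as:

  -- hypothetical table over Parquet data written by the Thrift pipeline
  CREATE EXTERNAL TABLE events (
    id BIGINT,
    event_name STRING
  )
  STORED AS PARQUET
  LOCATION '/warehouse/events';

Hive reads those files with its own Parquet support; it never needs to
know that they were written with the Thrift object model.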
You might wonder how Hive and Thrift can agree on a table schema. That's
what the Parquet schema is used for. Every object model translates its
schema to the Parquet schema and puts that in the file. Then object
models resolve their expected schema (e.g., the Hive table definition)
with the file schema when reading a Parquet file. As long as your table
schema can read the file schema, you should be fine. Hive's translation
to a file schema is pretty sane, so I don't think you will have much
trouble.
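As a rough sketch of that chain (the struct is hypothetical, and the
exact type mapping depends on your parquet-thrift version), a Thrift
definition like

  struct Event {
    1: required i64 id
    2: optional string event_name
  }

ends up as a Parquet file schema along the lines of

  message Event {
    required int64 id;
    optional binary event_name (UTF8);
  }

and a Hive table with columns (id BIGINT, event_name STRING) resolves
against that file schema by matching those names.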
For schema evolution, the issues you raise just mean that certain
evolution steps should be avoided. The fact that columns are resolved by
name means that you should never reuse names and should not rename columns.
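For example (hypothetical column names again), adding a field under a
brand-new name is a safe evolution step, while renaming one is not:

  -- Safe: add a column whose name has never been used in this table.
  -- Files written before the change just return NULL for it.
  ALTER TABLE events ADD COLUMNS (created_at BIGINT);

  -- Unsafe: renaming event_name to full_name would make Hive look for
  -- "full_name" in old files, so values written as "event_name" would
  -- come back as NULL.

The same rule applies to your Thrift definitions: add new fields with
new names rather than renaming or reusing existing ones.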
Does that answer some of your questions?
rb
On 12/07/2015 08:12 AM, Stephen Bly wrote:
Greetings Parquet experts. I am in need of a little help.
I am a (very) junior developer at my company, and I have been tasked with
adding the Parquet file format to our Hadoop ecosystem. Our main use case is in
creating Hive tables on Parquet data and querying them.
As you know, Hive can create Parquet tables with the STORED AS PARQUET command.
However, we use a custom Thrift generator to generate Scala code (similar to
Twitter’s Scrooge, if you’re familiar with that). Thus, I am not sure if the
above command will work.
I tested it out and it is hit and miss -- I'm able to create tables, but often
get errors when querying them that I am still investigating.
Hive allows you to customize this by using INPUT FORMAT ... OUTPUT FORMAT ...
ROW FORMAT SERDE .... We already have a custom Parquet input and output format.
I am wondering if I will need to create a custom SerDe as well. I am not really sure
where to start for this.
Furthermore, I am worried about schema changes, i.e. if we change our Thrift
definitions and then try to read data that was written in the old format. This
is because Hive uses the column name to select a column, not Thrift field IDs,
which never change. Do I need to point to the Thrift definition from inside the Parquet
file to keep track of a changing schema? I feel like this does not make sense
as Parquet is self-describing, storing its own schema in metadata.
Any help would be appreciated. I have read a ton of documentation but nothing
seems to address my (very specific) question. Thank you in advance.
--
Ryan Blue
Software Engineer
Cloudera, Inc.