On 12/07/2015 11:21 AM, Stephen Bly wrote:
Thank you all for your detailed responses. Let me make sure I have this right:
I can write the Parquet file in any way I want, including using our own custom
Thrift code. Hive does not care, because it will use the schema stored in the
Parquet file together with the schema I specified when creating the table
(stored in the Hive Metastore) to read the Parquet data into its own
in-memory Hive format (what is this Hive format by the way? Where are the
classes it uses?). This is why I can simply do `STORED AS PARQUET` and
everything should work.
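For concreteness, here is the kind of table definition I mean (the table,
column, and path names are just placeholders):

    -- The columns declared here are stored in the Hive Metastore; each Parquet
    -- file carries its own schema in its footer, and Hive reconciles the two
    -- when it reads the data.
    CREATE EXTERNAL TABLE events (
      id BIGINT,
      name STRING
    )
    STORED AS PARQUET
    LOCATION '/data/events';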
This would be great, except that I have two requirements I have to meet:
1) We need to be able to rename, delete, and add thrift fields, and continue to
be able to read from Parquet files/tables created with the old format, without
any alterations to the table.
2) The user should not have to specify the table columns when creating a Hive
table. This information should be figured out by looking at the Thrift class
from which the file was generated.
According to what Julien said, 1) is not possible: I have to pick two out of
three ways for the schema to evolve. I don’t quite understand why we cannot
remove fields if we use lookup by index (by setting the configuration property
you mentioned). What do you mean by “it would change subsequent indices”? For
example, let’s say we have a Thrift struct with field ids 1, 2, 3. What happens
if we remove field 2?
We can (and should) add support for this eventually, but it isn't done
today. Because columns are matched purely by position in that mode, removing
field 2 shifts field 3 down a position, so the new table definition no longer
lines up with the columns in files written with the old struct. Contributions
in this area are certainly welcome!
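To make the positional matching concrete, here is a sketch with placeholder
names; I'm assuming parquet.column.index.access is the configuration property
that was mentioned earlier in the thread:

    -- Assumption: this is the property that switches the Parquet SerDe from
    -- name-based to position-based column resolution.
    SET parquet.column.index.access=true;

    -- Old files were written from a Thrift struct with fields 1, 2, 3, so they
    -- contain three columns in that order. If field 2 is dropped and the table
    -- is redefined with two columns, the table's second column now maps to the
    -- old files' second column (the removed field) instead of the column that
    -- holds field 3's data.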
For 2), I think I could just use the normal `STORED AS PARQUET` but set a
metadata/serde property on the table that points to the Thrift class, which
would allow us to automatically generate the column names + types. Does that
seem reasonable?
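Concretely, I am picturing generated DDL along these lines (the property name
is made up and Hive would not act on it today; the column list is what the
tooling would derive from the Thrift class rather than something the user
writes by hand):

    -- Hypothetical: 'thrift.class' is not a property Hive understands; the
    -- idea is that tooling would read the named class, generate the columns,
    -- and record where the schema came from.
    CREATE EXTERNAL TABLE events (
      id BIGINT,
      name STRING
    )
    STORED AS PARQUET
    LOCATION '/data/events'
    TBLPROPERTIES ('thrift.class' = 'com.example.thrift.Event');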
I think it is easier and more general than that. Thrift already creates
the Parquet schema, so you should be able to create a table definition
from a Parquet file without even worrying about the original Thrift
class. There are a couple of ways to do this, but none that I know of
within Hive.
First, the Kite command-line tools can create a Hive table from Parquet
files by pointing at a directory of data. That will inspect all files'
schemas, union them, and convert the result to a Hive table definition.
It does this by using an intermediate Avro schema, but it should work
for you just fine.
Second, Impala has a SQL statement, CREATE TABLE ... LIKE PARQUET '<file>',
that will inspect a single file and give you a table definition that can
read it.
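Roughly like this (the paths are placeholders; check the Impala documentation
for the exact syntax in your version):

    -- Impala: derive the column names and types by inspecting one data file.
    CREATE EXTERNAL TABLE events_like_parquet
    LIKE PARQUET '/data/events/part-00000.parquet'
    STORED AS PARQUET
    LOCATION '/data/events';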
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.