(CC'ing some folks who may have more context)

There's a setting to make the column lookup by index instead (ignoring names in the file):
https://github.com/apache/parquet-mr/search?utf8=%E2%9C%93&q=PARQUET_COLUMN_INDEX_ACCESS
(remember that the serde code has moved to Hive itself, so parquet-hive is for older versions of Hive)

Netflix uses PARQUET_COLUMN_INDEX_ACCESS so that Parquet files behave like sequence files in that regard.

In Hive, the schema in the metastore is the reference. So if you change something in your Thrift definition, you will have to alter the schema in Hive (ALTER TABLE ...).
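As a rough sketch, assuming the serde property behind the PARQUET_COLUMN_INDEX_ACCESS constant is `parquet.column.index.access` (verify against your Hive version) and using a hypothetical table name:

```sql
-- Sketch only: property name assumed from the PARQUET_COLUMN_INDEX_ACCESS
-- constant; table and column names are hypothetical.
-- Switch the table to index-based column resolution:
ALTER TABLE events SET TBLPROPERTIES ('parquet.column.index.access' = 'true');

-- With index access, a rename only touches the metastore schema;
-- the underlying Parquet files are still read by column position.
ALTER TABLE events CHANGE COLUMN user_name username STRING;
```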
If looking up by name, you can add or remove a column but not rename it. If looking up by index, you can add or rename a column but not remove it (as that would change subsequent indices).

The Parquet metadata keeps the Thrift id in the schema:
https://github.com/apache/parquet-mr/blob/14097c64d243794610788d3ebb2e81ba8fd867c0/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConvertVisitor.java#L244
Which means that you could implement a third strategy (contributions welcome), assuming Hive can be made to add Thrift ids to the metastore.

On Mon, Dec 7, 2015 at 9:35 AM, Dmitriy Ryaboy <[email protected]> wrote:

> Hi Stephen,
> I'm not sure I follow your scenario.
> So you have your own Thrift generator; you then write (outside of Hive,
> using your own input/output format) the Thrift objects out into HDFS using
> your own custom Parquet output format. You have two questions:
> 1. Is there anything special you need to tell Hive to make these parquet
> files readable, other than that they are Parquet?
> 2. How does one deal with thrift evolution?
>
> Assuming I have this right, here are some answers:
> 1. There should not be; "stored as parquet" looks at the schema stored in
> parquet. If you have errors, it may be that (a) there are bugs in how your
> custom OF writes parquet metadata, or (b) there are bugs in Hive's parquet
> reader -- please do share if you find any.
> 2. Schema evolution is rough. At some point the Pig loader used names
> instead of ids, so Pig would break if a field was renamed without changing
> its type. Not sure how the Hive reader is implemented; someone else on this
> list might know. I think it's actually pretty static, since all the column
> names, etc., get written into the metastore by Hive. So you'd have to
> refresh table definitions even if you simply add a new column, much less
> change an old one.
>
>
> On Mon, Dec 7, 2015 at 8:12 AM, Stephen Bly <[email protected]> wrote:
>
> > Greetings Parquet experts.
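The evolution rules above can be illustrated with a hypothetical Thrift struct (names are placeholders, not from the thread):

```thrift
// Hypothetical struct to illustrate the evolution rules.
struct Event {
  // Renaming this field is safe only under index (or thrift-id) based
  // lookup; under name-based lookup the rename breaks old files.
  1: required string user_name;

  // Removing this field would shift the positions of later columns in
  // the generated Parquet schema, breaking index-based lookup.
  2: optional i64 timestamp;

  // Adding a new field with a fresh id at the end is safe under either
  // strategy.
  3: optional string region;
}
```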
> > I am in need of a little help.
> >
> > I am a (very) junior developer at my company, and I have been tasked with
> > adding the Parquet file format to our Hadoop ecosystem. Our main use case
> > is creating Hive tables on Parquet data and querying them.
> >
> > As you know, Hive can create Parquet tables with the STORED AS PARQUET
> > command. However, we use a custom Thrift generator to generate Scala code
> > (similar to Twitter's Scrooge, if you're familiar with that). Thus, I am
> > not sure if the above command will work.
> >
> > I tested it out and it is hit and miss -- I'm able to create tables, but
> > I often get errors when querying them that I am still investigating.
> >
> > Hive allows you to customize further by using INPUT FORMAT ... OUTPUT
> > FORMAT ... ROW FORMAT SERDE .... We already have a custom Parquet input
> > and output format. I am wondering if I will need to create a custom
> > serde. I am not really sure where to start with this.
> >
> > Furthermore, I am worried about changing schemas, i.e. if we change our
> > Thrift definitions and then try to read data that was written in the old
> > format. This is because Hive uses the name to select a column, not the
> > Thrift ids, which never change. Do I need to point to the Thrift
> > definition from inside the Parquet file to keep track of a changing
> > schema? I feel like this does not make sense, as Parquet is
> > self-describing, storing its own schema in its metadata.
> >
> > Any help would be appreciated. I have read a ton of documentation, but
> > nothing seems to address my (very specific) question. Thank you in
> > advance.

--
Julien
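For reference, a CREATE TABLE with explicit formats looks roughly like this. All class names here are hypothetical placeholders for your own implementations, not the actual classes from the thread:

```sql
-- Sketch only: substitute your own serde and format classes.
CREATE TABLE events (
  user_name STRING,
  ts        BIGINT
)
ROW FORMAT SERDE 'com.example.hive.MyParquetSerDe'
STORED AS
  INPUTFORMAT  'com.example.hive.MyParquetInputFormat'
  OUTPUTFORMAT 'com.example.hive.MyParquetOutputFormat';
```

If STORED AS PARQUET works for your files, Hive supplies its stock Parquet serde and input/output formats and none of this is needed; the explicit form is only for swapping in custom classes.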
