Hello Parquet devs,
The Drill team is currently working on an implementation of the Union type,
and we have begun evaluating what is needed to make it work with all parts
of the engine. Two of the core features of Drill are the Parquet reader and
writer, which provide access to Drill's fastest input format (parquet file
creation is supported through CREATE TABLE AS statements). I have been
taking a look at the existing implementation of Union type support in
parquet-avro. It looks like Hive has not yet implemented support for the
Union type in parquet [1]. Thrift unions appear to be implemented as well,
but I haven't looked at them in detail.
Our primary goal in our implementation will be handling the JSON data model
accurately, as it is the model on which Drill's data model is based. Take
for example this small set of JSON records. With the union type addition
that was recently merged into Drill, we have added support for these two
data types, integer and varchar, to coexist in a single column with our new
Union type.
{ "user_id" : "james" }
{ "user_id" : 12345 }
In addition to transitions between different scalar types, we also will
need to support transitioning any column into a complex type like a map or
list. Thus the following dataset would be supported as well. This extends
to requiring support for unions that themselves contain nested unions. I
believe that these requirements are going to be common among the other
object models.
{ "account_admin" : "james" }
{ "account_admin" : 12345 }
{ "account_admin" : [12345, 1000, 98765] }
{ "account_admin" : ["Timothy", "Carl"] }
{ "account_admin" : { "primary" : "jackie", "secondary" : "john" } }
// adding this record to the dataset is an example of requiring a union
// within a union, as the nested columns have changed from string to int
{ "account_admin" : { "primary" : 100001, "secondary" : 2000002 } }
The avro implementation of the Union type seems to require an understanding
of the Avro schema that is stored in the footer of the parquet file. As this
concept is extended to other object models like Drill and Hive, we think it
would be useful to have a discussion around a standard definition of the
Union logical type as was done with the List and Map types here [2]. I am
thinking that this standard should involve a description of the union types
that is independent of any one object model, and all of the object models
should map their features into a parquet standard logical Union type
definition.
We discussed this briefly in the hangout this morning, and I mentioned that
I was considering proposing a change from the current avro approach, which
uses numeric indices in the column names. Instead I would like to propose
putting the type name in the column name of each particular leaf inside a
union. For context for those unfamiliar with unions in Parquet, as well as
to confirm my understanding of the current avro model, here is an example
of how I believe this is handled today. For readability I'll just be using
JSON to describe the structure of the schema. I am going to say for now
that maps that appear in the document below will correspond to Parquet
groups, or intermediate nodes in the schema. They will not correspond to
the logical Map type that has been defined.
For this small subset of the data from above:
{ "account_admin" : "james" }
{ "account_admin" : 12345 }
The way I understand it, an Avro schema mapping this into parquet today
would look like this:
{ "account_admin" : { "member0" : "james", "member1" : null } }
{ "account_admin" : { "member0" : null, "member1" : 12345 } }
Where member0 and member1 correspond to the position of these types as
specified in the avro schema definition.
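To make the layout concrete, I would guess the corresponding Parquet schema
looks roughly like the sketch below (illustrative only; the exact names and
annotations parquet-avro writes may differ):

```
message document {
  optional group account_admin {      // the union itself
    optional binary member0 (UTF8);   // string branch of the union
    optional int32 member1;           // int branch of the union
  }
}
```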
I was initially going to propose something like the following, where the
data types appear in the column names. This is a bit redundant, as a
parquet type (physical and logical) will be associated with each sub-column
in the schema anyway.
{ "account_admin" : { "string" : "james", "int" : null } }
{ "account_admin" : { "string" : null, "int" : 12345 } }
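One property of this layout: the populated column's name alone identifies
the type of the value. A quick Python sketch (illustrative only, not Drill
or Parquet code; the column names are the hypothetical ones from the
example above):

```python
def flatten_union(row):
    """Pick the single non-null member of a union group and report
    (type_name, value), without consulting any footer metadata."""
    populated = [(col, v) for col, v in row.items() if v is not None]
    assert len(populated) == 1, "exactly one union member should be set"
    return populated[0]

print(flatten_union({"string": "james", "int": None}))  # ('string', 'james')
print(flatten_union({"string": None, "int": 12345}))    # ('int', 12345)
```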
I am inclined to make a case for minimizing the amount of extra metadata
from the footer that is necessary for understanding the data in the union.
It seems useful to enable object models that lack unions, but that do
support nested data, to get the data out of parquet in a format that is
reasonably well structured for understanding what is stored. This is still
an issue for Drill users who try to query parquet files with unions today:
they will see the raw member0 and member1 field names. We intend to fix
this, and to implement backward compatibility with the old structure used
by thrift and avro, but I think it would be useful to consider making the
data in the file easier to understand when it is read into a format that
lacks a union type.
One thing that Julien mentioned this morning: Thrift unions are required to
give a name to each type nested in the union, and they allow a particular
type to appear more than once. Considering this, I would propose putting
these names into the parquet column names along with the type stored in the
column, something like this:
{ "account_admin" : { "name_string" : "james", "userid_int" : null } }
{ "account_admin" : { "name_string" : null, "userid_int" : 12345 } }
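To sketch the write side of this naming scheme, here is a small Python
illustration (not Drill or Parquet code; the member names, types, and the
"<name>_<type>" convention are the hypothetical ones from the example
above). Exactly one member column is populated per record:

```python
# Hypothetical union definition: (thrift field name, Python type) pairs.
UNION_MEMBERS = [("name", str), ("userid", int)]


def type_suffix(t):
    # Map a Python type to an illustrative parquet-ish type name.
    return {str: "string", int: "int"}[t]


def encode(value):
    """Return one column per union member, named "<name>_<type>";
    only the member matching the value's type is populated."""
    row = {}
    for name, t in UNION_MEMBERS:
        col = "%s_%s" % (name, type_suffix(t))
        row[col] = value if isinstance(value, t) else None
    return row

print(encode("james"))  # {'name_string': 'james', 'userid_int': None}
print(encode(12345))    # {'name_string': None, 'userid_int': 12345}
```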
This obviously isn't a formal proposal; I just wanted to send out a summary
of our primary requirements and the small amount of research I have done so
far. Please chime in with feedback, as well as corrections to anything I
have stated that is incorrect.
[1] -
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L115
[2] - https://github.com/apache/parquet-format/pull/17