This sounds good to me.
We should have a UNION logical type in parquet-format to capture this
information.
A UNION type is defined as a GROUP and should always have exactly one field
populated.
By default the name of the field is the type name, but in the case of thrift
it is provided by the IDL.
We should also have the field ID set (in avro the index, in thrift the ID
from the IDL).
I think you should send a pull request to parquet-format with an update to
the doc similar to the LIST and MAP doc you pointed to.
Also update the thrift IDL with the definition.
Thanks for starting the discussion

On Wed, Nov 4, 2015 at 2:41 PM, Jason Altekruse <[email protected]>
wrote:

> Hello Parquet devs,
>
> The Drill team is currently working on an implementation of a Union type and
> we have begun evaluating what is needed to make it work with all parts of
> the engine. Two of the core features of Drill are the Parquet reader and
> writer, which provide access to Drill's fastest input format (parquet file
> creation is supported through CREATE TABLE AS statements). I have been
> taking a look at the existing Union type support in parquet-avro. It looks
> like Hive has not yet implemented support for the Union type in parquet [1].
> Thrift unions appear to be supported as well, but I haven't looked at them
> in detail.
>
> Our primary goal in our implementation will be handling the JSON data model
> accurately, as it is what Drill's data model has been based on. Take for
> example this small set of JSON records. With the union type addition that
> was recently merged into Drill, we have added support for these two data
> types, integer and varchar, to coexist in a single column with our new Union
> type.
>
> { "user_id" : "james" }
> { "user_id" : 12345 }
>
> In addition to transitions between different scalar types, we also will
> need to support transitioning any column into a complex type like a map or
> list. Thus the following dataset would be supported as well. This extends
> to requiring support for unions that themselves contain nested unions. I
> believe that these requirements are going to be common among the other
> object models.
>
> { "account_admin" : "james" }
> { "account_admin" : 12345 }
> { "account_admin" : [12345, 1000, 98765] }
> { "account_admin" : ["Timothy", "Carl"] }
> { "account_admin" : { "primary" : "jackie", "secondary" : "john" }
>
> // adding this record to the dataset is an example of requiring a union
> within a union, as the nested columns have changed from string to int
>
> { "account_admin" : { "primary" : 100001, "secondary" : 2000002 } }
>
> The avro implementation of the Union type seems to require an understanding
> of the Avro schema that is stored in the footer of the parquet file. As this
> concept is extended to other object models like Drill and Hive, we think it
> would be useful to have a discussion around a standard definition of the
> Union logical type as was done with the List and Map types here [2]. I am
> thinking that this standard should involve a description of the union types
> that is independent of any one object model, and all of the object models
> should map their features into a parquet standard logical Union type
> definition.
>
> We discussed this briefly in the hangout this morning and I mentioned that
> I was considering proposing a change from the current avro approach, which
> uses numeric indices in the column names. Instead I would like to propose
> putting the type name in the column name of each particular leaf inside of
> a union. For context of those unfamiliar with unions in Parquet, as well as
> to confirm my understanding of the current avro model, here is an example
> of how I believe this is handled today. For readability I'll just be using
> JSON to describe the structure of the schema. I am going to say for now
> that maps that appear in the document below will correspond to Parquet
> groups, or intermediate nodes in the schema. They will not correspond to
> the logical Map type that has been defined.
>
> For this small subset of the data from above:
> { "account_admin" : "james" }
> { "account_admin" : 12345 }
>
> The way I understand it, an Avro schema mapping this into parquet today
> would look like this:
> { "account_admin" : { "member0" : "james", "member1" : null} }
> { "account_admin" : { "member0" : null, "member1" : 12345 }
>
> Where member0 and member1 correspond to the position of these types as
> specified in the avro schema definition.
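>
> To make the structure explicit, I believe the corresponding Parquet schema
> would look roughly like this (a sketch on my part; I haven't verified the
> exact types and repetition that parquet-avro emits):
>
>   optional group account_admin {
>     optional binary member0 (UTF8);
>     optional int32 member1;
>   }
>
> with exactly one of the member fields populated for any given record.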
>
> I was initially going to propose something like this, where the data types
> would appear in the column names, but this is a bit redundant, as a parquet
> type (physical and logical) will be associated with each sub-column in the
> schema anyway.
> { "account_admin" : { "string" : "james", "int" : null} }
> { "account_admin" : { "string" : null, "int" : 12345 }
>
> I am inclined to make a case for minimizing the amount of extra footer
> metadata needed to understand the data in the union. It seems useful to
> enable object models that lack unions, but do support nested data, to get
> the data out of parquet in a format that is reasonably well structured for
> understanding what is stored. This is still an issue for Drill users who
> try to query parquet files with unions today: they will see the raw
> member0 and member1 field names. We intend to fix
> this, and implement backward compatibility for the old structure used by
> thrift and avro, but I think it would be useful to consider making
> understanding the data in the file simpler when reading the data into a
> format that lacks a union type.
>
> One thing that Julien mentioned this morning: Thrift unions are required to
> give a name to each type nested in the union, and they allow for a
> particular type to appear more than once. Considering this, I would propose
> putting these column names into parquet along with the type stored in the
> column, something like this:
>
> { "account_admin" : { "name_string" : "james", "userid_int" : null} }
> { "account_admin" : { "name_string" : null, "userid_int" : 12345 }
>
> This obviously isn't a formal proposal; I just wanted to send out a
> summary of our primary requirements and the small amount of research I have
> done so far. Please chime in with feedback as well as corrections to
> anything I have stated that is incorrect.
>
> [1] -
>
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L115
> [2] - https://github.com/apache/parquet-format/pull/17
>



-- 
Julien
