A few thoughts about this:
There are a few ways to think about column projection over unions:
1) As a *filter*, not a projection. For example, if I have a projection like
`select(a.b.c.two)` where a.b.c is a union of {one,two,three}, then what
I'm really saying is "give me all the records *where* a.b.c *is* a two, and
then give me that data."
2) As only a projection, and so it's valid to say `select(a.b.c)` but
`select(a.b.c.two)` is nonsensical and not allowed.
3) The way parquet-thrift currently implements unions: when you select
some columns, if any of them selects *part* of a union, an arbitrary
column is chosen from each of the other *parts* of that union. This is
done in order to determine which kind of member a particular record was
for a given union. It only works because in thrift there's a wrapper
object per union member, so we can project all but one column away from
that type. In the case of a union of primitives, we wind up just keeping
all the primitives.
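To make option 3 concrete, here is a small Python sketch (function and
column names are hypothetical, not parquet-thrift's actual code) of how
keeping one arbitrary surviving column per union member lets a reader
recover which member each record was:

```python
# Hypothetical sketch: each union member contributes a group of columns.
# If a projection would drop every column of a member, the reader could no
# longer tell whether a record *was* that member, so one arbitrary column
# of that member is kept as a presence marker.

def project_union(requested, union_members):
    """union_members maps member name -> list of its column names.
    requested is the set of columns the user asked for.
    Returns the set of columns to actually read."""
    keep = set()
    for member, columns in union_members.items():
        selected = [c for c in columns if c in requested]
        if selected:
            keep.update(selected)
        else:
            # No column of this member was requested: keep one arbitrary
            # column anyway, so null/non-null tells us whether a record
            # belonged to this member.
            keep.add(columns[0])
    return keep

members = {
    "one": ["one.a", "one.b"],
    "two": ["two.x"],
    "three": ["three.p", "three.q"],
}
# select(a.b.c.two): "two.x" is kept because it was requested; "one.a"
# and "three.p" survive only as member-presence markers.
cols = project_union({"two.x"}, members)
```

Note that for a union of primitives, every member's column group is a
single column, which is why the projection degenerates to keeping all
the primitives.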
I actually like option 1 the best; it seems the most correct to me as far
as user intention goes.
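For what option 1 would mean in practice, here is a minimal sketch (the
record shape and function name are hypothetical, not an existing API):
`select(a.b.c.two)` behaves like a filter followed by a projection.

```python
# Hypothetical sketch of option 1: selecting a union member acts as a
# filter ("only records where the union *is* that member") and then a
# projection of that member's data.

def select_member(records, member):
    """Each record's union value is modeled as a (member_name, data) pair."""
    return [data for (name, data) in records if name == member]

records = [
    ("one", 1),
    ("two", "hello"),
    ("three", [1, 2]),
    ("two", "world"),
]
# select(a.b.c.two): records whose union is not a "two" drop out entirely.
result = select_member(records, "two")
```

Records of other member kinds disappear from the result rather than
appearing as nulls, which matches the "filter" reading of the projection.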
On Tue, Nov 10, 2015 at 5:11 PM, Ryan Blue <[email protected]> wrote:
> Jason,
>
> Thanks for the thorough research here. This all sounds pretty good to me.
> I'll echo Julien's points about needing to define and document the UNION
> type annotation (OriginalType).
>
> I'd also like to add that we should define behavior around unions with
> null and what happens when projecting a subset of the union types.
>
> Avro's mapping leaves out null, so ["null", "int", "float"] becomes just
> two columns, an int and a float. I don't know how null is handled in
> thrift, but this seems like a reasonable way to handle it to me. We could
> also have an extra required boolean "isDefined" column, though I'm not sure
> that would be worth it.
>
> We have two options for projecting out union members: either return null
> because none of the projected columns are present, or don't allow removing
> union members.
>
> For member naming, what is the value of requiring the name and the type? I
> think the main motivation for member names is to be able to reorder the
> union schema and still match up the columns between schema versions. For
> thrift, the only part that we need is the name. Avro is a bit different,
> but I don't think it will require the names at all so we could go with the
> current memberN format.
>
> rb
>
>
> On 11/04/2015 02:41 PM, Jason Altekruse wrote:
>
>> Hello Parquet devs,
>>
>> The Drill team is currently working on an implementation of the Union type and
>> we have begun evaluating what is needed to make it work with all parts of
>> the engine. Two of the core features of Drill are the Parquet reader and
>> writer, which provide access to Drill's fastest input format (parquet file
>> creation is supported through CREATE TABLE AS statements). I have been
>> taking a look at the existing implementation of the Union type support
>> implemented in parquet-avro. It looks like Hive has not yet implemented
>> support for the Union type in parquet [1]. Thrift unions appear to be
>> implemented as well, but I haven't looked at them in detail.
>>
>> Our primary goal in our implementation will be handling the JSON data
>> model
>> accurately, as it is what Drill's data model has been based on. Take for
>> example this small set of JSON records. With the union type addition that
>> was recently merged into Drill, we have added support for these two data
>> types, integer and varchar, to coexist in a single column with our new
>> Union type.
>>
>> { "user_id" : "james" }
>> { "user_id" : 12345 }
>>
>> In addition to transitions between different scalar types, we also will
>> need to support transitioning any column into a complex type like a map or
>> list. Thus the following dataset would be supported as well. This extends
>> to requiring support for unions that themselves contain nested unions. I
>> believe that these requirements are going to be common among the other
>> object models.
>>
>> { "account_admin" : "james" }
>> { "account_admin" : 12345 }
>> { "account_admin" : [12345, 1000, 98765] }
>> { "account_admin" : ["Timothy", "Carl"] }
>> { "account_admin" : { "primary" : "jackie", "secondary" : "john" } }
>>
>> // adding this record to the dataset is an example of requiring a union
>> within a union, as the nested columns have changed from string to int
>>
>> { "account_admin" : { "primary" : 100001, "secondary" : 2000002 } }
>>
>> The avro implementation of the Union type seems to require an
>> understanding of the Avro schema that is stored in the footer of the
>> parquet file. As this
>> concept is extended to other object models like Drill and Hive, we think
>> it
>> would be useful to have a discussion around a standard definition of the
>> Union logical type as was done with the List and Map types here [2]. I am
>> thinking that this standard should involve a description of the union
>> types
>> that is independent of any one object model, and all of the object models
>> should map their features into a parquet standard logical Union type
>> definition.
>>
>> We discussed this briefly in the hangout this morning and I mentioned that
>> I was considering proposing a change from the current avro approach, using
>> numeric indices in the column names. Instead I would like to propose
>> putting the type name in the column name of each particular leaf inside of
>> a union. For the benefit of those unfamiliar with unions in Parquet, as
>> well as
>> to confirm my understanding of the current avro model, here is an example
>> of how I believe this is handled today. For readability I'll just be using
>> JSON to describe the structure of the schema. I am going to say for now
>> that maps that appear in the document below will correspond to Parquet
>> groups, or intermediate nodes in the schema. They will not correspond to
>> the logical Map type that has been defined.
>>
>> For this small subset of the data from above:
>> { "account_admin" : "james" }
>> { "account_admin" : 12345 }
>>
>> The way I understand it, an Avro schema mapping this into parquet today
>> would look like this:
>> { "account_admin" : { "member0" : "james", "member1" : null} }
>> { "account_admin" : { "member0" : null, "member1" : 12345 } }
>>
>> Where member0 and member1 correspond to the position of these types as
>> specified in the avro schema definition.
>>
>> I was initially going to propose something like this, where the data types
>> would appear in the column names, but this is a bit redundant, as a
>> parquet type (physical and logical) will be associated with each
>> sub-column in the schema anyway.
>> { "account_admin" : { "string" : "james", "int" : null} }
>> { "account_admin" : { "string" : null, "int" : 12345 } }
>>
>> I am inclined to make a case for minimizing the extra footer metadata
>> necessary for understanding the data in a union. It
>> seems useful to try to enable object models without unions, but that do
>> support nested data, to get the data out of parquet in a format that is
>> reasonably well structured for understanding what is stored. This is still
>> an issue for Drill users who try to query parquet files with unions
>> today: they will see the raw member0 and member1 field names. We intend
>> to fix
>> this, and implement backward compatibility for the old structure used by
>> thrift and avro, but I think it would be useful to consider making
>> understanding the data in the file simpler when reading the data into a
>> format that lacks a union type.
>>
>> One thing that Julien mentioned this morning: Thrift unions are required
>> to give a name to each type nested in the union, and they allow for a
>> particular type to appear more than once. Considering this, I would
>> propose
>> putting these column names into parquet along with the type stored in the
>> column, something like this:
>>
>> { "account_admin" : { "name_string" : "james", "userid_int" : null} }
>> { "account_admin" : { "name_string" : null, "userid_int" : 12345 } }
>>
>> This obviously isn't a formal proposal; I just wanted to send out a
>> summary of our primary requirements and the small amount of research I
>> have done so far. Please chime in with feedback, as well as corrections
>> to anything I have stated that is incorrect.
>>
>> [1] -
>>
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L115
>> [2] - https://github.com/apache/parquet-format/pull/17
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
--
Alex Levenson
@THISWILLWORK