Oh, and one other note about thrift:

In thrift the field ID is the primary identifier, the name shouldn't really
be used to identify anything. It's safe to change the name of the union
members in a thrift IDL as long as the field IDs remain the same.

It'd be nice if parquet-format had a notion of field primary ID that could
be optionally decoupled from the field name.
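
As a concrete illustration (a hand-rolled Python sketch, not Thrift's actual
wire format or API), resolving a serialized value by field ID is what makes
renaming a union member between schema versions harmless:

```python
# Hypothetical sketch: a union member is identified by its field ID on the
# wire, so a reader resolves a (field_id, value) pair against whatever
# names its current schema version uses. All names here are made up.

V1_UNION = {1: "userName", 2: "userId"}  # field-id -> member name, old IDL
V2_UNION = {1: "name", 2: "id"}          # members renamed, same field IDs

def decode(serialized, schema):
    """Map a serialized (field_id, value) pair to a (name, value) pair."""
    field_id, value = serialized
    return schema[field_id], value

wire = (2, 12345)  # the writer set the member with field ID 2
assert decode(wire, V1_UNION) == ("userId", 12345)
assert decode(wire, V2_UNION) == ("id", 12345)  # the rename is safe
```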

On Tue, Nov 10, 2015 at 7:55 PM, Alex Levenson <[email protected]>
wrote:

> A few thoughts about this:
>
> There are a few ways to think about column projection over unions:
> 1) As a *filter*, not a projection. For example, if I have a projection
> like `select(a.b.c.two)` where a.b.c is a union of {one,two,three}, then
> what I'm really saying is "give me all the records *where* a.b.c *is* a
> two, and then give me that data."
>
> 2) As only a projection, and so it's valid to say `select(a.b.c)` but
> `select(a.b.c.two)` is nonsensical and not allowed.
>
> 3) The way parquet-thrift currently implements this for unions: when the
> columns you select include *part* of a union, an arbitrary column is
> chosen from each of the other *parts* of that union. This is done in
> order to determine which member a particular record held for a given
> union. It only works because in thrift there's a wrapper object per union
> member, so we can project all but one column away from that type. In the
> case of a union of primitives, we wind up just keeping all the
> primitives.
>
> I actually like option 1 the best; it seems the most correct to me as far
> as user intention goes.
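>
> To make option 3 concrete, a rough sketch of the shape involved (the
> member names here are made up): if a.b.c is written as a group with one
> wrapper child per member,
>
>   a.b.c { one { x, y }, two { v }, three { z } }
>
> then `select(a.b.c.two)` keeps `two.v` plus one arbitrary column from
> `one` and from `three` (say `one.x` and `three.z`), just so the reader
> can still tell which member each record held.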
>
> On Tue, Nov 10, 2015 at 5:11 PM, Ryan Blue <[email protected]> wrote:
>
>> Jason,
>>
>> Thanks for the thorough research here. This all sounds pretty good to me.
>> I'll echo Julien's points about needing to define and document the UNION
>> type annotation (OriginalType).
>>
>> I'd also like to add that we should define behavior around unions with
>> null and what happens when projecting a subset of the union types.
>>
>> Avro's mapping leaves out null, so ["null", "int", "float"] becomes just
>> two columns, an int and a float. I don't know how null is handled in
>> thrift, but this seems like a reasonable way to handle it to me. We could
>> also have an extra required boolean "isDefined" column, though I'm not sure
>> that would be worth it.
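>>
>> To make that concrete, a rough sketch of the schema shape this implies
>> (the member names, and the union annotation itself, are exactly the
>> things we'd need to define): ["null", "int", "float"] could map to
>>
>>   optional group u (UNION) {
>>     optional int32 member0;
>>     optional float member1;
>>   }
>>
>> where a record whose union value is null simply writes neither member
>> column.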
>>
>> We have two options for projecting out union members: either return null
>> when none of the projected member columns is present, or don't allow
>> removing union members.
>>
>> For member naming, what is the value of requiring the name and the type?
>> I think the main motivation for member names is to be able to reorder the
>> union schema and still match up the columns between schema versions. For
>> thrift, the only part that we need is the name. Avro is a bit different,
>> but I don't think it will require the names at all so we could go with the
>> current memberN format.
>>
>> rb
>>
>>
>> On 11/04/2015 02:41 PM, Jason Altekruse wrote:
>>
>>> Hello Parquet devs,
>>>
>>> The Drill team is currently working on an implementation of the Union
>>> type, and we have begun evaluating what is needed to make it work with
>>> all parts of the engine. Two of the core features of Drill are the
>>> Parquet reader and writer, which provide access to Drill's fastest
>>> input format (parquet file creation is supported through CREATE TABLE
>>> AS statements). I have been taking a look at the existing
>>> implementation of Union type support in parquet-avro. It looks like
>>> Hive has not yet implemented support for the Union type in parquet [1].
>>> Thrift unions appear to be implemented as well, but I haven't looked at
>>> them in detail.
>>>
>>> Our primary goal in our implementation will be handling the JSON data
>>> model accurately, as it is what Drill's data model has been based on.
>>> Take for example this small set of JSON records. With the union type
>>> addition that was recently merged into Drill, we have added support for
>>> these two data types, integer and varchar, to coexist in a single
>>> column with our new Union type.
>>>
>>> { "user_id" : "james" }
>>> { "user_id" : 12345 }
>>>
>>> In addition to transitions between different scalar types, we will also
>>> need to support transitioning any column into a complex type like a map
>>> or list. Thus the following dataset would be supported as well. This
>>> extends to requiring support for unions that themselves contain nested
>>> unions. I believe that these requirements are going to be common among
>>> the other object models.
>>>
>>> { "account_admin" : "james" }
>>> { "account_admin" : 12345 }
>>> { "account_admin" : [12345, 1000, 98765] }
>>> { "account_admin" : ["Timothy", "Carl"] }
>>> { "account_admin" : { "primary" : "jackie", "secondary" : "john" } }
>>>
>>> // adding this record to the dataset is an example of requiring a union
>>> within a union, as the nested columns have changed from string to int
>>>
>>> { "account_admin" : { "primary" : 100001, "secondary" : 2000002 } }
>>>
>>> The avro implementation of the Union type seems to require an
>>> understanding
>>> of the Avro schema that is stored in the footer of the parquet file. As
>>> this
>>> concept is extended to other object models like Drill and Hive, we think
>>> it
>>> would be useful to have a discussion around a standard definition of the
>>> Union logical type as was done with the List and Map types here [2]. I am
>>> thinking that this standard should involve a description of the union
>>> types
>>> that is independent of any one object model, and all of the object models
>>> should map their features into a parquet standard logical Union type
>>> definition.
>>>
>>> We discussed this briefly in the hangout this morning, and I mentioned
>>> that I was considering proposing a change from the current avro
>>> approach, which uses numeric indices in the column names. Instead I
>>> would like to propose putting the type name in the column name of each
>>> particular leaf inside of a union. For those unfamiliar with unions in
>>> Parquet, as well as to confirm my understanding of the current avro
>>> model, here is an example of how I believe this is handled today. For
>>> readability I'll just be using JSON to describe the structure of the
>>> schema. I am going to say for now that maps that appear in the document
>>> below will correspond to Parquet groups, or intermediate nodes in the
>>> schema. They will not correspond to the logical Map type that has been
>>> defined.
>>>
>>> For this small subset of the data from above:
>>> { "account_admin" : "james" }
>>> { "account_admin" : 12345 }
>>>
>>> The way I understand it, an Avro schema mapping this into parquet today
>>> would look like this:
>>> { "account_admin" : { "member0" : "james", "member1" : null} }
>>> { "account_admin" : { "member0" : null, "member1" : 12345 } }
>>>
>>> Where member0 and member1 correspond to the position of these types as
>>> specified in the avro schema definition.
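>>>
>>> In Parquet schema terms I believe that corresponds to roughly this
>>> shape (the exact group annotation is one of the things we would need to
>>> standardize):
>>>
>>>   message Document {
>>>     optional group account_admin {
>>>       optional binary member0 (UTF8);
>>>       optional int32 member1;
>>>     }
>>>   }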
>>>
>>> I was initially going to propose something like this, where the data
>>> types would appear in the column names, but this is a bit redundant, as
>>> a parquet type (physical and logical) will be associated with each
>>> sub-column in the schema anyway:
>>> { "account_admin" : { "string" : "james", "int" : null} }
>>> { "account_admin" : { "string" : null, "int" : 12345 } }
>>>
>>> I am inclined to make a case for minimizing the extra metadata from the
>>> footer that is necessary for understanding the data in the union. It
>>> seems useful to try to enable object models without unions, but that do
>>> support nested data, to get the data out of parquet in a format that is
>>> reasonably well structured for understanding what is stored. This is
>>> still an issue for Drill users that try to query parquet files with
>>> unions today; they will see the raw member0 and member1 field names. We
>>> intend to fix this, and implement backward compatibility for the old
>>> structure used by thrift and avro, but I think it would be useful to
>>> consider making the data in the file simpler to understand when reading
>>> it into a format that lacks a union type.
>>>
>>> One thing that Julien mentioned this morning: Thrift unions are
>>> required to give a name to each type nested in the union, and they
>>> allow a particular type to appear more than once. Considering this, I
>>> would propose putting these column names into parquet along with the
>>> type stored in the column, something like this:
>>>
>>> { "account_admin" : { "name_string" : "james", "userid_int" : null} }
>>> { "account_admin" : { "name_string" : null, "userid_int" : 12345 } }
>>>
>>> This obviously isn't a formal proposal; I just wanted to send out a
>>> summary of our primary requirements and the small amount of research I
>>> have done so far. Please chime in with feedback, as well as corrections
>>> to anything I have stated that is incorrect.
>>>
>>> [1] -
>>>
>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L115
>>> [2] - https://github.com/apache/parquet-format/pull/17
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>>
>
>
>
> --
> Alex Levenson
> @THISWILLWORK
>



-- 
Alex Levenson
@THISWILLWORK
