Oh, and one other note about Thrift: in Thrift the field ID is the primary identifier; the name shouldn't really be used to identify anything. It's safe to change the names of union members in a Thrift IDL as long as the field IDs remain the same.
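A minimal Python sketch of the rename-safety property described above (the schema dicts and `decode_field` helper are illustrative, not part of any Thrift API):

```python
# Version 1 of a union's members, keyed by field ID.
OLD_SCHEMA = {1: "name", 2: "userid"}
# Version 2 renames both members but keeps the field IDs.
NEW_SCHEMA = {1: "display_name", 2: "user_identifier"}

def decode_field(schema, field_id, value):
    """Resolve a serialized (field_id, value) pair against a schema.

    Because the field ID, not the name, is the identifier, data written
    under OLD_SCHEMA resolves cleanly under NEW_SCHEMA.
    """
    return schema[field_id], value

# The same wire-level pair resolves under either schema version.
wire = (1, "james")
assert decode_field(OLD_SCHEMA, *wire) == ("name", "james")
assert decode_field(NEW_SCHEMA, *wire) == ("display_name", "james")
```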
It'd be nice if parquet-format had a notion of a field primary ID that could be optionally decoupled from the field name.

On Tue, Nov 10, 2015 at 7:55 PM, Alex Levenson <[email protected]> wrote:

> A few thoughts about this:
>
> There are a few ways to think about column projection over unions:
>
> 1) As a *filter*, not a projection. For example, if I have a projection
> like `select(a.b.c.two)` where a.b.c is a union of {one, two, three}, then
> what I'm really saying is "give me all the records *where* a.b.c *is* a
> two, and then give me that data."
>
> 2) As only a projection, so it's valid to say `select(a.b.c)` but
> `select(a.b.c.two)` is nonsensical and not allowed.
>
> 3) The way parquet-thrift currently implements unions: when you select
> some columns, if any of them select *part* of a union, an arbitrary column
> is chosen from each of the other *parts* of that union. This is done in
> order to determine which kind of member a particular record was for a
> given union. It only works because in Thrift there's a wrapper object per
> union member, so we can project all but one column away from that type.
> In the case of a union of primitives, we wind up just keeping all the
> primitives.
>
> I actually like option 1 the best; it seems the most correct to me as far
> as user intention goes.
>
> On Tue, Nov 10, 2015 at 5:11 PM, Ryan Blue <[email protected]> wrote:
>
>> Jason,
>>
>> Thanks for the thorough research here. This all sounds pretty good to me.
>> I'll echo Julien's points about needing to define and document the UNION
>> type annotation (OriginalType).
>>
>> I'd also like to add that we should define behavior around unions with
>> null and what happens when projecting a subset of the union types.
>>
>> Avro's mapping leaves out null, so ["null", "int", "float"] becomes just
>> two columns, an int and a float. I don't know how null is handled in
>> Thrift, but this seems like a reasonable way to handle it to me.
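The null-elision mapping Ryan describes above (a `["null", "int", "float"]` union stored as two optional columns, with null meaning neither column is set) could be sketched like this; the `encode` function and the memberN column names are illustrative, not parquet-avro's actual code:

```python
def encode(value):
    """Map one union value to its (member0_int, member1_float) columns."""
    if value is None:
        return (None, None)      # null branch: no member column is set
    if isinstance(value, bool):
        raise TypeError("not a member of this union")
    if isinstance(value, int):
        return (value, None)     # int branch  -> member0
    if isinstance(value, float):
        return (None, value)     # float branch -> member1
    raise TypeError("not a member of this union")

rows = [7, None, 2.5]
assert [encode(v) for v in rows] == [(7, None), (None, None), (None, 2.5)]
```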
>> We could also have an extra required boolean "isDefined" column, though
>> I'm not sure that would be worth it.
>>
>> We have two options for projecting out union members: either return null
>> because none of the projected columns are present, or don't allow
>> removing union members.
>>
>> For member naming, what is the value of requiring the name and the type?
>> I think the main motivation for member names is to be able to reorder the
>> union schema and still match up the columns between schema versions. For
>> Thrift, the only part that we need is the name. Avro is a bit different,
>> but I don't think it will require the names at all, so we could go with
>> the current memberN format.
>>
>> rb
>>
>> On 11/04/2015 02:41 PM, Jason Altekruse wrote:
>>
>>> Hello Parquet devs,
>>>
>>> The Drill team is currently working on an implementation of the Union
>>> type, and we have begun evaluating what is needed to make it work with
>>> all parts of the engine. Two of the core features of Drill are the
>>> Parquet reader and writer, which provide access to Drill's fastest input
>>> format (Parquet file creation is supported through CREATE TABLE AS
>>> statements). I have been taking a look at the existing implementation of
>>> Union type support in parquet-avro. It looks like Hive has not yet
>>> implemented support for the Union type in Parquet [1]. It looks like
>>> Thrift unions are implemented as well, but I haven't looked at them in
>>> detail.
>>>
>>> Our primary goal in our implementation will be handling the JSON data
>>> model accurately, as it is what Drill's data model has been based on.
>>> Take for example this small set of JSON records. With the union type
>>> addition that was recently merged into Drill, we have added support for
>>> these two data types, integer and varchar, to coexist in a single column
>>> with our new Union type.
>>>
>>> { "user_id" : "james" }
>>> { "user_id" : 12345 }
>>>
>>> In addition to transitions between different scalar types, we will also
>>> need to support transitioning any column into a complex type like a map
>>> or list. Thus the following dataset would be supported as well. This
>>> extends to requiring support for unions that themselves contain nested
>>> unions. I believe that these requirements are going to be common among
>>> the other object models.
>>>
>>> { "account_admin" : "james" }
>>> { "account_admin" : 12345 }
>>> { "account_admin" : [12345, 1000, 98765] }
>>> { "account_admin" : ["Timothy", "Carl"] }
>>> { "account_admin" : { "primary" : "jackie", "secondary" : "john" } }
>>>
>>> // adding this record to the dataset is an example of requiring a union
>>> // within a union, as the nested columns have changed from string to int
>>>
>>> { "account_admin" : { "primary" : 100001, "secondary" : 2000002 } }
>>>
>>> The Avro implementation of the Union type seems to require an
>>> understanding of the Avro schema that is stored in the footer of the
>>> Parquet file. As this concept is extended to other object models like
>>> Drill and Hive, we think it would be useful to have a discussion around
>>> a standard definition of the Union logical type, as was done with the
>>> List and Map types here [2]. I am thinking that this standard should
>>> involve a description of the union types that is independent of any one
>>> object model, and all of the object models should map their features
>>> into a Parquet-standard logical Union type definition.
>>>
>>> We discussed this briefly in the hangout this morning, and I mentioned
>>> that I was considering proposing a change from the current Avro approach
>>> of using numeric indices in the column names. Instead I would like to
>>> propose putting the type name in the column name of each particular leaf
>>> inside of a union.
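The requirement in Jason's examples above (one column whose records span strings, ints, lists, and maps) amounts to inferring a set of union members for the column. A small illustrative Python sketch, with made-up names (`member_types`, `records`):

```python
import json

records = [
    '{ "account_admin" : "james" }',
    '{ "account_admin" : 12345 }',
    '{ "account_admin" : [12345, 1000, 98765] }',
    '{ "account_admin" : { "primary" : "jackie", "secondary" : "john" } }',
]

def member_types(docs, column):
    """Collect the set of union members needed for one column."""
    kinds = {str: "string", int: "int", list: "list", dict: "map"}
    return {kinds[type(json.loads(d)[column])] for d in docs}

assert member_types(records, "account_admin") == {"string", "int", "list", "map"}
```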
>>> For context for those unfamiliar with unions in Parquet, as well as to
>>> confirm my understanding of the current Avro model, here is an example
>>> of how I believe this is handled today. For readability I'll just be
>>> using JSON to describe the structure of the schema. I am going to say
>>> for now that maps that appear in the document below will correspond to
>>> Parquet groups, or intermediate nodes in the schema. They will not
>>> correspond to the logical Map type that has been defined.
>>>
>>> For this small subset of the data from above:
>>>
>>> { "account_admin" : "james" }
>>> { "account_admin" : 12345 }
>>>
>>> the way I understand it, an Avro schema mapping this into Parquet today
>>> would look like this:
>>>
>>> { "account_admin" : { "member0" : "james", "member1" : null } }
>>> { "account_admin" : { "member0" : null, "member1" : 12345 } }
>>>
>>> where member0 and member1 correspond to the position of these types as
>>> specified in the Avro schema definition.
>>>
>>> I was initially going to propose something like this, where the data
>>> types would appear in the column names, but this is a bit redundant, as
>>> a Parquet type (physical and logical) will be associated with each
>>> sub-column in the schema anyway:
>>>
>>> { "account_admin" : { "string" : "james", "int" : null } }
>>> { "account_admin" : { "string" : null, "int" : 12345 } }
>>>
>>> I am inclined to make a case for minimizing the extra metadata from the
>>> footer that is necessary for understanding the data in the union. It
>>> seems useful to try to enable object models without unions, but that do
>>> support nested data, to get the data out of Parquet in a format that is
>>> reasonably well structured for understanding what is stored. This is
>>> still an issue for Drill users that try to query Parquet files with
>>> unions today: they will see the raw member0 and member1 field names.
>>> We intend to fix this, and implement backward compatibility for the old
>>> structure used by Thrift and Avro, but I think it would be useful to
>>> consider making the data in the file simpler to understand when reading
>>> it into a format that lacks a union type.
>>>
>>> One thing that Julien mentioned this morning: Thrift unions are required
>>> to give a name to each type nested in the union, and they allow a
>>> particular type to appear more than once. Considering this, I would
>>> propose putting these column names into Parquet along with the type
>>> stored in the column, something like this:
>>>
>>> { "account_admin" : { "name_string" : "james", "userid_int" : null } }
>>> { "account_admin" : { "name_string" : null, "userid_int" : 12345 } }
>>>
>>> This obviously isn't a formal proposal; I just wanted to send out a
>>> summary of our primary requirements and the small amount of research I
>>> have done so far. Please chime in with feedback, as well as corrections
>>> to anything I have stated that is incorrect.
>>>
>>> [1] -
>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L115
>>> [2] - https://github.com/apache/parquet-format/pull/17
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>
> --
> Alex Levenson
> @THISWILLWORK

--
Alex Levenson
@THISWILLWORK
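The name-plus-type column naming Jason proposes in the message above could be sketched as follows; this is an illustrative Python sketch of the idea (the `MEMBERS` table and `to_columns` helper are made up, not any implementation's actual behavior):

```python
# Union members as (member name, parquet-ish type name, Python type).
MEMBERS = [("name", "string", str), ("userid", "int", int)]

def to_columns(value):
    """Shred one union value into "<memberName>_<type>" named columns."""
    row = {}
    for member_name, type_name, py_type in MEMBERS:
        col = f"{member_name}_{type_name}"
        row[col] = value if isinstance(value, py_type) else None
    return row

assert to_columns("james") == {"name_string": "james", "userid_int": None}
assert to_columns(12345) == {"name_string": None, "userid_int": 12345}
```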
