[jira] [Created] (PARQUET-391) Parquet build fails with thrift9 profile
Yash Datta created PARQUET-391:

Summary: Parquet build fails with thrift9 profile
Key: PARQUET-391
URL: https://issues.apache.org/jira/browse/PARQUET-391
Project: Parquet
Issue Type: Bug
Reporter: Yash Datta

compile parquet build using:
mvn clean install -Pthrift9 -DskipTests

build fails in parquet-cascading project :

[INFO] -
[ERROR] COMPILATION ERROR :
[INFO] -
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[10,32] package org.apache.thrift.scheme does not exist
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[11,32] package org.apache.thrift.scheme does not exist
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[12,32] package org.apache.thrift.scheme does not exist
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[14,32] package org.apache.thrift.scheme does not exist
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[15,34] cannot find symbol
  symbol: class TTupleProtocol
  location: package org.apache.thrift.protocol
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[40,44] cannot find symbol
  symbol: class IScheme
  location: class parquet.thrift.test.Name
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[40,54] cannot find symbol
  symbol: class SchemeFactory
  location: class parquet.thrift.test.Name
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[395,61] cannot find symbol
  symbol: class SchemeFactory
  location: class parquet.thrift.test.Name
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[401,51] cannot find symbol
  symbol: class StandardScheme
  location: class parquet.thrift.test.Name
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[462,58] cannot find symbol
  symbol: class SchemeFactory
  location: class parquet.thrift.test.Name
[ERROR] /mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[468,48] cannot find symbol

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Writing Parquet files from C++ or Go?
Ken,

There is a parquet-cpp project [1], but I don't think it has write support. There is also the Impala code that you might want to check out [2]. That's ASLv2 licensed so you could either use it directly or use it as a guide to add a writer to the parquet-cpp project.

rb

[1]: https://github.com/apache/parquet-cpp
[2]: https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-parquet-table-writer.cc

On 11/05/2015 05:12 PM, Ken Sedgwick wrote:

Greetings,

We'd like to write Parquet files from a C++ or Go program. Has this been done before? Are there any resources to know about?

Many thanks in advance!

Ken

--
Ryan Blue
Software Engineer
Cloudera, Inc.
[jira] [Commented] (PARQUET-391) Parquet build fails with thrift9 profile
[ https://issues.apache.org/jira/browse/PARQUET-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999133#comment-14999133 ]

Ryan Blue commented on PARQUET-391:

I think this is a duplicate of PARQUET-380. There's a PR with a fix here: https://github.com/apache/parquet-mr/pull/276

Is it okay with you if I close this and track it on the other issue?
[jira] [Commented] (PARQUET-391) Parquet build fails with thrift9 profile
[ https://issues.apache.org/jira/browse/PARQUET-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999166#comment-14999166 ]

Yash Datta commented on PARQUET-391:

I added a similar pull request; if the other one looks better, please close this https://github.com/apache/parquet-mr/pull/287
[jira] [Comment Edited] (PARQUET-391) Parquet build fails with thrift9 profile
[ https://issues.apache.org/jira/browse/PARQUET-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999166#comment-14999166 ]

Yash Datta edited comment on PARQUET-391 at 11/10/15 7:52 PM:

https://github.com/apache/parquet-mr/pull/287 closed in favor of https://github.com/apache/parquet-mr/pull/276
Please track in PARQUET-380

was (Author: saucam):
https://github.com/apache/parquet-mr/pull/287 closed in favor of https://github.com/apache/parquet-mr/pull/276
Re: Proposal for Union type
Jason,

Thanks for the thorough research here. This all sounds pretty good to me. I'll echo Julien's points about needing to define and document the UNION type annotation (OriginalType).

I'd also like to add that we should define behavior around unions with null and what happens when projecting a subset of the union types.

Avro's mapping leaves out null, so ["null", "int", "float"] becomes just two columns, an int and a float. I don't know how null is handled in thrift, but this seems like a reasonable way to handle it to me. We could also have an extra required boolean "isDefined" column, though I'm not sure that would be worth it.

We have two options for projecting out union members: either return null because none of the projected columns are present or don't allow removing union members.

For member naming, what is the value of requiring the name and the type? I think the main motivation for member names is to be able to reorder the union schema and still match up the columns between schema versions. For thrift, the only part that we need is the name. Avro is a bit different, but I don't think it will require the names at all so we could go with the current memberN format.

rb

On 11/04/2015 02:41 PM, Jason Altekruse wrote:

Hello Parquet devs,

The Drill team is currently working on an implementation of Union type and we have begun evaluating what is needed to make it work with all parts of the engine. Two of the core features of Drill are the Parquet reader and writer, which provide access to Drill's fastest input format (parquet file creation is supported through CREATE TABLE AS statements). I have been taking a look at the existing implementation of the Union type support implemented in parquet-avro. It looks like Hive has not yet implemented support for the Union type in parquet [1]. It looks like thrift unions are implemented as well, but I haven't looked at them in detail.

Our primary goal in our implementation will be handling the JSON data model accurately, as it is what Drill's data model has been based on. Take for example this small set of JSON records. With the union type addition that was recently merged into Drill, we have added support for these two data types, integer and varchar, to coexist in a single column with our new Union type.

{ "user_id" : "james" }
{ "user_id" : 12345 }

In addition to transitions between different scalar types, we also will need to support transitioning any column into a complex type like a map or list. Thus the following dataset would be supported as well. This extends to requiring support for unions that themselves contain nested unions. I believe that these requirements are going to be common among the other object models.

{ "account_admin" : "james" }
{ "account_admin" : 12345 }
{ "account_admin" : [12345, 1000, 98765] }
{ "account_admin" : ["Timothy", "Carl"] }
{ "account_admin" : { "primary" : "jackie", "secondary" : "john" } }

// adding this record to the dataset is an example of requiring a union within a union, as the nested columns have changed from string to int
{ "account_admin" : { "primary" : 11, "secondary" : 202 } }

The avro implementation of the Union type seems to require an understanding of the Avro schema that is stored in the footer of the parquet file. As this concept is extended to other object models like Drill and Hive, we think it would be useful to have a discussion around a standard definition of the Union logical type as was done with the List and Map types here [2]. I am thinking that this standard should involve a description of the union types that is independent of any one object model, and all of the object models should map their features into a parquet standard logical Union type definition.

We discussed this briefly in the hangout this morning and I mentioned that I was considering proposing a change from the current avro approach, which uses numeric indices in the column names. Instead I would like to propose putting the type name in the column name of each particular leaf inside of a union.

For context for those unfamiliar with unions in Parquet, as well as to confirm my understanding of the current avro model, here is an example of how I believe this is handled today. For readability I'll just be using JSON to describe the structure of the schema. I am going to say for now that maps that appear in the document below will correspond to Parquet groups, or intermediate nodes in the schema. They will not correspond to the logical Map type that has been defined.

For this small subset of the data from above:

{ "account_admin" : "james" }
{ "account_admin" : 12345 }

As I understand it, an Avro schema mapping this into parquet today would look like this:

{ "account_admin" : { "member0" : "james", "member1" : null } }
{ "account_admin" : { "member0" : null, "member1" : 12345 } }

Where member0 and member1 correspond to the position of these types as specified in the avro schema definition. I was
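For anyone following along, the memberN layout described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the parquet-avro code: the function name `member_slots`, the use of Python types to stand in for union branches, and the rule that null branches are skipped when numbering members are all assumptions made for the example.

```python
from typing import Any, Dict, List, Optional

NULL = type(None)  # hypothetical stand-in for Avro's "null" branch


def member_slots(value: Any, branches: List[type]) -> Optional[Dict[str, Any]]:
    """Spread one union value across memberN slots (assumed layout).

    Null branches get no slot, so ["null", "int", "float"] yields two
    members. At most one slot is non-null for any given value.
    """
    non_null = [t for t in branches if t is not NULL]
    if value is None:
        if NULL not in branches:
            raise TypeError("null is not a branch of this union")
        return None  # the whole union group is absent
    slots: Dict[str, Any] = {f"member{i}": None for i in range(len(non_null))}
    for i, branch in enumerate(non_null):
        # Exact type match, so a bool is not silently treated as an int.
        if type(value) is branch:
            slots[f"member{i}"] = value
            return slots
    raise TypeError(f"{value!r} matches no branch of the union")


print(member_slots("james", [str, int]))
print(member_slots(12345, [str, int]))
print(member_slots(1.5, [NULL, int, float]))
```

With branches `[str, int]` the two records come out in the same shape as the member0/member1 example in the message. Because the slot names are positional, reordering the branches changes which column a value lands in, which is the concern behind the naming discussion.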
Re: Proposal for Union type
Oh, and one other note about thrift:

In thrift the field ID is the primary identifier, the name shouldn't really be used to identify anything. It's safe to change the name of the union members in a thrift IDL as long as the field IDs remain the same. It'd be nice if parquet-format had a notion of field primary ID that could be optionally decoupled from the field name.

On Tue, Nov 10, 2015 at 7:55 PM, Alex Levenson wrote:

> A few thoughts about this:
>
> There's a few ways to think about column projection over unions:
> 1) As a *filter* not a projection. For example, if I have a projection
> like `select(a.b.c.two)` where a.b.c is a union of {one,two,three}, then
> what I'm really saying is "give me all the records *where* a.b.c *is* a
> two, and then give me that data.
>
> 2) As only a projection, and so it's valid to say `select(a.b.c)` but
> `select(a.b.c.two)` is nonsensical and not allowed.
>
> 3) The current way the parquet-thrift is implemented for unions is, when
> you select some columns, if any of them select *part* of a union, an
> arbitrary column is chosen from all the other *parts* of that union. This
> is done in order to determine which kind of member a particular record was
> for a given union. This only works because in thrift there's a wrapper
> object per union member, so we can project all but 1 column away from that
> type. In the case of a union of primitives, we wind up just keeping all the
> primitives.
>
> I actually like option 1 the best, it seems the most correct to me as far
> as user intention goes.
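The point about thrift field IDs can be made concrete with a small Python sketch that matches union members across two schema versions, once by field ID and once by name. Nothing here is parquet-thrift or parquet-format API: `Member`, `match_by_id`, `match_by_name` and the sample member names are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, NamedTuple, Optional


class Member(NamedTuple):
    """One union member as declared in a (hypothetical) thrift IDL."""
    field_id: int
    name: str


def match_by_id(old: List[Member], new: List[Member]) -> Dict[str, Optional[str]]:
    """Map each old member name to the new member with the same field ID."""
    new_by_id = {m.field_id: m.name for m in new}
    return {m.name: new_by_id.get(m.field_id) for m in old}


def match_by_name(old: List[Member], new: List[Member]) -> Dict[str, Optional[str]]:
    """Map each old member name to the new member with the same name."""
    new_names = {m.name for m in new}
    return {m.name: (m.name if m.name in new_names else None) for m in old}


# Version 2 renames member 2 and reorders the declaration.
v1 = [Member(1, "one"), Member(2, "two")]
v2 = [Member(2, "second"), Member(1, "one")]

print(match_by_id(v1, v2))    # {'one': 'one', 'two': 'second'}
print(match_by_name(v1, v2))  # {'one': 'one', 'two': None}
```

Matching by ID survives both the rename and the reorder, while matching by name loses the renamed member. That is the case for letting parquet-format carry an ID that is decoupled from the column name.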
Re: Proposal for Union type
A few thoughts about this:

There's a few ways to think about column projection over unions:

1) As a *filter* not a projection. For example, if I have a projection like `select(a.b.c.two)` where a.b.c is a union of {one,two,three}, then what I'm really saying is "give me all the records *where* a.b.c *is* a two, and then give me that data.

2) As only a projection, and so it's valid to say `select(a.b.c)` but `select(a.b.c.two)` is nonsensical and not allowed.

3) The current way the parquet-thrift is implemented for unions is, when you select some columns, if any of them select *part* of a union, an arbitrary column is chosen from all the other *parts* of that union. This is done in order to determine which kind of member a particular record was for a given union. This only works because in thrift there's a wrapper object per union member, so we can project all but 1 column away from that type. In the case of a union of primitives, we wind up just keeping all the primitives.

I actually like option 1 the best, it seems the most correct to me as far as user intention goes.

On Tue, Nov 10, 2015 at 5:11 PM, Ryan Blue wrote:

> Jason,
>
> Thanks for the thorough research here. This all sounds pretty good to me.
> I'll echo Julien's points about needing to define and document the UNION
> type annotation (OriginalType).
>
> I'd also like to add that we should define behavior around unions with
> null and what happens when projecting a subset of the union types.
>
> Avro's mapping leaves out null, so ["null", "int", "float"] becomes just
> two columns, an int and a float. I don't know how null is handled in
> thrift, but this seems like a reasonable way to handle it to me. We could
> also have an extra required boolean "isDefined" column, though I'm not sure
> that would be worth it.
>
> We have two options for projecting out union members: either return null
> because none of the projected columns are present or don't allow removing
> union members.
>
> For member naming, what is the value of requiring the name and the type? I
> think the main motivation for member names is to be able to reorder the
> union schema and still match up the columns between schema versions. For
> thrift, the only part that we need is the name. Avro is a bit different,
> but I don't think it will require the names at all so we could go with the
> current memberN format.
>
> rb
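To make the difference between the projection semantics concrete, here is a rough Python sketch of option 1 (projection as a filter) next to the "return null" behavior mentioned in the quoted reply. It assumes records are plain dicts in which a union is a group with one slot per member and at most one slot set. The function names and the sample data are invented for the example and do not correspond to any Parquet API.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def select_as_filter(records: Iterable[Record], union_field: str, member: str) -> List[Any]:
    """Option 1: keep only records whose union holds `member`, return that data."""
    out = []
    for rec in records:
        union = rec.get(union_field) or {}
        value = union.get(member)
        if value is not None:
            out.append(value)
    return out


def select_as_projection(records: Iterable[Record], union_field: str, member: str) -> List[Any]:
    """Alternative: keep every record, yielding None where another member was set."""
    return [(rec.get(union_field) or {}).get(member) for rec in records]


records = [
    {"c": {"one": 1, "two": None, "three": None}},
    {"c": {"one": None, "two": "x", "three": None}},
    {"c": {"one": None, "two": "y", "three": None}},
]

print(select_as_filter(records, "c", "two"))      # ['x', 'y']
print(select_as_projection(records, "c", "two"))  # [None, 'x', 'y']
```

The filter form changes the number of rows returned, while the projection form keeps the row count and cannot distinguish "was a different member" from "was null". That ambiguity seems to be part of what the "isDefined" column idea is trying to address.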