[jira] [Created] (PARQUET-391) Parquet build fails with thrift9 profile

2015-11-10 Thread Yash Datta (JIRA)
Yash Datta created PARQUET-391:
--

 Summary: Parquet build fails with thrift9 profile 
 Key: PARQUET-391
 URL: https://issues.apache.org/jira/browse/PARQUET-391
 Project: Parquet
  Issue Type: Bug
Reporter: Yash Datta


Compile the Parquet build using:

mvn clean install -Pthrift9 -DskipTests

The build fails in the parquet-cascading project:

[INFO] -
[ERROR] COMPILATION ERROR :
[INFO] -
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[10,32]
 package org.apache.thrift.scheme does not exist
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[11,32]
 package org.apache.thrift.scheme does not exist
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[12,32]
 package org.apache.thrift.scheme does not exist
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[14,32]
 package org.apache.thrift.scheme does not exist
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[15,34]
 cannot find symbol
  symbol:   class TTupleProtocol
  location: package org.apache.thrift.protocol
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[40,44]
 cannot find symbol
  symbol:   class IScheme
  location: class parquet.thrift.test.Name
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[40,54]
 cannot find symbol
  symbol:   class SchemeFactory
  location: class parquet.thrift.test.Name
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[395,61]
 cannot find symbol
  symbol:   class SchemeFactory
  location: class parquet.thrift.test.Name
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[401,51]
 cannot find symbol
  symbol:   class StandardScheme
  location: class parquet.thrift.test.Name
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[462,58]
 cannot find symbol
  symbol:   class SchemeFactory
  location: class parquet.thrift.test.Name
[ERROR] 
/mnt/devel/yash/parquet-mr-1/parquet-cascading/target/generated-test-sources/thrift/parquet/thrift/test/Name.java:[468,48]
 cannot find symbol




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Writing Parquet files from C++ or Go?

2015-11-10 Thread Ryan Blue

Ken,

There is a parquet-cpp project [1], but I don't think it has write 
support. There is also the Impala code that you might want to check out 
[2]. That's ASLv2 licensed so you could either use it directly or use it 
as a guide to add a writer to the parquet-cpp project.


rb

[1]: https://github.com/apache/parquet-cpp
[2]: https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-parquet-table-writer.cc


On 11/05/2015 05:12 PM, Ken Sedgwick wrote:

Greetings,

We'd like to write Parquet files from a C++ or Go program.  Has this been
done before?  Are there any resources to know about?

Many thanks in advance!

Ken




--
Ryan Blue
Software Engineer
Cloudera, Inc.


[jira] [Commented] (PARQUET-391) Parquet build fails with thrift9 profile

2015-11-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999133#comment-14999133
 ] 

Ryan Blue commented on PARQUET-391:
---

I think this is a duplicate of PARQUET-380. There's a PR with a fix here: 
https://github.com/apache/parquet-mr/pull/276

Is it okay with you if I close this and track it on the other issue?



[jira] [Commented] (PARQUET-391) Parquet build fails with thrift9 profile

2015-11-10 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999166#comment-14999166
 ] 

Yash Datta commented on PARQUET-391:


I added a similar pull request; if the other one looks better, please close 
this one:

https://github.com/apache/parquet-mr/pull/287



[jira] [Comment Edited] (PARQUET-391) Parquet build fails with thrift9 profile

2015-11-10 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999166#comment-14999166
 ] 

Yash Datta edited comment on PARQUET-391 at 11/10/15 7:52 PM:
--

https://github.com/apache/parquet-mr/pull/287


closed in favor of 

https://github.com/apache/parquet-mr/pull/276

Please track in PARQUET-380


was (Author: saucam):

https://github.com/apache/parquet-mr/pull/287


closed in favor of 

https://github.com/apache/parquet-mr/pull/276



Re: Proposal for Union type

2015-11-10 Thread Ryan Blue

Jason,

Thanks for the thorough research here. This all sounds pretty good to 
me. I'll echo Julien's points about needing to define and document the 
UNION type annotation (OriginalType).


I'd also like to add that we should define behavior around unions with 
null and what happens when projecting a subset of the union types.


Avro's mapping leaves out null, so ["null", "int", "float"] becomes just 
two columns, an int and a float. I don't know how null is handled in 
thrift, but this seems like a reasonable way to handle it to me. We 
could also have an extra required boolean "isDefined" column, though I'm 
not sure that would be worth it.
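
A minimal sketch of the Avro behavior described above, in Python for 
brevity. The function name union_member_columns is hypothetical, and the 
assumption that members are numbered consecutively once "null" is dropped 
is an illustration, not something stated in this thread.

```python
def union_member_columns(avro_union):
    """Map an Avro union's branches to memberN columns, skipping "null".

    Assumption: numbering is consecutive over the non-null branches only.
    """
    non_null = [branch for branch in avro_union if branch != "null"]
    return {f"member{i}": branch for i, branch in enumerate(non_null)}


columns = union_member_columns(["null", "int", "float"])
print(columns)  # two columns: an int and a float
```

Under this sketch a null value is represented by every member column being 
null, which is why no separate "isDefined" column is needed.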


We have two options for projecting out union members: either return null 
because none of the projected columns are present or don't allow 
removing union members.


For member naming, what is the value of requiring the name and the type? 
I think the main motivation for member names is to be able to reorder 
the union schema and still match up the columns between schema versions. 
For thrift, the only part that we need is the name. Avro is a bit 
different, but I don't think it will require the names at all so we 
could go with the current memberN format.


rb

On 11/04/2015 02:41 PM, Jason Altekruse wrote:

Hello Parquet devs,

The Drill team is currently working on an implementation of Union type and
we have begun evaluating what is needed to make it work with all parts of
the engine. Two of the core features of Drill are the Parquet reader and
writer, which provide access to Drill's fastest input format (parquet file
creation is supported through CREATE TABLE AS statements). I have been
taking a look at the existing implementation of the Union type support
implemented in parquet-avro. It looks like Hive has not yet implemented
support for the Union type in parquet [1]. It looks like thrift unions are
implemented as well, but I haven't looked at them in detail.

Our primary goal in our implementation will be handling the JSON data model
accurately, as it is what Drill's data model has been based on. Take for
example this small set of JSON records. With the union type addition that
was recently merged into Drill, we have added support for these two data
types, integer and varchar, to coexist in a single column with our new Union
type.

{ "user_id" : "james" }
{ "user_id" : 12345 }

In addition to transitions between different scalar types, we also will
need to support transitioning any column into a complex type like a map or
list. Thus the following dataset would be supported as well. This extends
to requiring support for unions that themselves contain nested unions. I
believe that these requirements are going to be common among the other
object models.

{ "account_admin" : "james" }
{ "account_admin" : 12345 }
{ "account_admin" : [12345, 1000, 98765] }
{ "account_admin" : ["Timothy", "Carl"] }
{ "account_admin" : { "primary" : "jackie", "secondary" : "john" } }

// adding this record to the dataset is an example of requiring a union
within a union, as the nested columns have changed from string to int

{ "account_admin" : { "primary" : 11, "secondary" : 202 } }
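
To make the nested-union requirement concrete, here is a rough Python sketch 
that walks the records above and collects, per column, the set of types 
seen. It is not Drill's implementation; the function names and the type 
names ("varchar", "int", "list", "map") are illustrative assumptions.

```python
def type_name(value):
    """Name the type of a decoded JSON value (names are illustrative)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "varchar"
    if isinstance(value, list):
        return "list"
    if isinstance(value, dict):
        return "map"
    raise TypeError(f"unsupported JSON value: {value!r}")


def merge_value(union, value):
    """Record the type of `value` in `union`, recursing into maps and lists."""
    kind = type_name(value)
    if kind == "map":
        fields = union.setdefault("map", {})
        for key, child in value.items():
            merge_value(fields.setdefault(key, {}), child)
    elif kind == "list":
        elements = union.setdefault("list", {})
        for item in value:
            merge_value(elements, item)
    else:
        union.setdefault(kind, None)


def infer_unions(records):
    """Return {column: {type name: nested types or None}} for all records."""
    root = {}
    for record in records:
        merge_value(root, record)
    return root.get("map", {})


records = [
    {"account_admin": "james"},
    {"account_admin": 12345},
    {"account_admin": [12345, 1000, 98765]},
    {"account_admin": ["Timothy", "Carl"]},
    {"account_admin": {"primary": "jackie", "secondary": "john"}},
    {"account_admin": {"primary": 11, "secondary": 202}},
]
schema = infer_unions(records)
print(sorted(schema["account_admin"]))
print(sorted(schema["account_admin"]["map"]["primary"]))
```

Any entry that ends up with more than one type name is a union; here both 
account_admin itself and its nested primary and secondary fields qualify.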

The avro implementation of the Union type seems to require an understanding
of the Avro schema that is stored in the footer of the parquet file. As this
concept is extended to other object models like Drill and Hive, we think it
would be useful to have a discussion around a standard definition of the
Union logical type as was done with the List and Map types here [2]. I am
thinking that this standard should involve a description of the union types
that is independent of any one object model, and all of the object models
should map their features into a parquet standard logical Union type
definition.

We discussed this briefly in the hangout this morning and I mentioned that
I was considering proposing a change from the current avro approach, which uses
numeric indices in the column names. Instead I would like to propose
putting the type name in the column name of each particular leaf inside of
a union. For those unfamiliar with unions in Parquet, as well as
to confirm my understanding of the current avro model, here is an example
of how I believe this is handled today. For readability I'll just be using
JSON to describe the structure of the schema. I am going to say for now
that maps that appear in the document below will correspond to Parquet
groups, or intermediate nodes in the schema. They will not correspond to
the logical Map type that has been defined.

For this small subset of the data from above:
{ "account_admin" : "james" }
{ "account_admin" : 12345 }

As I understand it, an Avro schema maps this into parquet today as
follows:
{ "account_admin" : { "member0" : "james", "member1" : null } }
{ "account_admin" : { "member0" : null, "member1" : 12345 } }

Where member0 and member1 correspond to the position of these types as
specified in the avro schema definition.
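
The positional mapping can be sketched roughly as follows in Python. The 
helper to_member_group is hypothetical and uses Python types as a stand-in 
for the union branches declared in the Avro schema.

```python
def to_member_group(value, member_types):
    """Place `value` in the memberN slot whose declared type matches it.

    `member_types` is ordered the same way as the union in the schema,
    so the list index plays the role of N in memberN.
    """
    group = {f"member{i}": None for i in range(len(member_types))}
    for i, member_type in enumerate(member_types):
        if isinstance(value, member_type):
            group[f"member{i}"] = value
            return group
    raise TypeError(f"{value!r} matches no union member")


union = [str, int]  # position 0 -> member0, position 1 -> member1
print({"account_admin": to_member_group("james", union)})
print({"account_admin": to_member_group(12345, union)})
```

Because the slot is chosen by position alone, reordering the union in the 
schema changes which column a value lands in.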

I was 

Re: Proposal for Union type

2015-11-10 Thread Alex Levenson
Oh, and one other note about thrift:

In thrift the field ID is the primary identifier, the name shouldn't really
be used to identify anything. It's safe to change the name of the union
members in a thrift IDL as long as the field IDs remain the same.

It'd be nice if parquet-format had a notion of field primary ID that could
be optionally decoupled from the field name.
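
A small Python sketch of what matching union members by field ID rather 
than by name could look like. The Field class and match_by_id are 
hypothetical names for illustration, not part of thrift or parquet-format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int  # stable identifier, as in a thrift IDL
    name: str      # free to change between schema versions


def match_by_id(old_fields, new_fields):
    """Map each old field name to the new name that shares its field ID."""
    new_by_id = {field.field_id: field for field in new_fields}
    return {
        old.name: new_by_id[old.field_id].name
        for old in old_fields
        if old.field_id in new_by_id
    }


old = [Field(1, "one"), Field(2, "two")]
new = [Field(2, "second"), Field(1, "first")]  # renamed and reordered
print(match_by_id(old, new))
```

Renaming or reordering members leaves the match intact as long as the IDs 
are unchanged.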


Re: Proposal for Union type

2015-11-10 Thread Alex Levenson
A few thoughts about this:

There are a few ways to think about column projection over unions:
1) As a *filter* not a projection. For example, if I have a projection like
`select(a.b.c.two)` where a.b.c is a union of {one,two,three}, then what
I'm really saying is "give me all the records *where* a.b.c *is* a two, and
then give me that data."

2) As only a projection, and so it's valid to say `select(a.b.c)` but
`select(a.b.c.two)` is nonsensical and not allowed.

3) The way parquet-thrift currently handles unions: when you select some
columns and any of them selects *part* of a union, an arbitrary column is
kept from each of the other *parts* of that union. This is done in order to
determine which kind of member a particular record was for a given union.
This only works because in thrift there's a wrapper object per union member,
so we can project all but 1 column away from that type. In the case of a
union of primitives, we wind up just keeping all the primitives.

I actually like option 1 the best; it seems the most correct to me as far
as user intention goes.
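
A rough Python sketch contrasting option 1 with the "return null" behavior 
raised earlier in the thread. Records are modeled as (member name, value) 
pairs for the union a.b.c; both function names are hypothetical.

```python
def select_member_as_filter(records, member):
    """Option 1: keep only records whose union holds `member`."""
    return [value for kind, value in records if kind == member]


def select_member_null_fill(records, member):
    """Alternative: keep every record, None where another member is set."""
    return [value if kind == member else None for kind, value in records]


records = [("one", 1), ("two", "b"), ("three", 3.0), ("two", "d")]
print(select_member_as_filter(records, "two"))
print(select_member_null_fill(records, "two"))
```

The filter form changes the number of records returned, while the 
null-fill form preserves the record count but cannot distinguish "another 
member was set" from a genuinely null value.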
