[ 
https://issues.apache.org/jira/browse/TAJO-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyunsik Choi reassigned TAJO-710:
---------------------------------

    Assignee: Hyunsik Choi  (was: David Chen)

> Add support for nested schemas and non-scalar types
> ---------------------------------------------------
>
>                 Key: TAJO-710
>                 URL: https://issues.apache.org/jira/browse/TAJO-710
>             Project: Tajo
>          Issue Type: New Feature
>          Components: data type
>            Reporter: David Chen
>            Assignee: Hyunsik Choi
>
> Add support for nested schemas and non-scalar types (maps, arrays, enums, and 
> unions). Here are some ways other systems handle nested schemas:
>  * Pig and Hive use complex data types, such as bags, structs, and arrays.
>  * Impala doesn't support nested schemas or non-scalar data types 
> (http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_unsupported.html)
>  and disallows complex types in its Parquet support 
> (http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_parquet.html).
>  * Presto also does not support non-scalar types 
> (http://prestodb.io/docs/current/language/types.html)
> From the discussion in TAJO-30:
> {quote}
> I have a plan for nested schemas. Currently, Tajo only supports a flat schema, 
> like a relational DBMS. So, even if Tajo is extended to a nested data model, 
> it will not break compatibility.
> I'm thinking that Tajo should adopt the Parquet data model (= protobuf or 
> BigQuery). When considering a nested data model, I had two main points in 
> mind, and the Parquet data model satisfies both. The first point is the 
> processing model for nested data. The Parquet data model is the same as that 
> of BigQuery, and BigQuery has already defined a processing model that 
> includes flattening, cross products over repeated fields, and aggregation 
> over repeated fields [1][2]. The second point is the file format. Parquet is 
> a native file format for this model and already includes an efficient record 
> assembly method. Besides, Parquet is already mature and widely used in many 
> systems.
> [1] http://research.google.com/pubs/pub36632.html
> [2] https://developers.google.com/bigquery/docs/data
> I'm thinking that we need three stages for this work. First, we can start 
> with a small change to improve our schema system. Then, we will add a 
> physical operator that simply flattens one nested row into a number of flat 
> rows. Finally, we will solve query optimization issues such as 
> projection/filter pushdown on nested schemas and add physical operators that 
> process nested rows directly.
> If you have any ideas, feel free to share them with us.
> Thanks,
> Hyunsik
> {quote}
> This ticket may need to be broken up into multiple sub-tasks. Each sub-task 
> will involve defining an extension to the query language to support the data 
> type, implementing the new data type, then adding support for the data type 
> in each of the storage types. I have opened tickets for each of these four 
> tasks but not as subtasks because it is very likely that each of these tasks 
> will have subtasks of their own:
>  * TAJO-721: Adding support for nested records
>  * TAJO-722: Adding support for maps
>  * TAJO-723: Adding support for arrays
>  * TAJO-724: Adding support for unions
> Adding support for the enum type can be considered, but it is lower 
> priority than the other four complex types. Neither Hive nor Pig currently 
> has an enum type (even though storage formats such as Avro and Parquet do); 
> I believe both simply convert enum values to strings.
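
To make the flattening stage from the quoted plan concrete, here is a rough Python sketch of expanding one nested row into flat rows, using a cross product over repeated fields. It is not Tajo code: the function name flatten_row, the dotted column naming, and the choice to emit a single NULL for an empty repeated field are all assumptions made for illustration, and it only handles repeated scalar fields (lists of nested records are left as-is).

```python
from itertools import product


def flatten_row(row):
    """Expand one nested row (a dict) into a list of flat rows.

    Hypothetical helper, not part of Tajo. Nested dicts become dotted
    column names; each list (repeated field) contributes one output row
    per element, and several lists combine as a cross product.
    """
    # Maps each flat column name to the list of values it can take.
    columns = {}

    def walk(prefix, value):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(f"{prefix}.{key}" if prefix else key, child)
        elif isinstance(value, list):
            # Assumption: an empty repeated field still yields one row (NULL).
            columns[prefix] = value or [None]
        else:
            columns[prefix] = [value]

    walk("", row)
    names = list(columns)
    return [dict(zip(names, combo)) for combo in product(*columns.values())]


nested = {"id": 1, "name": {"first": "a"}, "tags": ["x", "y"], "scores": [10, 20]}
flat_rows = flatten_row(nested)
```

With two repeated fields of two elements each, the example row expands into four flat rows, which is the cross-product behaviour the plan refers to. A real operator would work on typed tuples and a schema rather than on dicts.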



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
