[
https://issues.apache.org/jira/browse/TAJO-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyunsik Choi reassigned TAJO-710:
---------------------------------
Assignee: Hyunsik Choi (was: David Chen)
> Add support for nested schemas and non-scalar types
> ---------------------------------------------------
>
> Key: TAJO-710
> URL: https://issues.apache.org/jira/browse/TAJO-710
> Project: Tajo
> Issue Type: New Feature
> Components: data type
> Reporter: David Chen
> Assignee: Hyunsik Choi
>
> Add support for nested schemas and non-scalar types (maps, arrays, enums, and
> unions). Here are some ways other systems handle nested schemas:
> * Pig and Hive use complex data types, such as bags, structs, and arrays.
> * Impala doesn't support nested schemas or non-scalar data types
> (http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_unsupported.html)
> and disallows complex types in its Parquet support
> (http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_parquet.html).
> * Presto also does not support non-scalar types
> (http://prestodb.io/docs/current/language/types.html).
> From the discussion in TAJO-30:
> {quote}
> I have a plan for nested schemas. Currently, Tajo only supports a flat
> schema, like a relational DBMS, so even if Tajo is extended to a nested data
> model, compatibility will not be broken.
> I'm thinking that Tajo should adopt the Parquet data model (the same as
> protobuf and BigQuery). When considering a nested data model, I focused on
> two main points, and the Parquet data model satisfies both. The first point
> is the processing model for nested data. Parquet's data model is the same as
> BigQuery's, and BigQuery has already established a processing model that
> includes flattening, cross products on repeated fields, and aggregation on
> repeated fields [1][2]. The second point is the file format. Parquet is a
> native file format for this model and already includes an efficient record
> assembly method. Moreover, Parquet is mature and widely used in many systems.
> [1] http://research.google.com/pubs/pub36632.html
> [2] https://developers.google.com/bigquery/docs/data
> I'm thinking that we need three stages for this work. First, we can start
> with a small change to improve our schema system. Then, we will add a
> physical operator that simply flattens one nested row into a number of flat
> rows. Finally, we will solve some query optimization issues, such as
> projection/filter push-down on nested schemas, and will add physical
> operators that process nested rows directly.
> If you have any ideas, feel free to share them with us.
> Thanks,
> Hyunsik
> {quote}
> This ticket may need to be broken up into multiple sub-tasks. Each sub-task
> will involve defining an extension to the query language to support the data
> type, implementing the new data type, and then adding support for it in each
> of the storage types. I have opened tickets for each of these four tasks,
> but not as sub-tasks, because it is very likely that each of them will have
> sub-tasks of its own:
> * TAJO-721: Adding support for nested records
> * TAJO-722: Adding support for maps
> * TAJO-723: Adding support for arrays
> * TAJO-724: Adding support for unions
> Adding support for the enum type is worth considering, but it is lower
> priority than the other four complex types. Neither Hive nor Pig currently
> has an enum type (even though storage formats such as Avro and Parquet do);
> both, I believe, simply convert enum values to strings.
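For reference, the flattening described in the quoted plan (one nested row expanded into a cross product of its repeated fields, as in BigQuery's FLATTEN) can be sketched as follows. This is a minimal illustration only, assuming rows are plain Python dicts and repeated fields are lists; `flatten_row` is a hypothetical name, not a Tajo or Parquet API:

```python
from itertools import product

def flatten_row(row, repeated_fields):
    """Expand one nested row into flat rows: the cross product of all
    repeated (list-valued) fields, with scalar fields copied into each.
    An empty repeated field yields no output rows (inner-join semantics)."""
    value_lists = [row[f] for f in repeated_fields]
    scalars = {k: v for k, v in row.items() if k not in repeated_fields}
    for combo in product(*value_lists):
        flat = dict(scalars)
        flat.update(zip(repeated_fields, combo))
        yield flat

# Example: a record with two repeated fields produces 2 x 2 = 4 flat rows.
doc = {"id": 1, "tags": ["a", "b"], "links": [10, 20]}
rows = list(flatten_row(doc, ["tags", "links"]))
```

A physical flatten operator in stage two of the plan would apply this kind of expansion per input row; aggregation on repeated fields then reduces to ordinary grouping over the flattened output.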
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)