[
https://issues.apache.org/jira/browse/TAJO-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Chen updated TAJO-809:
----------------------------
Description:
This ticket is to track the work for defining the syntax for nested schemas,
maps, arrays, and unions and the work for adding the syntax to the parser.
Initially, we can add stubs for the parser endpoints that will then be fleshed
out when support for the data type is actually implemented (see other subtasks
of TAJO-710).
I have an idea of a possible DDL syntax for these types, and I would like to
get your feedback on it. I considered just using Hive's syntax but I felt that
it was not the best syntax for these types.
Instead of calling nested records "structs" as Hive does, I simply call them
records and reuse the syntax used for declaring the top-level record fields:
{code}
create table record_example (
  nested_field record (
    field1 int,
    field2 double),
  two_levels_nested record (
    inner_nested record (
      field3 string,
      field4 int),
    field5 int)
) using parquet;
{code}
For arrays, maps, and unions, I am using a syntax inspired by Scala's syntax
for generics:
{code}
create table array_example (
  int_array array[int],
  record_array array[record (
    field1 int,
    field2 string)]
) using avro;

create table map_example (
  string_to_int map[string, int],
  int_to_record map[int, record (
    field1 string,
    field2 int)]
) using avro;

create table union_example (
  integers union[bit, smallint, integer, bigint]
) using parquet;
{code}
Of course, the syntax may change as we implement these data types, but for
now I think we should define an initial version. Once the initial syntax has
stabilized, I will write a formal grammar for it.
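Until a formal grammar exists, the proposed type syntax can be explored with a small recursive-descent parser. The following is only a sketch under stated assumptions, not Tajo's actual parser: it is written in Python for brevity, the names `tokenize`, `TypeParser`, and `parse_type` are hypothetical, and the grammar it accepts (`record (name type, ...)`, `array[T]`, `map[K, V]`, `union[T1, T2, ...]`, or a bare scalar name) is inferred from the examples above.

```python
import re

# One token is either an identifier or a single punctuation character.
TOKEN_RE = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|[()\[\],])")


def tokenize(text):
    """Split a type expression into identifier and punctuation tokens."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}: {text[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class TypeParser:
    """Hypothetical parser for the proposed type syntax (assumed grammar)."""

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def next_name(self):
        tok = self.peek()
        if tok is None or not (tok[0].isalpha() or tok[0] == "_"):
            raise SyntaxError(f"expected a name, found {tok!r}")
        self.pos += 1
        return tok

    def parse_field(self):
        # field := name type
        field_name = self.next_name()
        return (field_name, self.parse_type())

    def parse_type(self):
        name = self.next_name().lower()
        if name == "record":
            # record := 'record' '(' field (',' field)* ')'
            self.expect("(")
            fields = [self.parse_field()]
            while self.peek() == ",":
                self.pos += 1
                fields.append(self.parse_field())
            self.expect(")")
            return ("record", fields)
        if name in ("array", "map", "union"):
            # generic := name '[' type (',' type)* ']'
            self.expect("[")
            params = [self.parse_type()]
            while self.peek() == ",":
                self.pos += 1
                params.append(self.parse_type())
            self.expect("]")
            arity = {"array": 1, "map": 2}.get(name)  # union takes one or more
            if arity is not None and len(params) != arity:
                raise SyntaxError(f"{name} takes {arity} type parameter(s), got {len(params)}")
            return (name, params)
        return name  # anything else is treated as a scalar type name


def parse_type(text):
    """Parse one complete type expression into nested tuples and lists."""
    parser = TypeParser(text)
    result = parser.parse_type()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return result
```

For example, `parse_type("array[int]")` returns `("array", ["int"])`, and a field list ending in a stray comma, such as `record (field1 int,)`, is rejected with a `SyntaxError`. Whether the real grammar should tolerate trailing commas is an open design choice; this sketch assumes it should not.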
> Language extension for non-scalar types
> ---------------------------------------
>
> Key: TAJO-809
> URL: https://issues.apache.org/jira/browse/TAJO-809
> Project: Tajo
> Issue Type: New Feature
> Reporter: David Chen