[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296230#comment-16296230
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/20/17 6:41 AM:
--------------------------------------------------------------

h4. Lists

_*NOTE:* This section describes the Drill {{LIST}} type which turns out to be 
broken and not supported. The following is based on a prototype using the 
{{LIST}} type created after fixing some, but not all, {{LIST}} bugs._

JSON supports arrays of the form:

{code}
{a: ["I'm", "an", "array"] }
{code}

Drill has two very different ways to represent arrays:

1. As a {{REPEATED}} cardinality for most data types. This gives rise to a 
{{RepeatedFooVector}} for some type {{Foo}}.
2. As a {{LIST}} type with the {{ListVector}} implementation.

Here, Arrow has done a nice job. Arrow unified the {{REPEATED}} cardinality and 
the {{LIST}} vector type into a single concept. Drill, however, still has two 
systems.

h4. Repeated Cardinality

Drill's "go to" way to handle arrays is with the {{REPEATED}} cardinality (AKA 
"repeated data mode.") Most readers that handle arrays use the {{REPEATED}} 
form. To help understand the {{LIST}} type, we review {{REPEATED}} support here.

When working with a {{REPEATED} column, the rules for nulls  are:

* Arrays may not contain nulls. (Drill does not support nulls as array 
elements.)
* A null (or missing) array field is treated the same as an empty array.

If JSON were to use the {{REPEATED}} vectors, the following would be invalid:

{code}
[10, null, 20]
{code}

The following are all valid with {{REPEATED}} vectors:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

h4. Properties of Lists

The key properties of a List relative to a repeated type are:

* Each list can be a list of nothing, a list of a single type, or a list of a 
union of multiple types.
* Each list value can be null.
* Each list entry can be null (for primitive types.)

h4. Null Support

As explained below, lists support three kinds of nullability:

* The type itself can be null (a list of nulls)
* The list column value for a row can be null. (This is in contrast to repeated 
types in which an array can be empty, but the entire array value cannot be 
null.)
* When the list is of a primitive type, entries can be null. (The list is 
defined as list of nullable items, such as nullable BIGINT.)
* When the list is of maps, the map entries *cannot* be null. (Instead, the map 
columns are nullable and all columns for the "null" map are set to null.)
* When the list is of other lists, the list entry *cannot* be null. (Instead, 
the nested list is empty.)

The semantics are a bit confusing when seen from the outside. They make 
slightly more sense based on the implementation choices made in the code. 
(Though, generally we want the code to match our requirements, not the other 
way around.)

h4. Lists are Obscure

The {{LIST}} type appears to be used only for JSON, and it is unclear how well 
supported it is in the rest of Drill. For example, it is not clear that 
functions that work with arrays correctly handle null entries. (This needs to 
be tested.) 

JDBC supports array columns, but it is not clear if the Drill JDBC driver has 
implemented them. ODBC doesn't support arrays at all, so whether it supports 
arrays with nulls is a moot point.

h4. Lists in JSON

The {{LIST}} type appears to be used only for JSON where it is a better fit for 
JSON semantics than Drill's normal {{REPEATED}} cardinality. The list type 
allows list members to be null. All of the following are legal using lists:

{code}
{a: null}
{a: []}
{a: [null, null]}
{a: [null, 10, null]}
{a: [10, "foo"]}
{code}

We'll look at each of these in detail.

h4. Degenerate Lists

Consider the simplest possible list in JSON: a file that contains only an empty 
list:

{noformat}
{a: []}
{noformat}

What is the type of the list? In JSON, lists have no type, they are just lists. 
Drill requires a type, however when working with a {{REPEATED}} cardinality: 
the column must be an array of something.

Lists, however, can be a list of only nulls using the obscure {{LATE}} data 
type. That is, the list exists, but has no type. ({{LATE}} seems to suggest 
that the type will be assigned later.)

Next, consider another degenerate array:

{noformat}
{a: [null, null]}
{noformat}

Here we have an array of nulls. Again, we don't know what type these are a null 
of. Again, a LIST allows the JSON reader to produce a row with a single column 
{{`a`}} that is of type {{LIST}} that contains only the "dummy" {{LATE}} type. 
The list will indicate that we have two entries, both of which are null.

It is unclear, however, if the rest of Drill supports this concept. (DRILL-5970 
discusses a case in which an empty array, with a List of {{LATE}}, is exported 
to Parquet, producing results different than one might naively expect.)

h4. Single-type Lists

The typical JSON use case is create lists with a single type, optionally 
containing nulls:

{noformat}
{a: [10, null, 20] }
{noformat}

In this case, the List operates in its' second mode: as a list of a single 
type, such as a list of BIGINT above. Note that the list itself handles the 
nullability of each entry. (But, to be compatible with unions, the list 
actually contains a list of nullable BIGINT, so the actual implementation has 
two separate null bits. It is not clear if the code properly handles the four 
resulting null bit combinations: (null list entry, null value), (null list 
entry, not-null value), (not-null list entry, null value), (not-null list 
entry, not-null value). It may be that code assumes the two null bits to have 
the same value, but this is an awkwardness in the current design.

As JSON reads, it starts the list as a list of {{LATE}} (nulls). Only when it 
sees the first non-null value does it change the list to a list of something 
(the type for the first non-null value.)

h4. List of Union

Finally we come to the third of the list's three modes. The list starts with 0 
types, then adds one type. If union support is enabled, the list can grow to 
include other types as well, becoming a list of unions:

{noformat}
{a: [null, 10, "foo", 12.5, {b: "a map!"}, ["a", "list"] }
{noformat}

As it turns out, union support is very experimental: it works in JSON, but most 
of the rest of Drill does not support it. (More on unions later.)


was (Author: paul.rogers):
h4. Lists

JSON supports arrays of the form:

{code}
{a: ["I'm", "an", "array"] }
{code}

Drill has two very different ways to represent arrays:

1. As a {{REPEATED}} cardinality for most data types. This gives rise to a 
{{RepeatedFooVector}} for some type {{Foo}}.
2. As a {{LIST}} type with the {{ListVector}} implementation.

Here, Arrow has done a nice job. Arrow unified the {{REPEATED}} cardinality and 
the {{LIST}} vector type into a single concept. Drill, however, still has two 
systems.

h4. Repeated Cardinality

Drill's "go to" way to handle arrays is with the {{REPEATED}} cardinality (AKA 
"repeated data mode.") Most readers that handle arrays use the {{REPEATED}} 
form. To help understand the {{LIST}} type, we review {{REPEATED}} support here.

When working with a {{REPEATED} column, the rules for nulls  are:

* Arrays may not contain nulls. (Drill does not support nulls as array 
elements.)
* A null (or missing) array field is treated the same as an empty array.

If JSON were to use the {{REPEATED}} vectors, the following would be invalid:

{code}
[10, null, 20]
{code}

The following are all valid with {{REPEATED}} vectors:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

h4. Properties of Lists

The key properties of a List relative to a repeated type are:

* Each list can be a list of nothing, a list of a single type, or a list of a 
union of multiple types.
* Each list value can be null.
* Each list entry can be null (for primitive types.)

h4. Null Support

As explained below, lists support three kinds of nullability:

* The type itself can be null (a list of nulls)
* The list column value for a row can be null. (This is in contrast to repeated 
types in which an array can be empty, but the entire array value cannot be 
null.)
* When the list is of a primitive type, entries can be null. (The list is 
defined as list of nullable items, such as nullable BIGINT.)
* When the list is of maps, the map entries *cannot* be null. (Instead, the map 
columns are nullable and all columns for the "null" map are set to null.)
* When the list is of other lists, the list entry *cannot* be null. (Instead, 
the nested list is empty.)

The semantics are a bit confusing when seen from the outside. They make 
slightly more sense based on the implementation choices made in the code. 
(Though, generally we want the code to match our requirements, not the other 
way around.)

h4. Lists are Obscure

The {{LIST}} type appears to be used only for JSON, and it is unclear how well 
supported it is in the rest of Drill. For example, it is not clear that 
functions that work with arrays correctly handle null entries. (This needs to 
be tested.) 

JDBC supports array columns, but it is not clear if the Drill JDBC driver has 
implemented them. ODBC doesn't support arrays at all, so whether it supports 
arrays with nulls is a moot point.

h4. Lists in JSON

The {{LIST}} type appears to be used only for JSON where it is a better fit for 
JSON semantics than Drill's normal {{REPEATED}} cardinality. The list type 
allows list members to be null. All of the following are legal using lists:

{code}
{a: null}
{a: []}
{a: [null, null]}
{a: [null, 10, null]}
{a: [10, "foo"]}
{code}

We'll look at each of these in detail.

h4. Degenerate Lists

Consider the simplest possible list in JSON: a file that contains only an empty 
list:

{noformat}
{a: []}
{noformat}

What is the type of the list? In JSON, lists have no type, they are just lists. 
Drill requires a type, however when working with a {{REPEATED}} cardinality: 
the column must be an array of something.

Lists, however, can be a list of only nulls using the obscure {{LATE}} data 
type. That is, the list exists, but has no type. ({{LATE}} seems to suggest 
that the type will be assigned later.)

Next, consider another degenerate array:

{noformat}
{a: [null, null]}
{noformat}

Here we have an array of nulls. Again, we don't know what type these are a null 
of. Again, a LIST allows the JSON reader to produce a row with a single column 
{{`a`}} that is of type {{LIST}} that contains only the "dummy" {{LATE}} type. 
The list will indicate that we have two entries, both of which are null.

It is unclear, however, if the rest of Drill supports this concept. (DRILL-5970 
discusses a case in which an empty array, with a List of {{LATE}}, is exported 
to Parquet, producing results different than one might naively expect.)

h4. Single-type Lists

The typical JSON use case is create lists with a single type, optionally 
containing nulls:

{noformat}
{a: [10, null, 20] }
{noformat}

In this case, the List operates in its' second mode: as a list of a single 
type, such as a list of BIGINT above. Note that the list itself handles the 
nullability of each entry. (But, to be compatible with unions, the list 
actually contains a list of nullable BIGINT, so the actual implementation has 
two separate null bits. It is not clear if the code properly handles the four 
resulting null bit combinations: (null list entry, null value), (null list 
entry, not-null value), (not-null list entry, null value), (not-null list 
entry, not-null value). It may be that code assumes the two null bits to have 
the same value, but this is an awkwardness in the current design.

As JSON reads, it starts the list as a list of {{LATE}} (nulls). Only when it 
sees the first non-null value does it change the list to a list of something 
(the type for the first non-null value.)

h4. List of Union

Finally we come to the third of the list's three modes. The list starts with 0 
types, then adds one type. If union support is enabled, the list can grow to 
include other types as well, becoming a list of unions:

{noformat}
{a: [null, 10, "foo", 12.5, {b: "a map!"}, ["a", "list"] }
{noformat}

As it turns out, union support is very experimental: it works in JSON, but most 
of the rest of Drill does not support it. (More on unions later.)

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to