[
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296230#comment-16296230
]
Paul Rogers edited comment on DRILL-6035 at 12/19/17 5:30 AM:
--------------------------------------------------------------
h4. Lists
JSON supports arrays of the form:
{code}
{a: ["I'm", "an", "array"] }
{code}
Drill has two very different ways to represent arrays:
1. As a {{REPEATED}} cardinality for most data types. This gives rise to a
{{RepeatedFooVector}} for some type {{Foo}}.
2. As a {{LIST}} type with the {{ListVector}} implementation.
Here, Arrow has done a nice job. Arrow unified the {{REPEATED}} cardinality and
the {{LIST}} vector type into a single concept. Drill, however, still has two
systems.
h4. Repeated Cardinality
Drill's "go to" way to handle arrays is with the {{REPEATED}} cardinality (AKA
"repeated data mode.") Most readers that handle arrays use the {{REPEATED}}
form. To help understand the {{LIST}} type, we review {{REPEATED}} support here.
When working with a {{REPEATED} column, the rules for nulls are:
* Arrays may not contain nulls. (Drill does not support nulls as array
elements.)
* A null (or missing) array field is treated the same as an empty array.
If JSON were to use the {{REPEATED}} vectors, the following would be invalid:
{code}
[10, null, 20]
{code}
The following are all valid with {{REPEATED}} vectors:
{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}
h4. Properties of Lists
The key properties of a List relative to a repeated type are:
* Each list can be a list of nothing, a list of a single type, or a list of a
union of multiple types.
* Each list value can be null.
* Each list entry can be null (for primitive types.)
h4. Null Support
As explained below, lists support three kinds of nullability:
* The type itself can be null (a list of nulls)
* The list column value for a row can be null. (This is in contrast to repeated
types in which an array can be empty, but the entire array value cannot be
null.)
* When the list is of a primitive type, entries can be null. (The list is
defined as list of nullable items, such as nullable BIGINT.)
* When the list is of maps, the map entries *cannot* be null. (Instead, the map
columns are nullable and all columns for the "null" map are set to null.)
* When the list is of other lists, the list entry *cannot* be null. (Instead,
the nested list is empty.)
The semantics are a bit confusing when seen from the outside. They make
slightly more sense based on the implementation choices made in the code.
(Though, generally we want the code to match our requirements, not the other
way around.)
h4. Lists are Obscure
The {{LIST}} type appears to be used only for JSON, and it is unclear how well
supported it is in the rest of Drill. For example, it is not clear that
functions that work with arrays correctly handle null entries. (This needs to
be tested.)
JDBC supports array columns, but it is not clear if the Drill JDBC driver has
implemented them. ODBC doesn't support arrays at all, so whether it supports
arrays with nulls is a moot point.
h4. Lists in JSON
The {{LIST}} type appears to be used only for JSON where it is a better fit for
JSON semantics than Drill's normal {{REPEATED}} cardinality. The list type
allows list members to be null. All of the following are legal using lists:
{code}
{a: null}
{a: []}
{a: [null, null]}
{a: [null, 10, null]}
{a: [10, "foo"]}
{code}
We'll look at each of these in detail.
h4. Degenerate Lists
Consider the simplest possible list in JSON: a file that contains only an empty
list:
{noformat}
{a: []}
{noformat}
What is the type of the list? In JSON, lists have no type, they are just lists.
Drill requires a type, however when working with a {{REPEATED}} cardinality:
the column must be an array of something.
Lists, however, can be a list of only nulls using the obscure {{LATE}} data
type. That is, the list exists, but has no type. ({{LATE}} seems to suggest
that the type will be assigned later.)
Next, consider another degenerate array:
{noformat}
{a: [null, null]}
{noformat}
Here we have an array of nulls. Again, we don't know what type these are a null
of. Again, a LIST allows the JSON reader to produce a row with a single column
{{`a`}} that is of type {{LIST}} that contains only the "dummy" {{LATE}} type.
The list will indicate that we have two entries, both of which are null.
It is unclear, however, if the rest of Drill supports this concept. (DRILL-5970
discusses a case in which an empty array, with a List of {{LATE}}, is exported
to Parquet, producing results different than one might naively expect.)
h4. Single-type Lists
The typical JSON use case is create lists with a single type, optionally
containing nulls:
{noformat}
{a: [10, null, 20] }
{noformat}
In this case, the List operates in its' second mode: as a list of a single
type, such as a list of BIGINT above. Note that the list itself handles the
nullability of each entry. (But, to be compatible with unions, the list
actually contains a list of nullable BIGINT, so the actual implementation has
two separate null bits. It is not clear if the code properly handles the four
resulting null bit combinations: (null list entry, null value), (null list
entry, not-null value), (not-null list entry, null value), (not-null list
entry, not-null value). It may be that code assumes the two null bits to have
the same value, but this is an awkwardness in the current design.
As JSON reads, it starts the list as a list of {{LATE}} (nulls). Only when it
sees the first non-null value does it change the list to a list of something
(the type for the first non-null value.)
h4. List of Union
Finally we come to the third of the list's three modes. The list starts with 0
types, then adds one type. If union support is enabled, the list can grow to
include other types as well, becoming a list of unions:
{noformat}
{a: [null, 10, "foo", 12.5, {b: "a map!"}, ["a", "list"] }
{noformat}
As it turns out, union support is very experimental: it works in JSON, but most
of the rest of Drill does not support it. (More on unions later.)
was (Author: paul.rogers):
h4. Lists
JSON supports arrays of the form:
{code}
{a: ["I'm", "an", "array"] }
{code}
Drill has two very different ways to represent arrays:
1. As a {{REPEATED}} cardinality for most data types. This gives rise to a
{{RepeatedFooVector}} for some type {{Foo}}.
2. As a {{LIST}} type with the {{ListVector}} implementation.
Here, Arrow has done a nice job. Arrow unified the {{REPEATED}} cardinality and
the {{LIST}} vector type into a single concept. Drill, however, still has two
systems.
The key properties of a List relative to a repeated type are:
* Each list can be a list of nothing, a list of a single type, or a list of a
union of multiple types.
* Each list value can be null.
* Each list entry can be null (for primitive types.)
h4. Null Support
As explained below, lists support three kinds of nullability:
* The type itself can be null (a list of nulls)
* The list column value for a row can be null. (This is in contrast to repeated
types in which an array can be empty, but the entire array value cannot be
null.)
* When the list is of a primitive type, entries can be null. (The list is
defined as list of nullable items, such as nullable BIGINT.)
* When the list is of maps, the map entries *cannot* be null. (Instead, the map
columns are nullable and all columns for the "null" map are set to null.)
* When the list is of other lists, the list entry *cannot* be null. (Instead,
the nested list is empty.)
The semantics are a bit confusing when seen from the outside. They make
slightly more sense based on the implementation choices made in the code.
(Though, generally we want the code to match our requirements, not the other
way around.)
h4. Lists are Obscure
The {{LIST}} type appears to be used only for JSON, and it is unclear how well
supported it is in the rest of Drill. For example, it is not clear that
functions that work with arrays correctly handle null entries. (This needs to
be tested.)
JDBC supports array columns, but it is not clear if the Drill JDBC driver has
implemented them. ODBC doesn't support arrays at all, so whether it supports
arrays with nulls is a moot point.
h4. Lists in JSON
The {{LIST}} type appears to be used only for JSON where it is a better fit for
JSON semantics than Drill's normal {{REPEATED}} cardinality. The list type
allows list members to be null. All of the following are legal using lists:
{code}
{a: null}
{a: []}
{a: [null, null]}
{a: [null, 10, null]}
{a: [10, "foo"]}
{code}
We'll look at each of these in detail.
h4. Degenerate Lists
Consider the simplest possible list in JSON: a file that contains only an empty
list:
{noformat}
{a: []}
{noformat}
What is the type of the list? In JSON, lists have no type, they are just lists.
Drill requires a type, however when working with a {{REPEATED}} cardinality:
the column must be an array of something.
Lists, however, can be a list of only nulls using the obscure {{LATE}} data
type. That is, the list exists, but has no type. ({{LATE}} seems to suggest
that the type will be assigned later.)
Next, consider another degenerate array:
{noformat}
{a: [null, null]}
{noformat}
Here we have an array of nulls. Again, we don't know what type these are a null
of. Again, a LIST allows the JSON reader to produce a row with a single column
{{`a`}} that is of type {{LIST}} that contains only the "dummy" {{LATE}} type.
The list will indicate that we have two entries, both of which are null.
It is unclear, however, if the rest of Drill supports this concept. (DRILL-5970
discusses a case in which an empty array, with a List of {{LATE}}, is exported
to Parquet, producing results different than one might naively expect.)
h4. Single-type Lists
The typical JSON use case is create lists with a single type, optionally
containing nulls:
{noformat}
{a: [10, null, 20] }
{noformat}
In this case, the List operates in its' second mode: as a list of a single
type, such as a list of BIGINT above. Note that the list itself handles the
nullability of each entry. (But, to be compatible with unions, the list
actually contains a list of nullable BIGINT, so the actual implementation has
two separate null bits. It is not clear if the code properly handles the four
resulting null bit combinations: (null list entry, null value), (null list
entry, not-null value), (not-null list entry, null value), (not-null list
entry, not-null value). It may be that code assumes the two null bits to have
the same value, but this is an awkwardness in the current design.
As JSON reads, it starts the list as a list of {{LATE}} (nulls). Only when it
sees the first non-null value does it change the list to a list of something
(the type for the first non-null value.)
h4. List of Union
Finally we come to the third of the list's three modes. The list starts with 0
types, then adds one type. If union support is enabled, the list can grow to
include other types as well, becoming a list of unions:
{noformat}
{a: [null, 10, "foo", 12.5, {b: "a map!"}, ["a", "list"] }
{noformat}
As it turns out, union support is very experimental: it works in JSON, but most
of the rest of Drill does not support it. (More on unions later.)
> Specify Drill's JSON behavior
> -----------------------------
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.13.0
> Reporter: Paul Rogers
> Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests
> that Drill may have limitations in the JSON that Drill supports. This ticket
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed
> specifications that clarifies what Drill does and does not support (or what
> is should and should not support.)
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)