[
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
]
Paul Rogers edited comment on DRILL-6035 at 12/20/17 6:39 AM:
--------------------------------------------------------------
h4. JSON Arrays
Drill supports simple arrays in JSON using the following rules:
* Arrays must contain hetrogeneous elements: any of the scalars described
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner
level.
(See a later comment for nested arrays.)
For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}
h4. Schema Change in Arrays
The following will trigger errors:
{code}
{a: [10, "foo"]} // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}
h4. Nulls in Arrays
h4. Missing {{LIST}} Support
JSON arrays can contain nulls. Drill provides a (partially completed,
inoperable) {{LIST}} type as described below that handles nulls. But, this
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types
which cannot handle nulls. (The {{LIST}} type is described in a separate note
below.)
Using array types, the following rules apply to nulls:
* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That
is, an empty array and a {{null}} value are considered the same.)
h4. Late Type Identification
As described earlier, Drill 1.13 will defer picking an array type if it sees
null values. For example:
{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}
In the above example, for id=2, Drill sees column `a` but does not pick a type.
For id=3, Drill identifies that `a` is an array, but does not know the type.
Finally, for id=4, Drill identifies the array as {{BIGINT}}.
h4. Null-Only Arrays
A special case occurs if a JSON file contains only empty arrays or arrays of
nulls (such as a file that contains only the first three records above.)
In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the
List section below.) It seems that {{SqlLine}} will correctly show the null
values.
An interesting case occurs when Drill reads two files: one with an array with
only nulls, another with real values. For example:
{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}
(The above condition can occur only if JSON uses the broken {{LIST}} type; it
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A
contains:
{noformat}
{a: []}
{noformat}
Drill is distributed: one fragment will read File A, another will read File B.
At some point, the two arrays will come together. One fragment will have
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will
trigger a schema change error in this case.
Interestingly, however, if the query is a simple {{SELECT *}}, then the lists
are compatible and {{SqlLine}} will display the correct results.
In Drill 1.13, if the first batch contains only nulls and/or empty arrays,
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a
guess, a schema change will result if the guess is wrong.
was (Author: paul.rogers):
h4. JSON Arrays
Drill supports simple arrays in JSON using the following rules:
* Arrays must contain hetrogeneous elements: any of the scalars described
above, or a JSON object.
(See a later comment for nested arrays.)
For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}
h4. Schema Change in Arrays
The following will trigger errors:
{code}
{a: [10, "foo"]} // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}
h4. Nulls in Arrays
Drill handles nulls in arrays using the {{LIST}} type, described in a separate
note below.
h4. Late Type Identification
As described earlier, Drill will defer picking an array type if it sees null
values. For example:
{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}
In the above example, for id=2, Drill sees column `a` but does not pick a type.
For id=3, Drill identifies that `a` is an array, but does not know the type.
Finally, for id=4, Drill identifies the array as {{BIGINT}}.
h4. Null-Only Arrays
A special case occurs if a JSON file contains only empty arrays or arrays of
nulls (such as a file that contains only the first three records above.)
In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the
List section below.) It seems that {{SqlLine}} will correctly show the null
values.
An interesting case occurs when Drill reads two files: one with an array with
only nulls, another with real values. For example:
{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}
Drill is distributed: one fragment will read File A, another will read File B.
At some point, the two arrays will come together. One fragment will have
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will
trigger a schema change error in this case.
Interestingly, however, if the query is a simple {{SELECT *}}, then the lists
are compatible and {{SqlLine}} will display the correct results.
In Drill 1.13, if the first batch contains only nulls and/or empty arrays,
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a
guess, a schema change will result if the guess is wrong.
> Specify Drill's JSON behavior
> -----------------------------
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.13.0
> Reporter: Paul Rogers
> Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests
> that Drill may have limitations in the JSON that Drill supports. This ticket
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed
> specifications that clarifies what Drill does and does not support (or what
> is should and should not support.)
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)