[
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
]
Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:11 PM:
--------------------------------------------------------------
h4. JSON Arrays
Drill supports simple arrays in JSON using the following rules:
* Arrays must contain hetrogeneous elements: any of the scalars described
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Multi-dimensional arrays can contain nulls at the outer level but not the
inner level.
(See a later comment for nested arrays.)
For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}
h4. Schema Change in Arrays
The following will trigger errors:
{code}
{a: [10, "foo"]} // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}
h4. Nulls in Arrays
Using array types, the following rules apply to nulls:
* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type.
* An entire array can be null, which is represented as an empty array. (That
is, an empty array and a {{null}} value are considered the same.)
Thus, the following is valid JSON, but is not supported by Drill:
{code}
{a: [10, null, 20] }
{code}
The general rule is that Drill does not support nulls in one-dimensional JSON
arrays. (See below for the 2+D case.)
h4. "Deferred" Type Definition in Drill 1.13
Drill 1.13 (but likely not Drill 1.12) can handle some ambiguity in arrays:
{code}
{a: 1}
{a: 2, b: null}
{a: 3, b: []}
{a: 4, b: [10, 20]}
{code}
Here, the first three records are ambiguous. The first has no entry for {{b}},
the second says that {{b}} is {{null}}, but not the type of {{null}} (In JSON,
{{null}} is untyped, while in Drill {{NULL}} must have a type.) In the third
record we learn that {{b}} is an array, but do not yet know the type of the
array. Finally, in record four we learn that {{b}} is an array of {{BIGINT}}.
This mechanism works only if the type is revealed in the first batch. If the
type does not appear, Drill must guess a type, will guess array of text-mode.
This works within the same file if the array turns out to be, say, integers.
But, it will probably cause a schema conflict across files that see the
integers within the first batch. (The general rule is that Drill cannot predict
the future.)
h4. Missing {{LIST}} Support
JSON arrays can contain nulls. Drill provides a (partially completed,
inoperable) {{LIST}} type as described below that handles nulls. But, this
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types
which cannot handle nulls. (The {{LIST}} type is described in a separate note
below.)
h4. Multi-Dimensional Arrays
JSON supports lists (arrays) of arbitrary dimensions. Drill supports this via
two types:
* A base array in the form of a repeated cardinality of any primitive type or
map. (The so-called “Repeated mode.”)
* A extended array of base or extended arrays. (The so-called “repeated list.”)
Drill supports multi-dimensional arrays of a single type (including maps).
Unlike JSON:
* Drill arrays cannot hold nulls.
* Arrays themselves cannot be null (though they can be empty.)
Note that support for lists is via the {{RepeatedListVector}}. Note that this
is not simply the {{REPEATED}} form of the {{ListVector}}; it is something else
entirely. The {{ListVector}} is not supported at present (see below), but the
{{RepeatedListVector}} is.
For example:
{code}
{a: [[10, 20], [30, 40]]}
{code}
Drill (at least in 1.13) allows nulls in the outer dimension of an array:
{code}
{a: [[10, 20], null, [30, 40]]}
{code}
A null is the same as an empty array:
{code}
{a: [[10, 20], [ ], [30, 40]]}
{code}
Drill 1.13 allows "ambiguity" in arrays before seeing an actual value:
{code}
{a: null}
{a: [ ]}
{a: [ [ ] ]}
{a: [[10, 20], null, [30, 40]]}
{code}
The caveat is that the value must appear within the first batch. Otherwise, the
first batch must guess at a type, will guess text mode, and will conflict with
later values if they are arrays or maps.
Note that Drill *cannot* handle leading nulls before the type is seen. The
following is invalid:
{code}
{a: [null, [10, 20], [30, 40]]}
{code}
The reason is that Drill does not yet know the type at the time that the null
appears, and so does not know whether to treat it as an empty array or as a
null scalar (and, of course, nulls are not supported in scalar arrays.)
was (Author: paul.rogers):
h4. JSON Arrays
Drill supports simple arrays in JSON using the following rules:
* Arrays must contain hetrogeneous elements: any of the scalars described
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Multi-dimensional arrays can contain nulls at the outer level but not the
inner level.
(See a later comment for nested arrays.)
For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}
h4. Schema Change in Arrays
The following will trigger errors:
{code}
{a: [10, "foo"]} // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}
h4. Nulls in Arrays
The general rule is that Drill does not support nulls in JSON arrays. However,
there are many odd cases.
If the JSON array is 2+ dimensional, then Drill supports nulls in outer
dimensions but not inner ones. That is, the following is legal:
{code}
{a: [["null", "is"], null, ["allowed"]]}
{code}
Here, {{null}} behaves the same as an empty array: {{[ ]}}.
Note, however, that the above is *not* valid if the nulls come before the first
non-null value:
{code}
{a: [null, ["not", "allowed"]]}
{code}
The reason is that, in the previous example, Drill could determine that the
outer list is a list of string arrays. But, in the second case, Drill cannot
tell the array type when it sees the {{null}}: it might be an array of scalars,
so {{null}} is not allowed.
That said, the following *is* allowed:
{code}
{a: [["null", "is"], null, ["allowed"]]}
{a: [null, ["is", "allowed"]]}
{code}
Here, the first array showed the type; Drill remembers that type for the second
array.
h4. Missing {{LIST}} Support
JSON arrays can contain nulls. Drill provides a (partially completed,
inoperable) {{LIST}} type as described below that handles nulls. But, this
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types
which cannot handle nulls. (The {{LIST}} type is described in a separate note
below.)
Using array types, the following rules apply to nulls:
* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That
is, an empty array and a {{null}} value are considered the same.)
h4. Late Type Identification
As described earlier, Drill 1.13 will defer picking an array type if it sees
null values. For example:
{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}
In the above example, for id=2, Drill sees column `a` but does not pick a type.
For id=3, Drill identifies that `a` is an array, but does not know the type.
Finally, for id=4, Drill identifies the array as {{BIGINT}}.
h4. Null-Only Arrays
A special case occurs if a JSON file contains only empty arrays or arrays of
nulls (such as a file that contains only the first three records above.)
In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the
List section below.) It seems that {{SqlLine}} will correctly show the null
values.
An interesting case occurs when Drill reads two files: one with an array with
only nulls, another with real values. For example:
{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}
(The above condition can occur only if JSON uses the broken {{LIST}} type; it
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A
contains:
{noformat}
{a: []}
{noformat}
Drill is distributed: one fragment will read File A, another will read File B.
At some point, the two arrays will come together. One fragment will have
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will
trigger a schema change error in this case.
Interestingly, however, if the query is a simple {{SELECT *}}, then the lists
are compatible and {{SqlLine}} will display the correct results.
In Drill 1.13, if the first batch contains only nulls and/or empty arrays,
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a
guess, a schema change will result if the guess is wrong.
h4. Multi-Dimensional Arrays
Drill supports multi-dimensional arrays of a single type (including maps). For
example:
{code}
{a: [[10, 20], [30, 40]]}
{code}
Drill (at least in 1.13) allows nulls in the outer dimension of an array:
{code}
{a: [[10, 20], null, [30, 40]]}
{code}
A null is the same as an empty array:
{code}
{a: [[10, 20], [ ], [30, 40]]}
{code}
Drill 1.13 allows "ambiguity" in arrays before seeing an actual value:
{code}
{a: null}
{a: [ ]}
{a: [ [ ] ]}
{a: [[10, 20], null, [30, 40]]}
{code}
The caveat is that the value must appear within the first batch. Otherwise, the
first batch must guess at a type, will guess text mode, and will conflict with
later values if they are arrays or maps.
Note that Drill *cannot* handle leading nulls before the type is seen. The
following is invalid:
{code}
{a: [null, [10, 20], [30, 40]]}
{code}
The reason is that Drill does not yet know the type at the time that the null
appears, and so does not know whether to treat it as an empty array or as a
null scalar (and, of course, nulls are not supported in scalar arrays.)
> Specify Drill's JSON behavior
> -----------------------------
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.13.0
> Reporter: Paul Rogers
> Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests
> that Drill may have limitations in the JSON that Drill supports. This ticket
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed
> specifications that clarifies what Drill does and does not support (or what
> is should and should not support.)
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)