[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:11 PM:
--------------------------------------------------------------

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Multi-dimensional arrays can contain nulls at the outer level but not the 
inner level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type.
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

Thus, the following is valid JSON, but is not supported by Drill:

{code}
{a: [10, null, 20] }
{code}

The general rule is that Drill does not support nulls in one-dimensional JSON 
arrays. (See below for the 2+D case.)

h4. "Deferred" Type Definition in Drill 1.13

Drill 1.13 (but likely not Drill 1.12) can handle some ambiguity in arrays:

{code}
{a: 1}
{a: 2, b: null}
{a: 3, b: []}
{a: 4, b: [10, 20]}
{code}

Here, the first three records are ambiguous. The first has no entry for {{b}}, 
the second says that {{b}} is {{null}}, but not the type of {{null}} (In JSON, 
{{null}} is untyped, while in Drill {{NULL}} must have a type.) In the third 
record we learn that {{b}} is an array, but do not yet know the type of the 
array. Finally, in record four we learn that {{b}} is an array of {{BIGINT}}.

This mechanism works only if the type is revealed in the first batch. If the 
type does not appear, Drill must guess a type, will guess array of text-mode. 
This works within the same file if the array turns out to be, say, integers. 
But, it will probably cause a schema conflict across files that see the 
integers within the first batch. (The general rule is that Drill cannot predict 
the future.)

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

h4. Multi-Dimensional Arrays

JSON supports lists (arrays) of arbitrary dimensions. Drill supports this via 
two types:

* A base array in the form of a repeated cardinality of any primitive type or 
map. (The so-called “Repeated mode.”)
* A extended array of base or extended arrays. (The so-called “repeated list.”)

Drill supports multi-dimensional arrays of a single type (including maps). 
Unlike JSON:

* Drill arrays cannot hold nulls.
* Arrays themselves cannot be null (though they can be empty.)

Note that support for lists is via the {{RepeatedListVector}}. Note that this 
is not simply the {{REPEATED}} form of the {{ListVector}}; it is something else 
entirely. The {{ListVector}} is not supported at present (see below), but the 
{{RepeatedListVector}} is.

For example:

{code}
{a: [[10, 20], [30, 40]]}
{code}

Drill (at least in 1.13) allows nulls in the outer dimension of an array:

{code}
{a: [[10, 20], null, [30, 40]]}
{code}

A null is the same as an empty array:

{code}
{a: [[10, 20], [ ], [30, 40]]}
{code}

Drill 1.13 allows "ambiguity" in arrays before seeing an actual value:
{code}
{a: null}
{a: [ ]}
{a: [ [ ] ]}
{a: [[10, 20], null, [30, 40]]}
{code}

The caveat is that the value must appear within the first batch. Otherwise, the 
first batch must guess at a type, will guess text mode, and will conflict with 
later values if they are arrays or maps.

Note that Drill *cannot* handle leading nulls before the type is seen. The 
following is invalid:

{code}
{a: [null, [10, 20], [30, 40]]}
{code}

The reason is that Drill does not yet know the type at the time that the null 
appears, and so does not know whether to treat it as an empty array or as a 
null scalar (and, of course, nulls are not supported in scalar arrays.)


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Multi-dimensional arrays can contain nulls at the outer level but not the 
inner level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

The general rule is that Drill does not support nulls in JSON arrays. However, 
there are many odd cases.

If the JSON array is 2+ dimensional, then Drill supports nulls in outer 
dimensions but not inner ones. That is, the following is legal:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{code}

Here, {{null}} behaves the same as an empty array: {{[ ]}}.

Note, however, that the above is *not* valid if the nulls come before the first 
non-null value:

{code}
{a: [null, ["not", "allowed"]]}
{code}

The reason is that, in the previous example, Drill could determine that the 
outer list is a list of string arrays. But, in the second case, Drill cannot 
tell the array type when it sees the {{null}}: it might be an array of scalars, 
so {{null}} is not allowed.

That said, the following *is* allowed:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{a: [null, ["is", "allowed"]]}
{code}

Here, the first array showed the type; Drill remembers that type for the second 
array.

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select 
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.

h4. Multi-Dimensional Arrays

Drill supports multi-dimensional arrays of a single type (including maps). For 
example:

{code}
{a: [[10, 20], [30, 40]]}
{code}

Drill (at least in 1.13) allows nulls in the outer dimension of an array:

{code}
{a: [[10, 20], null, [30, 40]]}
{code}

A null is the same as an empty array:

{code}
{a: [[10, 20], [ ], [30, 40]]}
{code}

Drill 1.13 allows "ambiguity" in arrays before seeing an actual value:
{code}
{a: null}
{a: [ ]}
{a: [ [ ] ]}
{a: [[10, 20], null, [30, 40]]}
{code}

The caveat is that the value must appear within the first batch. Otherwise, the 
first batch must guess at a type, will guess text mode, and will conflict with 
later values if they are arrays or maps.

Note that Drill *cannot* handle leading nulls before the type is seen. The 
following is invalid:

{code}
{a: [null, [10, 20], [30, 40]]}
{code}

The reason is that Drill does not yet know the type at the time that the null 
appears, and so does not know whether to treat it as an empty array or as a 
null scalar (and, of course, nulls are not supported in scalar arrays.)

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to