[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/28/17 6:42 PM:
--------------------------------------------------------------

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Multi-dimensional arrays can contain nulls at the outer level but not the 
inner level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

The general rule is that Drill does not support nulls in JSON arrays. However, 
there are many odd cases.

If the JSON array is 2+ dimensional, then Drill supports nulls in outer 
dimensions but not inner ones. That is, the following is legal:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{code}

Here, {{null}} behaves the same as an empty array: {{[ ]}}.

Note, however, that the above is *not* valid if the nulls come before the first 
non-null value:

{code}
{a: [null, ["not", "allowed"]]}
{code}

The reason is that, in the previous example, Drill could determine that the 
outer list is a list of string arrays. But, in the second case, Drill cannot 
tell the array type when it sees the {{null}}: it might be an array of scalars, 
so {{null}} is not allowed.

That said, the following *is* allowed:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{a: [null, ["is", "allowed"]]}
{code}

Here, the first array showed the type; Drill remembers that type for the second 
array.

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select 
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.

h4. Multi-Dimensional Arrays

Drill supports multi-dimensional arrays of a single type (including maps). For 
example:

{code}
{a: [[10, 20], [30, 40]]}
{code}

Drill (at least in 1.13) allows nulls in the outer dimension of an array:

{code}
{a: [[10, 20], null, [30, 40]]}
{code}

A null is the same as an empty array:

{code}
{a: [[10, 20], [ ], [30, 40]]}
{code}

Drill 1.13 allows "ambiguity" in arrays before seeing an actual value:
{code}
{a: null}
{a: [ ]}
{a: [ [ ] ]}
{a: [[10, 20], null, [30, 40]]}
{code}

The caveat is that the value must appear within the first batch. Otherwise, the 
first batch must guess at a type, will guess text mode, and will conflict with 
later values if they are arrays or maps.

Note that Drill *cannot* handle leading nulls before the type is seen. The 
following is invalid:

{code}
{a: [null, [10, 20], [30, 40]]}
{code}

The reason is that Drill does not yet know the type at the time that the null 
appears, and so does not know whether to treat it as an empty array or as a 
null scalar (and, of course, nulls are not supported in scalar arrays.)


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner 
level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

The general rule is that Drill does not support nulls in JSON arrays. However, 
there are many odd cases.

If the JSON array is 2+ dimensional, then Drill supports nulls in outer 
dimensions but not inner ones. That is, the following is legal:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{code}

Here, {{null}} behaves the same as an empty array: {{[ ]}}.

Note, however, that the above is *not* valid if the nulls come before the first 
non-null value:

{code}
{a: [null, ["not", "allowed"]]}
{code}

The reason is that, in the previous example, Drill could determine that the 
outer list is a list of string arrays. But, in the second case, Drill cannot 
tell the array type when it sees the {{null}}: it might be an array of scalars, 
so {{null}} is not allowed.

That said, the following *is* allowed:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{a: [null, ["is", "allowed"]]}
{code}

Here, the first array showed the type; Drill remembers that type for the second 
array.

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select 
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to