[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305681#comment-16305681
 ] 

Paul Rogers commented on DRILL-6035:
------------------------------------

h4. Heterogeneous Types in JSON

JSON is a universal data format and has no rules about how data can be 
structured. As a result, JSON supports a wide variety of use cases far beyond 
the relational model used by Drill. For example, the following are perfectly 
fine JSON:

{code}
{"a": 10}
{"a": {"type": "int", "value": 10}}
{"a": [10, 20]}
{"a": "30"}
{code}

In the above, {{a}} takes on a variety of types. In some applications the above 
may even be useful.

Drill, however, follows the relational model, which requires that each column 
have a single declared type.

h4. Drill Union Type

Drill has partial support for a "union" type. ("Union" in the sense of the C 
language: values of multiple types that share a storage location.) If you are 
familiar with the old Visual Basic, then the "variant" type in VB is another 
example.

In a union, each record has a single value, but the type of the value may vary 
between records. The union type can represent the examples shown above.

The union type, however, does not fit well with the relational model. Only the 
JSON reader supports it; most other Drill operators do not. Still, some users do 
enable the union type, so we must understand its semantics.

To use the union type, enable it with a session option; the setting then applies 
to all JSON queries in the session:

{code}
ALTER SESSION SET `exec.enable_union_type` = true
{code}

h4. Type Promotion

When the union type is enabled, Drill 1.12 supports type promotion:

* When Drill first sees a JSON key/value pair, Drill infers the type as 
described above.
* In non-union operation, if Drill sees a conflicting type, Drill will issue a 
schema exception.
* In union-enabled operation, if Drill sees a conflicting type, Drill will 
"promote" the column to a union type, then store the new type as a member of 
the now-union column.

Drill provides no logging or other information when promotion occurs. Given that 
most operators do not support unions, it is a bit of a crapshoot as to whether 
the above will actually work in any given query.
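
The promotion rules above can be modeled in a few lines. This is a minimal 
Python sketch, not Drill's implementation: the {{Column}} class, the 
{{json_type}} helper, {{SchemaError}}, and the simplified type names are all 
hypothetical and exist only for illustration. Nulls are skipped, since null 
handling is a separate topic.

```python
class SchemaError(Exception):
    """Raised when a conflicting type is seen and unions are disabled."""


def json_type(value):
    """Map a parsed JSON value to a simplified, illustrative type name."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "list"
    if isinstance(value, dict):
        return "map"
    raise TypeError(f"unsupported value: {value!r}")


class Column:
    """Hypothetical model of a single column; not a Drill class."""

    def __init__(self, name, union_enabled):
        self.name = name
        self.union_enabled = union_enabled
        self.types = []        # member types, in the order first seen
        self.is_union = False

    def add(self, value):
        if value is None:      # assumption: nulls do not affect the type
            return
        t = json_type(value)
        if not self.types:
            self.types.append(t)   # first value: infer the column type
        elif t not in self.types:
            if not self.union_enabled:
                raise SchemaError(
                    f"column {self.name}: {t} conflicts with {self.types[0]}")
            self.is_union = True   # promote, then add the new member type
            self.types.append(t)
```

Feeding the four values of {{a}} from the first example into 
{{Column("a", union_enabled=True)}} leaves the column marked as a union with 
four member types; with {{union_enabled=False}} the second value raises 
{{SchemaError}}.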

h4. Type Promotion in Lists

Drill has an incomplete {{ListVector}} implementation which is, in essence, a 
list of unions. (Drill 1.13 got the {{ListVector}} to work in JSON, but it is 
not supported elsewhere and so is not yet usable.) Type promotion in lists is a 
bit more complex:

* In non-union mode, Drill uses repeated types (as described above) for JSON 
lists and type promotion is disallowed. (Lists could be used independent of 
union mode, in single-type mode, as described earlier, but this is not yet 
supported due to the {{ListVector}} being incomplete.)
* In union mode, Drill uses the {{ListVector}}. It starts as a single-type list 
(which does not actually use a union).
* In union mode, if a conflicting type is seen, the {{ListVector}} is promoted 
to union mode, the additional type is added, and the list becomes heterogeneous.
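
The three cases above can be summarized as a small decision function. This is 
only a sketch in Python under simplifying assumptions: {{describe_list}}, 
{{element_type}}, and the layout labels are hypothetical names, the element 
types are limited to scalars, and nothing here reflects the actual 
{{ListVector}} code.

```python
def element_type(value):
    """Simplified, illustrative type name for a scalar list element."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported element: {value!r}")


def describe_list(values, union_enabled):
    """Return (layout, member types) for a JSON list under the rules above."""
    seen = []
    for v in values:
        t = element_type(v)
        if t not in seen:
            seen.append(t)

    if not union_enabled:
        # Non-union mode: a repeated type, and no promotion is allowed.
        if len(seen) > 1:
            raise ValueError(f"conflicting element types: {seen}")
        return ("repeated", seen)

    if len(seen) <= 1:
        # Union mode, one type so far: a single-type list, no union needed.
        return ("list", seen)

    # Union mode, conflicting type seen: the list becomes heterogeneous.
    return ("list-of-union", seen)
```

For example, {{describe_list([10, 20], True)}} stays a single-type list, while 
{{describe_list([10, "x"], True)}} is promoted to a list of unions.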

h4. Practical JSON

Given that unions are very unstable (and not well understood), the practical 
solution is to take care to use the same JSON type for each occurrence of a 
given key.
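
One way to check this before querying is to scan the data for keys whose JSON 
type varies. The following Python sketch assumes one strict JSON object per 
line and inspects top-level keys only; {{find_mixed_keys}} and 
{{json_type_name}} are hypothetical helper names, not part of Drill. It groups 
all numbers together and ignores nulls, which may be too coarse for some data.

```python
import json
from collections import defaultdict


def json_type_name(value):
    """Return the JSON type name of a parsed, non-null value."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def find_mixed_keys(lines):
    """Map each top-level key seen with more than one JSON type to its types."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not isinstance(record, dict):
            continue               # only objects have keys to check
        for key, value in record.items():
            if value is None:      # assumption: nulls are ignored here
                continue
            seen[key].add(json_type_name(value))
    return {k: sorted(t) for k, t in seen.items() if len(t) > 1}
```

Passing an open file (or any iterable of lines) to {{find_mixed_keys}} returns 
an empty dict when every key uses a single JSON type throughout.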

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specification that clarifies what Drill does and does not support (or what 
> it should and should not support).


