[
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305681#comment-16305681
]
Paul Rogers commented on DRILL-6035:
------------------------------------
h4. Heterogeneous Types in JSON
JSON is a universal data format and has no rules about how data can be
structured. As a result, JSON supports a wide variety of use cases far beyond
the relational model used by Drill. For example, the following are perfectly
fine JSON:
{code}
{"a": 10}
{"a": {"type": "int", "value": 10}}
{"a": [10, 20]}
{"a": "30"}
{code}
In the above, {{a}} takes on a variety of types. In some applications the above
may even be useful.
Drill, however, follows the relational model, which requires that each column
have a single declared type.
h4. Drill Union Type
Drill has partial support for a "union" type. ("Union" in the sense of the C
language: values of multiple types that share a storage location.) If you are
familiar with the old Visual Basic, then the "variant" type in VB is another
example.
In a union, each record has a single value, but the type of the value may vary
between records. The union type can represent the examples shown above.
The union type, however, does not fit well with the relational model. Only JSON
supports it; most other Drill operators do not. Still, some users do enable the
union type, so we must understand the semantics.
The union type must be enabled (for all JSON queries) by setting a session
option:
{code}
ALTER SESSION SET `exec.enable_union_type` = true
{code}
h4. Type Promotion
When the union type is enabled, Drill 1.12 supports type promotion:
* When Drill first sees a JSON key/value pair, Drill infers the type as
described above.
* In non-union operation, if Drill sees a conflicting type, Drill will issue a
schema exception.
* In union-enabled operation, if Drill sees a conflicting type, Drill will
"promote" the column to a union type, then store the new type as a member of
the now-union column.
Drill provides no logging or other information when promotion occurs. Given that
most operators do not support unions, it is a bit of a crapshoot as to whether
the above will actually work in any given query.
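The promotion rules above can be sketched in a few lines of Python. This is only an illustrative model, not Drill's implementation: {{Column}}, {{SchemaError}}, {{json_type}} and {{load_column}} are hypothetical names, the type labels are simplified, and null handling is left out. The sample records are the four shown at the top of this comment, written with quoted keys.

```python
import json


class SchemaError(Exception):
    """Stands in for the schema exception raised in non-union mode."""


def json_type(value):
    """Return a coarse type label for a parsed JSON value (nulls not handled)."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "list"
    if isinstance(value, dict):
        return "map"
    raise TypeError(f"unsupported value in this sketch: {value!r}")


class Column:
    """Hypothetical model of a single column; not one of Drill's vector classes."""

    def __init__(self, name, union_enabled):
        self.name = name
        self.union_enabled = union_enabled
        self.types = []      # member types, in the order first seen
        self.is_union = False
        self.values = []     # one (type, value) pair per record

    def add(self, value):
        t = json_type(value)
        if not self.types:
            # First value seen for this key: infer the column type from it.
            self.types.append(t)
        elif t not in self.types:
            if not self.union_enabled:
                raise SchemaError(
                    f"column {self.name}: saw {t}, expected {self.types[0]}")
            # Union mode: promote the column and add the new member type.
            self.is_union = True
            self.types.append(t)
        self.values.append((t, value))


def load_column(lines, key, union_enabled):
    """Feed the value of `key` from each JSON record into a fresh Column."""
    column = Column(key, union_enabled)
    for line in lines:
        column.add(json.loads(line)[key])
    return column


RECORDS = [
    '{"a": 10}',
    '{"a": {"type": "int", "value": 10}}',
    '{"a": [10, 20]}',
    '{"a": "30"}',
]
```

Calling {{load_column(RECORDS, "a", True)}} ends with a union column holding four member types, while {{load_column(RECORDS, "a", False)}} raises {{SchemaError}} on the second record. Note that the model promotes silently, which mirrors the lack of logging described above.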
h4. Type Promotion in Lists
Drill has an incomplete {{ListVector}} implementation which is, in essence, a
list of unions. (Drill 1.13 got the {{ListVector}} to work in JSON, but it is
not supported elsewhere and so is not yet usable.) Type promotion in lists is a
bit more complex:
* In non-union mode, Drill uses repeated types (as described above) for JSON
lists and type promotion is disallowed. (Lists could be used independent of
union mode, in single-type mode, as described earlier, but this is not yet
supported due to the {{ListVector}} being incomplete.)
* In union mode, Drill uses the {{ListVector}}. It starts as a single-type list
(which does not actually use a union).
* In union mode, if a conflicting type is seen, the {{ListVector}} is promoted
to union mode, the additional type is added, and the list becomes heterogeneous.
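The three list cases above can be modeled the same way. Again, this is a hedged sketch rather than a description of {{ListVector}} internals: {{ListColumn}}, {{SchemaError}} and the mode labels are made up for illustration, and element types are reported using Python's own type names.

```python
class SchemaError(Exception):
    """Stands in for the error raised when promotion is disallowed."""


class ListColumn:
    """Hypothetical model of a JSON list column; not Drill's ListVector."""

    def __init__(self, union_enabled):
        self.union_enabled = union_enabled
        self.element_types = []  # distinct element types, in order first seen

    @property
    def mode(self):
        if not self.union_enabled:
            return "repeated type"
        if len(self.element_types) <= 1:
            return "single-type list"
        return "union list"

    def add_list(self, items):
        """Record the element types of one list value from one record."""
        for item in items:
            t = type(item).__name__
            if t in self.element_types:
                continue
            if self.element_types and not self.union_enabled:
                # Non-union mode: a second element type is a schema conflict.
                raise SchemaError(
                    f"list element type {t} conflicts with "
                    f"{self.element_types[0]}")
            # First type seen, or promotion to a heterogeneous list.
            self.element_types.append(t)
```

With union mode on, {{add_list([10, 20])}} leaves the column as a single-type list, and a later {{add_list([30, "forty"])}} switches it to a union list. With union mode off, the second call raises instead.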
h4. Practical JSON
Given that unions are very unstable (and not well understood), the practical
solution is to take care to use the same JSON type for each occurrence of a
given key.
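One way to check for this before handing files to Drill is a small pre-flight scan. The sketch below assumes one JSON object per line and inspects only top-level keys; {{find_conflicting_keys}} is a hypothetical helper, and it compares Python type names rather than Drill types, so it is an approximation of what Drill would infer.

```python
import json
from collections import defaultdict


def find_conflicting_keys(lines):
    """Return {key: sorted type names} for top-level keys seen with more than one type."""
    seen = defaultdict(set)
    for line in lines:
        record = json.loads(line)
        for key, value in record.items():
            # Python type names (int, str, list, dict, ...) stand in for JSON types.
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items() if len(types) > 1}
```

For example, given the records {{{"a": 10, "b": "x"}}} and {{{"a": "30", "b": "y"}}}, the function reports {{a}} as conflicting ({{int}} and {{str}}) and leaves {{b}} out of the result.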
> Specify Drill's JSON behavior
> -----------------------------
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.13.0
> Reporter: Paul Rogers
> Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests
> that Drill may have limitations in the JSON that Drill supports. This ticket
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed
> specification that clarifies what Drill does and does not support (or what
> it should and should not support).