[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

Paul Rogers (JIRA) Fri, 15 Dec 2017 14:01:02 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293298#comment-16293298
 ]


Paul Rogers edited comment on DRILL-6035 at 12/15/17 9:59 PM:
--------------------------------------------------------------

h4. JSON Objects

Drill support for JSON objects consists of four parts.

* Drill supports a non-standard serialized JSON format in which an input file 
is a sequence of JSON objects.
* Values within the top-level JSON object give rise to columns within the Drill 
row.
* Nested objects in JSON give rise to nested {{MAP}} columns in Drill.
* Arrays of JSON objects give rise to {{REPEATED MAP}} columns in Drill.

h4. Top-Level JSON Objects

Example of the expected JSON input format:

{code}
{a: 10}
{a: 20}
{a: 30}
{code}

Drill allows any amount of white-space between objects. It is common to place 
each object on a new line, though this is not required.
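
As a rough illustration of this input format (not Drill's actual reader), the sketch below reads a sequence of top-level objects separated by arbitrary white-space. The function name {{read_json_objects}} is hypothetical. Note one assumption: Python's standard {{json}} module accepts only quoted member names, so the sample uses the quoted form.

```python
import json

def read_json_objects(text):
    """Yield each top-level JSON object from a white-space-separated sequence."""
    decoder = json.JSONDecoder()
    pos = 0
    n = len(text)
    while True:
        # Skip any white-space (spaces, tabs, newlines) between objects.
        while pos < n and text[pos].isspace():
            pos += 1
        if pos >= n:
            return
        # raw_decode parses one value starting at pos and returns the end index.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Objects may share a line or sit on separate lines.
sample = '{"a": 10}\n{"a": 20}   {"a": 30}'
rows = list(read_json_objects(sample))
print(rows)
```

Each decoded object corresponds to one row, and each top-level member to one column.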

Note that Drill allows member names (keys) to be quoted or unquoted. The 
following are both valid:

{code}
{"a": 10}
{a: 20}
{code}

h4. Drill MAP type

Drill uses the term "Map" to describe JSON objects. However, the {{MAP}} type 
in Drill is closer to the {{STRUCT}} type in Impala and Hive. That is, like a 
{{STRUCT}}, the schema of all map instances is identical across rows. (This is 
unlike, say, a JSON object or Python map in which the members of one instance 
are independent of those in any other instance.)

As a result, the following example:

{code}
{a: {x: 10}}
{a: {y: 20}}
{code}

Gives rise to records in Drill with data similar to the following:

{code}
{a: {x: 10, y: null}}
{a: {x: null, y: 20}}
{code}
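
The struct-like behavior can be sketched as follows. This is only a model of the visible result, not Drill's implementation: it assumes the map schema is the union of the members seen across rows, with absent members filled in as null. The function name {{unify_map_schema}} is hypothetical.

```python
def unify_map_schema(rows, column):
    """Give every row's map the same members, filling absent ones with None."""
    # Collect member names in first-seen order across all rows.
    members = []
    for row in rows:
        for name in (row.get(column) or {}):
            if name not in members:
                members.append(name)
    # Rebuild each row so its map carries every member of the shared schema.
    unified = []
    for row in rows:
        value = row.get(column) or {}
        unified.append({**row, column: {m: value.get(m) for m in members}})
    return unified

rows = [{"a": {"x": 10}}, {"a": {"y": 20}}]
unified = unify_map_schema(rows, "a")
print(unified)
```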

h4. JSON Object Arrays and Drill Repeated MAPs

The example below shows a repeated object which gives rise to a {{REPEATED 
MAP}}:

{code}
{a: [{b: 10}, {b: 20}]}
{code}

h4. Nulls with JSON Objects

JSON allows null values for a map:

{code}
{id: 1, a: {x: 10}}
{id: 2, a: null}
{id: 3}
{code}

Drill does not support the concept of a "nullable map". Instead, Drill defines 
all map members to be nullable. If the entire object is null (or missing) in 
JSON, Drill treats this the same as if every member were null. Thus, in Drill, 
the following are all equivalent:

{code}
{id: 1, a: {x: null, y: null}}
{id: 2, a: {}}
{id: 3, a: {x: null}}
{id: 4, a: null}
{id: 5}
{code}
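
A minimal sketch of this equivalence, assuming the map schema is already known to have members {{x}} and {{y}}. The helper {{normalize_map}} is hypothetical and only models the rule described here: a null, missing, empty, or partially filled object all reduce to a map whose members are null.

```python
def normalize_map(row, column, members):
    """Return the map for `column`, treating a null or missing map as all-null members."""
    value = row.get(column) or {}  # null and missing both become an empty map
    return {m: value.get(m) for m in members}

records = [
    {"id": 1, "a": {"x": None, "y": None}},
    {"id": 2, "a": {}},
    {"id": 3, "a": {"x": None}},
    {"id": 4, "a": None},
    {"id": 5},
]
normalized = [normalize_map(r, "a", ["x", "y"]) for r in records]
print(normalized)
```

All five records produce the same map value, which is also why the original JSON structure cannot be recovered afterwards.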

As a side note, when exporting the above data to a JSON file, Drill cannot 
recreate the original structure. Instead, it writes all of the above in a 
common format. (The format has evolved in response to previous bugs; the 
current choice still needs to be investigated.)

As described for scalars, Drill will defer selecting a type for a column if the 
initial records consist only of null values. If a later value is revealed to be 
a map, Drill will choose the map type. If the file (or first batch) consists 
only of nulls, then Drill cannot know the type and guesses {{VARCHAR}}. This 
will lead to a schema change error if a later file (or batch) reveals the type 
to actually be a map (since {{VARCHAR}} and {{MAP}} are not compatible).

{code}
{id: 1} {id: 2, a: null} {id: 3, a: null}
{id: 4, a: {x: 10, y: 20}}
{code}
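
The deferral rule can be modeled roughly as below. This is a simplified sketch, not Drill's code: {{infer_column_type}} and the {{SCALAR}} label are hypothetical, and batches are represented as plain lists of records. It shows why the outcome depends on where the batch boundary falls.

```python
def infer_column_type(batch, column):
    """Pick a type from the first non-null value; guess VARCHAR if all are null."""
    for row in batch:
        value = row.get(column)
        if value is None:
            continue  # null or missing: keep deferring the decision
        # "SCALAR" is a placeholder label for any non-map value in this sketch.
        return "MAP" if isinstance(value, dict) else "SCALAR"
    return "VARCHAR"  # nothing but nulls in this batch, so guess

nulls_only = [{"id": 1}, {"id": 2, "a": None}, {"id": 3, "a": None}]
with_map = [{"id": 4, "a": {"x": 10, "y": 20}}]

# All four records in one batch: the map is seen before a type is chosen.
same_batch = infer_column_type(nulls_only + with_map, "a")
# Records split across batches: the first guess conflicts with the second.
split_batches = (infer_column_type(nulls_only, "a"), infer_column_type(with_map, "a"))
print(same_batch, split_batches)
```

In the split case the two batches disagree ({{VARCHAR}} versus {{MAP}}), which corresponds to the schema change error described above.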



> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specification that clarifies what Drill does and does not support (or what 
> it should and should not support).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
