[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293265#comment-16293265
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/16/17 2:01 AM:
--------------------------------------------------------------

h4. JSON Scalars

Drill uses simple rules to infer scalar data types from JSON input:

* If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: 
"varchar"}}
* If the value is numeric, and has no decimal point, it is a Nullable 
{{BIGINT}}. Example: {{a: 10}}
* If the value is numeric, and has a decimal point, it is a Nullable 
{{FLOAT8}}. Eample: {{a: 10.5}}
* If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as 
Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: 
true}}

A special case occurs if the user enables {{store.json.all_text_mode}}. In this 
case, all cases above give rise to a nullable {{VARCHAR}} column type. 

h4. Nulls with Scalar Values

The rules for {{null}} are a bit more complex. The rules below apply to the 
Drill 1.13 JSON reader, earlier versions require investigation.

* If the {{null}} occurs after seeing one of the above types, then the value is 
a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.)
* If the {{null}} occurs at the start of the file, Drill defers identifying the 
data type until it sees the first non-null value.
* If the entire file (or first batch of a file: 4K records before Drill 1.13, 
variable number in Drill 1.13 or later) contains only {{null}} values, the 
column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has 
different behavior.)

For example, in the following example, Drill 1.13 identifies the type as 
nullable {{BIGINT}}:

{code}
{a: null} {a: null} {a: null} {a: 10}
{code}

In the following example, Drill must guess that the data type is nullable 
{{VARCHAR}}:

{code}
{a: null} {a: null} {a: null}
{code}

h4. BIGINT vs. FLOAT8 Types

JSON has a single numeric type which can hold integers, floats or decimals. 
Drill attemps to infer the type of the field from the format of the first 
number encountered for a field using the following rules:

* If the number consists entirely of digits, it is an integer, stored as 
{{BIGINT}}.
* Else, it is a floating point number stored as {{FLOAT8}}.

Drill 1.13 adds the following rule:

* If a number is {{FLOAT8}}, accept an integer, but promote it to {{FLOAT8}}.

That is, the following array is stored as {{FLOAT8}}:

{code}
[1.2, 3, 5.0]
{code}

Drill is easily fooled, however, rearrange the values and Drill will raise an 
error:

{code}
[3, 1.2, 5.0]
{code}

In the second case, Drill sees 3, guesses {{BIGINT}}, then immediately fails 
when trying to store the float value 1.2.




was (Author: paul.rogers):
h4. JSON Scalars

Drill uses simple rules to infer scalar data types from JSON input:

* If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: 
"varchar"}}
* If the value is numeric, and has no decimal point, it is a Nullable 
{{BIGINT}}. Example: {{a: 10}}
* If the value is numeric, and has a decimal point, it is a Nullable 
{{FLOAT8}}. Eample: {{a: 10.5}}
* If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as 
Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: 
true}}

A special case occurs if the user enables {{store.json.all_text_mode}}. In this 
case, all cases above give rise to a nullable {{VARCHAR}} column type. 

h4. Nulls with Scalar Values

The rules for {{null}} are a bit more complex. The rules below apply to the 
Drill 1.13 JSON reader, earlier versions require investigation.

* If the {{null}} occurs after seeing one of the above types, then the value is 
a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.)
* If the {{null}} occurs at the start of the file, Drill defers identifying the 
data type until it sees the first non-null value.
* If the entire file (or first batch of a file: 4K records before Drill 1.13, 
variable number in Drill 1.13 or later) contains only {{null}} values, the 
column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has 
different behavior.)

For example, in the following example, Drill 1.13 identifies the type as 
nullable {{BIGINT}}:

{code}
{a: null} {a: null} {a: null} {a: 10}
{code}

In the following example, Drill must guess that the data type is nullable 
{{VARCHAR}}:

{code}
{a: null} {a: null} {a: null}
{code}

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to