[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293521#comment-16293521
 ] 

Paul Rogers commented on DRILL-6035:
------------------------------------

h4. Object Key Names

The [JSON Standard|https://www.json.org] identifies an object as:

{code}
{ (string: value)* }
{code}

That is, the key portion of the name/value pair can be an arbitrary string, 
encoded in UTF-8.

Drill follows the SQL rules for names in SQL statements:

* Names are case insensitive
* Names must follow certain syntax rules (but those rules can be skipped if the 
name is enclosed in back-ticks.)
* Names must consist of at least a single character

Drill rules for names in JSON are:

* Names need not be quoted if they are unambiguous.
* Names are considered case insensitive for comparison purposes.

When determining column names:

* If the query includes a {{SELECT *}}, Drill uses the names (and case) 
specified in JSON.
* If the query includes an explicit projection, {{SELECT x, y, z}}, then Drill 
uses the names and case specified in the SQL. That is, even if the JSON field 
names are "X", "Y" and "Z", Drill will still name the columns `x`, `y` and `z`.
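
The two cases above can be sketched in a few lines of Python. This is only a model of the naming rule, not Drill code: the function {{project}} and its parameters are hypothetical, and returning a null ({{None}}) for a projected column that is absent from the record is an assumption made for the sketch.

```python
def project(record, projection=None):
    """Model of how output column names are chosen (hypothetical helper).

    record:     dict of JSON field name -> value
    projection: list of names from the SELECT clause, or None for SELECT *
    """
    if projection is None:
        # SELECT *: keep the names, and their case, exactly as in the JSON.
        return dict(record)
    # Explicit projection: match case-insensitively, but name the output
    # column with the spelling used in the SQL statement.
    by_lower = {name.lower(): value for name, value in record.items()}
    # Assumption: a projected column missing from the record becomes None.
    return {col: by_lower.get(col.lower()) for col in projection}


row = {"X": 1, "Y": 2, "Z": 3}
print(project(row))                   # SELECT *       -> {'X': 1, 'Y': 2, 'Z': 3}
print(project(row, ["x", "y", "z"]))  # SELECT x, y, z -> {'x': 1, 'y': 2, 'z': 3}
```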

h4. Case Sensitivity Conflicts

The above sets up conflicts between JSON and Drill naming rules:

* The names "a" and "A" are distinct in JSON, identical in Drill.
* The string "" is a valid key in JSON, but an invalid name in Drill.

Although Drill allows the use of back-ticks to escape non-standard names, this 
syntax cannot be used to overcome Drill's case insensitivity. That is, all of the 
following match either "x" or "X":

* {{x}}
* {{X}}
* {{`x`}}
* {{`X`}}
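
As a rough illustration of why back-ticks do not help here, the matching rule can be modeled as "remove the optional back-tick quoting, then compare ignoring case". The helpers {{normalize}} and {{matches}} are hypothetical names for this sketch, and simple lower-casing stands in for whatever case folding Drill actually applies.

```python
def normalize(sql_name):
    """Strip optional back-tick quoting, then fold case (hypothetical helper)."""
    if len(sql_name) >= 2 and sql_name.startswith("`") and sql_name.endswith("`"):
        sql_name = sql_name[1:-1]
    # Assumption: lower-casing approximates the case-insensitive comparison.
    return sql_name.lower()


def matches(sql_name, json_key):
    """True if a name written in SQL refers to the given JSON key."""
    return normalize(sql_name) == json_key.lower()


# Every spelling, quoted or not, matches both JSON keys.
for sql_name in ["x", "X", "`x`", "`X`"]:
    print(sql_name, matches(sql_name, "x"), matches(sql_name, "X"))
```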

If Drill is presented with a JSON document with names that differ only in case, 
then the last name wins. That is, given this input:

{code}
{x: 10, X: 20}
{code}

Drill (in Version 1.13) will not notice the "duplicate" name, but will rather 
simply overwrite the first "x" with the second "X", producing a single column 
"x" with the value 20 for the first record. In this, Drill follows RFC-7159: 
"When the names within an object are not unique, the behavior of software that 
receives such an object is unpredictable.  Many implementations report the last 
name/value pair only."
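
The "last name wins" behavior can be reproduced with a short Python sketch. It is a model under stated assumptions, not Drill's implementation: {{load_record}} is a hypothetical helper that keeps the spelling of the first name seen and the value of the last, which is the outcome described above.

```python
import json


def load_record(pairs):
    """Collapse names that differ only in case (hypothetical helper).

    pairs: (name, value) tuples in document order, as supplied by
    json.loads through its object_pairs_hook parameter.
    """
    columns = {}  # lower-cased name -> [first spelling seen, latest value]
    for name, value in pairs:
        key = name.lower()
        if key in columns:
            columns[key][1] = value        # later value overwrites, no error
        else:
            columns[key] = [name, value]   # first spelling names the column
    return {spelling: value for spelling, value in columns.values()}


record = json.loads('{"x": 10, "X": 20}', object_pairs_hook=load_record)
print(record)  # {'x': 20}
```

Note that the sketch quotes the keys, since Python's {{json}} module requires strict JSON.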

h4. Empty Names

JSON allows an empty key name:

{code}
{"": 10}
{code}

Drill (in 1.13) will raise an error in this situation.

h4. Leading and Trailing Spaces

JSON keys are arbitrary strings, which means JSON allows keys with leading and 
trailing spaces:

{code}
{" a": 10, " b": 20, " c ": 30}
{code}

Drill (in 1.13) strips leading and trailing spaces. Thus, a name that consists 
only of spaces is considered to be empty. The three names shown above are 
considered to be `a`, `b` and `c`.
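
The stripping rule, combined with the empty-name rule from the previous section, can be sketched as follows. The helper {{clean_name}} is hypothetical, and the sketch assumes that a name made up only of spaces is rejected in the same way as the empty name.

```python
def clean_name(raw):
    """Strip leading and trailing spaces from a JSON key (hypothetical helper)."""
    name = raw.strip(" ")
    if not name:
        # Assumption: an all-space name fails just like the empty name "".
        raise ValueError("empty column name: %r" % raw)
    return name


print([clean_name(key) for key in [" a", " b", " c "]])  # ['a', 'b', 'c']

for bad in ["", "   "]:
    try:
        clean_name(bad)
    except ValueError as err:
        print("rejected:", err)
```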

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specification that clarifies what Drill does and does not support (or what 
> it should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
