[jira] [Commented] (DRILL-4824) JSON with complex nested data produces incorrect output with missing fields

Paul Rogers (JIRA) Wed, 08 Feb 2017 09:36:07 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858294#comment-15858294
 ]


Paul Rogers commented on DRILL-4824:
------------------------------------

Let’s step back to establish whether we want a correct solution or just a 
less-bad workaround. So far in this JIRA, we’ve been talking about workarounds.

The fundamental problem is that Drill discards information important to JSON. 
In JSON, a field can have multiple states: not-provided, null, map (perhaps 
empty), list (perhaps empty), number, string, or boolean.
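
To make those states concrete, here is a minimal sketch in plain Java that 
classifies one field of a parsed JSON object. It is an illustration only, not 
Drill code: the class JsonFieldStates, the State enum, and the use of a 
java.util.Map to model a parsed object are all assumptions made for this 
example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names exist in Drill.
final class JsonFieldStates {

  // The states a single JSON field can be in.
  enum State { NOT_PROVIDED, NULL, MAP, LIST, NUMBER, STRING, BOOLEAN }

  // Classifies one field of a parsed JSON object, modeled as a plain Map.
  static State classify(Map<String, Object> record, String field) {
    if (!record.containsKey(field)) {
      return State.NOT_PROVIDED;          // key is absent from the object
    }
    Object value = record.get(field);
    if (value == null) return State.NULL; // key present, value is JSON null
    if (value instanceof Map) return State.MAP;
    if (value instanceof List) return State.LIST;
    if (value instanceof Number) return State.NUMBER;
    if (value instanceof Boolean) return State.BOOLEAN;
    if (value instanceof String) return State.STRING;
    throw new IllegalArgumentException("Not a JSON value: " + value.getClass());
  }

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<>();
    record.put("a", null);                          // explicit null
    record.put("b", new HashMap<String, Object>()); // empty map
    record.put("c", new ArrayList<Object>());       // empty list
    for (String field : new String[] {"a", "b", "c", "d"}) {
      System.out.println(field + " -> " + classify(record, field));
    }
  }
}
```

The point of the sketch is that "absent", "null", "empty map", and "empty 
list" are four distinct answers, which is exactly the distinction that gets 
lost when they are folded into fewer states.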

Drill cannot represent the following:

* Variable types
* Not-provided
* Null map
* Null array

As a result, we try to “compress” the JSON states into the smaller set of Drill 
states. To have an accurate solution, we must make (at least) three changes:

* Add a null bit to map and array (Go from MapVector to NullableMapVector, etc.)
* Include the not-provided bit.
* Support variant (union) vectors.

The good news is that Drill already provides the essential pieces.

* Drill provides null flag vectors for other vectors. There is nothing (other 
than work) preventing us from adding them to maps, arrays, and lists.
* While we call the nullable flag a “bit” vector, we actually use an entire 
byte per record. As a result, we can simply grab one of the seven unused bits 
to use as a “not-provided” bit.
* Drill provides a partial implementation of a variant (or “union”) vector.
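
As a rough sketch of the second point, the following plain Java class keeps 
one flag byte per record and borrows a spare bit for "not-provided". 
Everything here is an assumption made for illustration: the class FieldFlags, 
the constant names, and the choice of which bit means what are hypothetical 
and do not describe Drill's actual vector layout.

```java
// Hypothetical sketch; names and bit assignments are assumptions,
// not Drill's actual nullable-vector format.
final class FieldFlags {
  static final byte IS_SET = 0x01;       // low bit: a non-null value is present
  static final byte NOT_PROVIDED = 0x02; // borrowed from the seven spare bits

  private final byte[] flags;            // one byte per record

  FieldFlags(int recordCount) {
    flags = new byte[recordCount];       // all zero: every record starts as null
  }

  void markValue(int record)       { flags[record] = IS_SET; }
  void markNull(int record)        { flags[record] = 0; }
  void markNotProvided(int record) { flags[record] = NOT_PROVIDED; }

  boolean isSet(int record)         { return (flags[record] & IS_SET) != 0; }
  boolean isNotProvided(int record) { return (flags[record] & NOT_PROVIDED) != 0; }

  // Explicit JSON null: the field was present but carried no value.
  boolean isNull(int record) { return flags[record] == 0; }

  public static void main(String[] args) {
    FieldFlags field = new FieldFlags(3);
    field.markValue(0);        // {"f": 10}
    field.markNull(1);         // {"f": null}
    field.markNotProvided(2);  // {}
    for (int i = 0; i < 3; i++) {
      System.out.println(i + ": set=" + field.isSet(i)
          + " null=" + field.isNull(i)
          + " notProvided=" + field.isNotProvided(i));
    }
  }
}
```

Under this assumed layout, a reader that only tests the low bit would still 
see a not-provided record as "not set", so the extra bit adds information 
without changing what such a reader observes.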

Building on those three components, we can achieve complete support of the JSON 
standard. 

> JSON with complex nested data produces incorrect output with missing fields
> ---------------------------------------------------------------------------
>
>                 Key: DRILL-4824
>                 URL: https://issues.apache.org/jira/browse/DRILL-4824
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - JSON
>    Affects Versions: 1.0.0
>            Reporter: Roman
>            Assignee: Serhii Harnyk
>
> The output is incorrect for a JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
>         "Field1" : {
>         }
> }
> {
>         "Field1" : {
>                 "InnerField1": {"key1":"value1"},
>                 "InnerField2": {"key2":"value2"}
>         }
> }
> {
>         "Field1" : {
>                 "InnerField3" : ["value3", "value4"],
>                 "InnerField4" : ["value5", "value6"]
>         }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> |          Field1           |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}
> There is no need to output missing fields. With a deeply nested 
> structure, the result becomes unreadable for the user.
> _Correct result:_
> {code:none}
> +--------------------------+
> |         Field1           |
> +--------------------------+
> {}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
