[
https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858294#comment-15858294
]
Paul Rogers commented on DRILL-4824:
------------------------------------
Let’s step back to establish if we want a correct solution, or just a less-bad
workaround. In this JIRA, we’ve been talking about workarounds.
The fundamental problem is that Drill discards information important to JSON.
In JSON, a field can have multiple states: not-provided, null, map (perhaps
empty), list (perhaps empty), number, string.
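To make those states concrete, here is a rough sketch in Python (not Drill code). The `MISSING` sentinel and the `field_state` helper are hypothetical names used only for illustration. Booleans are handled separately only because Python treats `bool` as a kind of `int`.

```python
import json

MISSING = object()  # hypothetical sentinel meaning "field not provided"

def field_state(record, name):
    """Return the JSON state of field `name` within one parsed record."""
    value = record.get(name, MISSING)
    if value is MISSING:
        return "not-provided"
    if value is None:
        return "null"
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "list"
    if isinstance(value, bool):
        # JSON booleans: not in the list above, checked before numbers
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"

# One record per state, all using the same field name "a".
texts = ['{}', '{"a": null}', '{"a": {}}', '{"a": []}', '{"a": 1}', '{"a": "x"}']
states = [field_state(json.loads(t), "a") for t in texts]
print(states)
```

The same field name yields six different states across six records, which is the information a reader has to keep if the output is to match the input.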
Drill cannot represent the following:
* Variable types
* Not-provided
* Null map
* Null array
As a result, we try to “compress” the JSON states into the smaller set of Drill
states. To have an accurate solution, we must make (at least) three changes:
* Add a null bit to map and array (Go from MapVector to NullableMapVector, etc.)
* Include the not-provided bit.
* Support variant (union) vectors.
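The effect of that "compression" can be imitated with a small Python sketch. This is only a simulation of the symptom under the assumption that every record is forced into one merged schema, with absent maps and lists filled in as empty; it is not how Drill is actually implemented, and `merged_types` and `lossy_rows` are hypothetical names.

```python
import json

# The three Field1 values from example.json in this issue.
field1_values = [
    {},
    {"InnerField1": {"key1": "value1"}, "InnerField2": {"key2": "value2"}},
    {"InnerField3": ["value3", "value4"], "InnerField4": ["value5", "value6"]},
]

def merged_types(maps):
    """Collect every key seen in any record, with its container type."""
    types = {}
    for m in maps:
        for key, value in m.items():
            types.setdefault(key, type(value))  # dict or list in this example
    return types

def lossy_rows(maps):
    """Fill each record out to the merged schema; absent becomes empty."""
    types = merged_types(maps)
    return [
        {key: (m[key] if key in m else make_empty()) for key, make_empty in types.items()}
        for m in maps
    ]

rows = lossy_rows(field1_values)
for row in rows:
    print(json.dumps(row, separators=(",", ":")))
```

Once a missing field and an empty one are written the same way, nothing downstream can tell them apart, which is why the first record no longer prints as `{}`.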
The good news is that Drill already provides the essential pieces.
* Drill provides null flag vectors for other vectors. There is nothing (other
than work) preventing us from adding them to maps, arrays, and lists.
* While we call the nullable flag a “bit” vector, we actually use an entire
byte per record. As a result, we can simply grab one of the seven unused bits
to use as a “not-provided” bit.
* Drill provides a (partial) implementation of a variant (or “union”) vector.
Building on those three components, we can achieve complete support of the JSON
standard.
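As a rough illustration of the spare-bit idea, the Python sketch below keeps one flag byte per record and uses a second bit for "not-provided". The bit positions, the meaning assigned to the low bit, and the names `IS_SET`, `NOT_PROVIDED`, `encode_flag` and `decode_flag` are all assumptions made for this example, not Drill's actual vector layout.

```python
IS_SET = 0b0000_0001        # assumed: 1 means a non-null value is present
NOT_PROVIDED = 0b0000_0010  # hypothetical: one of the seven spare bits

def encode_flag(record, name):
    """Pack the state of field `name` into a single flag byte."""
    if name not in record:
        return NOT_PROVIDED
    if record[name] is None:
        return 0
    return IS_SET

def decode_flag(flag):
    """Recover the three-way state from a flag byte."""
    if flag & NOT_PROVIDED:
        return "not-provided"
    return "value" if flag & IS_SET else "null"

records = [{}, {"a": None}, {"a": {}}]
flags = bytearray(encode_flag(r, "a") for r in records)  # one byte per record
decoded = [decode_flag(f) for f in flags]
print(list(flags), decoded)
```

Storage cost stays at one byte per record, so in this sketch the extra state costs no additional space.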
> JSON with complex nested data produces incorrect output with missing fields
> ---------------------------------------------------------------------------
>
> Key: DRILL-4824
> URL: https://issues.apache.org/jira/browse/DRILL-4824
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - JSON
> Affects Versions: 1.0.0
> Reporter: Roman
> Assignee: Serhii Harnyk
>
> The output is incorrect for a JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
> "Field1" : {
> }
> }
> {
> "Field1" : {
> "InnerField1": {"key1":"value1"},
> "InnerField2": {"key2":"value2"}
> }
> }
> {
> "Field1" : {
> "InnerField3" : ["value3", "value4"],
> "InnerField4" : ["value5", "value6"]
> }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> | Field1 |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}
> There is no need to output missing fields. With a deeply nested
> structure, the result becomes unreadable for the user.
> _Correct result:_
> {code:none}
> +--------------------------+
> | Field1 |
> +--------------------------+
> {}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}