[ 
https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857581#comment-15857581
 ] 

Paul Rogers commented on DRILL-4824:
------------------------------------

Yet another issue. Suppose we have a file that has 65K entries of the following 
record:

{code}
{ "field1": { } }
{code}

Then, we have 10 records of the following:

{code}
{ "field1" : { "fooled": "you" } }
{code}
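
To reproduce this, a small script can generate such a file. The following is a minimal sketch in Python, not part of the original report: the function name, the output file name, and the exact count of 65,000 (standing in for "65K") are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_test_file(path, empty_count=65000, filled_count=10):
    """Write empty_count records with an empty map, then filled_count
    records whose map has one entry. Returns the total record count."""
    with open(path, "w") as f:
        for _ in range(empty_count):
            f.write(json.dumps({"field1": {}}) + "\n")
        for _ in range(filled_count):
            f.write(json.dumps({"field1": {"fooled": "you"}}) + "\n")
    return empty_count + filled_count


if __name__ == "__main__":
    # Hypothetical output location; any path readable by Drill would do.
    path = os.path.join(tempfile.gettempdir(), "empty_then_filled.json")
    print(write_test_file(path), "records written to", path)
```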

When we read the input file, Drill realizes that "field1" is a map. When we get 
to the last 10 lines, there is no schema change; we simply continue to return 
map values, which are now non-empty. All good.

Now do a CTAS using the proposed fix. The output file has 65K entries as follows:

{code}
{ }
{code}

Followed by the last 10 original lines.

When we read the CTAS file, we do not get the same results as from the original 
file. We get one batch with no fields. The second batch incurs a schema change 
as we add "field1" as a map. (I chose the 65K number for the first block of 
fields so that there are more records than fit in one record batch.)
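
The difference between the two files can be illustrated with a toy model. The sketch below is not Drill's JSON reader: it simply treats the set of top-level field names seen in a batch as that batch's schema. The helper names are hypothetical, and the batch size is an assumption chosen so that the empty records fill exactly one batch.

```python
def batch_schemas(records, batch_size):
    """Return, per batch, the sorted top-level field names seen in it."""
    schemas = []
    for start in range(0, len(records), batch_size):
        fields = set()
        for rec in records[start:start + batch_size]:
            fields.update(rec.keys())
        schemas.append(sorted(fields))
    return schemas


def count_schema_changes(schemas):
    """Count adjacent batches whose schemas differ."""
    return sum(1 for prev, cur in zip(schemas, schemas[1:]) if prev != cur)


EMPTY_COUNT = 65000  # stands in for "65K"; the exact number is an assumption
BATCH_SIZE = 65000   # assumed batch size, not taken from Drill

filled = [{"field1": {"fooled": "you"}}] * 10
original = [{"field1": {}}] * EMPTY_COUNT + filled  # the original input file
ctas = [{}] * EMPTY_COUNT + filled                  # the file after CTAS with the fix

for name, records in (("original", original), ("ctas", ctas)):
    schemas = batch_schemas(records, BATCH_SIZE)
    print(name, schemas, "schema changes:", count_schema_changes(schemas))
```

In this model the original file yields "field1" in every batch, while the CTAS output yields a first batch with no fields followed by a batch that introduces "field1".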

Since ODBC and JDBC clients can't handle schema changes, we now have 
inconsistent behavior. The original file works, has one map field and no schema 
changes. The same data, written out using CTAS, results in an output with no 
fields, then fails with a schema change.

One can create other such scenarios. The original design *did not* suffer from 
this issue because we preserved the map/array information by emitting an empty 
map or list. The new design loses this information, resulting in changed 
behavior.

Further, Drill can't handle a record batch with an empty schema. Many operators 
fail in this case. So, queries on the CTAS file will fail because of these 
pre-existing Drill problems.

For this reason, we probably do need to provide an option to re-enable the old 
output if customers want to re-read the CTAS table.

> JSON with complex nested data produces incorrect output with missing fields
> ---------------------------------------------------------------------------
>
>                 Key: DRILL-4824
>                 URL: https://issues.apache.org/jira/browse/DRILL-4824
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - JSON
>    Affects Versions: 1.0.0
>            Reporter: Roman
>            Assignee: Serhii Harnyk
>
> There is incorrect output in case of JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
>         "Field1" : {
>         }
> }
> {
>         "Field1" : {
>                 "InnerField1": {"key1":"value1"},
>                 "InnerField2": {"key2":"value2"}
>         }
> }
> {
>         "Field1" : {
>                 "InnerField3" : ["value3", "value4"],
>                 "InnerField4" : ["value5", "value6"]
>         }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> |          Field1           |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}
> There is no need to output missing fields. In the case of a deeply nested 
> structure, the result will be unreadable for the user.
> _Correct result:_
> {code:none}
> +--------------------------+
> |         Field1           |
> +--------------------------+
> {}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}



