[
https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016738#comment-16016738
]
Paul Rogers commented on DRILL-4824:
------------------------------------
Additional thoughts as we again look at this bug.
The problem is not in the reader itself; it is in how Drill represents JSON.
To fix this, we’d have to allow multiple null states. To do that, we’d have to
adjust how we represent nulls, which has its own set of issues. See earlier
comments.
Today, the “isSet” (bit) vector is 0 for null, 1 for set. To allow multiple
null states, we need semantics that say 0 = set, non-zero = null. Then, 0x01
is plain old null, 0x03 could indicate null-and-unset.
Then, the reader (actually the mutator) would have to fill in the proper null
value for missing fields. That is, when we write record 100 (say), we’d notice
that we’ve not written a value for column x since record 95, so we’d fill in
the “missing” values with the null-and-unset value.
Today, we can just rely on the default value of 0 to indicate null. But, for
variable-width columns, we do have to back-fill the offset vectors, so we could
do the same logic for all nullable types.
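The back-fill the mutator would do might look roughly like this, assuming a per-value flag byte where 0x03 marks null-and-unset (an illustrative value, not a Drill constant):

```python
# Hypothetical back-fill: before writing record `row`, fill every record
# skipped since `last_written` with the null-and-unset flag.
NULL_UNSET = 0x03

def back_fill(flags, last_written, row):
    for i in range(last_written + 1, row):
        flags[i] = NULL_UNSET

# Writing record 100 after last touching this column at record 95:
flags = [0x00] * 101
back_fill(flags, 95, 100)   # records 96..99 become null-and-unset
```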
Once we have the two forms of null flags, then the JSON writer can do the right
thing. If just null, emit “foo: null”. If null-and-unset, skip emitting the
field.
The result is that we should be able to scan, then CTAS a JSON file and get
semantically the same output as the input (without removing null fields and
without inserting nulls for missing fields — our only two choices today.)
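The writer's decision then becomes trivial; a sketch, assuming flag values 0x01 for plain null and 0x03 for null-and-unset (names are illustrative):

```python
NULL = 0x01
NULL_UNSET = 0x03

def write_field(record, name, flag, value):
    # Round-trip-preserving emit: skip missing fields, keep explicit nulls.
    if flag == NULL_UNSET:
        return                                # absent in input: skip it
    record[name] = None if flag == NULL else value

rec = {}
write_field(rec, "a", 0x00, "v")              # set value: emitted
write_field(rec, "b", NULL, None)             # explicit null: "b": null
write_field(rec, "c", NULL_UNSET, None)       # missing: not emitted
```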
The work to prevent memory fragmentation is creating a new "size-aware" mutator
(vector writer). We can easily extend that work to handle the two null cases.
But, the big project is changing the “polarity” of null: doing so requires
inspecting all code.
One other related improvement has to do with variable-width columns. Today, we
have an inefficiency: we need the data vector, the offset vector, and the null
(bit) vector. As a result of my changes, no vector can be larger than 16 MB,
which means no offset can be larger than 16 MB (0x100_0000). We store
offsets as ints, maximum value of 0xFFFF_FFFF. This means we can play a very
simple trick: use bits 29 and 30 (or 30 and 31 if we don’t mind negatives) to
hold the bit flags and simply omit the bit vector. That immediately saves 64K
memory per nullable VarChar per batch.
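The packing trick can be sketched like this (the bit positions and masks are assumptions for illustration; the point is that offsets capped at 16 MB leave the high bits free):

```python
# Fold the two null-flag bits into the offset word itself.
FLAG_SHIFT = 29
OFFSET_MASK = (1 << FLAG_SHIFT) - 1

def pack(offset, flags):
    assert offset <= (1 << 24)          # 16 MB limit keeps high bits free
    return (flags << FLAG_SHIFT) | offset

def unpack(word):
    # Returns (offset, flags) recovered from the packed word.
    return word & OFFSET_MASK, (word >> FLAG_SHIFT) & 0x3
```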
And if we change the offset vectors, we should change the semantics from “store
the start pointer” to “store the end pointer.” That is, instead of:
[0, 10, 20, 30]
To store three 10-byte strings, use:
[10, 20, 30]
So we save four bytes, no big deal, right? Actually, Boaz realized that we get
hit by the power-of-two rounding. He has hash tables of 64K entries. Because we
need 64K + 1 entries in the offset vector, we actually allocate 128K offsets,
resulting in a waste of 256K to store those extra four bytes in a 64K batch.
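The arithmetic behind that waste, as a sketch of the power-of-two rounding:

```python
# With start-pointer semantics, a 64K-entry batch needs 64K + 1 offsets;
# allocations round up to a power of two, so the vector doubles to 128K ints.
entries = 64 * 1024
needed = entries + 1                 # start-pointer offsets: leading zero
allocated = 1
while allocated < needed:
    allocated *= 2                   # power-of-two rounding
wasted_bytes = (allocated - needed) * 4   # 4-byte int offsets, ~256 KB

# End-pointer semantics need exactly `entries` offsets: no rounding at all.
```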
All this points out that the JSON fix is not trivial; that’s why the original
PR didn’t make progress. We have to fix some fundamentals first to lay the
groundwork.
> JSON with complex nested data produces incorrect output with missing fields
> ---------------------------------------------------------------------------
>
> Key: DRILL-4824
> URL: https://issues.apache.org/jira/browse/DRILL-4824
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - JSON
> Affects Versions: 1.0.0
> Reporter: Roman
> Assignee: Volodymyr Vysotskyi
>
> There is incorrect output in case of JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
> "Field1" : {
> }
> }
> {
> "Field1" : {
> "InnerField1": {"key1":"value1"},
> "InnerField2": {"key2":"value2"}
> }
> }
> {
> "Field1" : {
> "InnerField3" : ["value3", "value4"],
> "InnerField4" : ["value5", "value6"]
> }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> | Field1 |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +---------------------------+
> {code}
> There is no need to output missing fields. In the case of a deeply nested
> structure, the result will be unreadable for the user.
> _Correct result:_
> {code:none}
> +--------------------------+
> | Field1 |
> +--------------------------+
> {}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}