[
https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016738#comment-16016738
]
Paul Rogers commented on DRILL-4824:
------------------------------------
Additional thoughts as we again look at this bug.
The problem is not in the reader itself; it is in how Drill represents JSON.
To fix this, we’d have to allow multiple null states. To do that, we’d have to
adjust how we represent nulls, which has its own set of issues. See earlier
comments.
Today, the “isSet” (bit) vector is 0 for null, 1 for set. To allow multiple
null states, we need semantics that say 0 = set, non-zero = null. Then, 0x01
is plain old null, 0x03 could indicate null-and-unset.
Then, the reader (actually the mutator) would have to fill in the proper null
value for missing fields. That is, when we write record 100 (say), we’d notice
that we’ve not written a value for column x since record 95, so we’d fill in
the “missing” values with the null-and-unset value.
Today, we can just rely on the default value of 0 to indicate null. But, for
variable-width columns, we do have to back-fill the offset vectors, so we could
do the same logic for all nullable types.
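The back-fill the mutator would do might look roughly like this, assuming a per-value flag byte where 0x03 marks null-and-unset (an illustrative value, not a Drill constant):

```python
# Hypothetical back-fill: before writing record `row`, fill every record
# skipped since `last_written` with the null-and-unset flag.
NULL_UNSET = 0x03

def back_fill(flags, last_written, row):
    for i in range(last_written + 1, row):
        flags[i] = NULL_UNSET

# Writing record 100 after last touching this column at record 95:
flags = [0x00] * 101
back_fill(flags, 95, 100)   # records 96..99 become null-and-unset
```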
Once we have the two forms of null flags, then the JSON writer can do the right
thing. If just null, emit “foo: null”. If null-and-unset, skip emitting the
field.
The result is that we should be able to scan, then CTAS a JSON file and get
semantically the same output as the input (without removing null fields and
without inserting nulls for missing fields — our only two choices today.)
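The writer's decision then becomes trivial; a sketch, assuming flag values 0x01 for plain null and 0x03 for null-and-unset (names are illustrative):

```python
NULL = 0x01
NULL_UNSET = 0x03

def write_field(record, name, flag, value):
    # Round-trip-preserving emit: skip missing fields, keep explicit nulls.
    if flag == NULL_UNSET:
        return                                # absent in input: skip it
    record[name] = None if flag == NULL else value

rec = {}
write_field(rec, "a", 0x00, "v")              # set value: emitted
write_field(rec, "b", NULL, None)             # explicit null: "b": null
write_field(rec, "c", NULL_UNSET, None)       # missing: not emitted
```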
The work to prevent memory fragmentation is creating a new "size-aware" mutator
(vector writer). We can easily extend that work to handle the two null cases.
But, the big project is changing the “polarity” of null: doing so requires
inspecting all code.
One other related improvement has to do with variable-width columns. Today, we
have an inefficiency: we need the data vector, the offset vector, and the null
(bit) vector. As a result of my changes, no vector can be larger than 16 MB,
which means no offset can be larger than 16 MB (0x100_0000). We store
offsets as ints, maximum value of 0xFFFF_FFFF. This means we can play a very
simple trick: use bits 29 and 30 (or 30 and 31 if we don’t mind negatives) to
hold the bit flags and simply omit the bit vector. That immediately saves 64K
memory per nullable VarChar per batch.
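The packing trick can be sketched like this (the bit positions and masks are assumptions for illustration; the point is that offsets capped at 16 MB leave the high bits free):

```python
# Fold the two null-flag bits into the offset word itself.
FLAG_SHIFT = 29
OFFSET_MASK = (1 << FLAG_SHIFT) - 1

def pack(offset, flags):
    assert offset <= (1 << 24)          # 16 MB limit keeps high bits free
    return (flags << FLAG_SHIFT) | offset

def unpack(word):
    # Returns (offset, flags) recovered from the packed word.
    return word & OFFSET_MASK, (word >> FLAG_SHIFT) & 0x3
```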
And if we change the offset vectors, we should change the semantics from “store
the start pointer” to “store the end pointer.” That is, instead of:
[0, 10, 20, 30]
To store three 10-byte strings, use:
[10, 20, 30]
So we save four bytes, no big deal, right? Actually, Boaz realized that we get
hit by the power-of-two rounding. He has hash tables of 64K entries. Because we
need 64K + 1 entries in the offset vector, we actually allocate 128K offsets,
resulting in a waste of 256K to store those extra four bytes in a 64K batch.
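The arithmetic behind that waste, as a sketch of the power-of-two rounding:

```python
# With start-pointer semantics, a 64K-entry batch needs 64K + 1 offsets;
# allocations round up to a power of two, so the vector doubles to 128K ints.
entries = 64 * 1024
needed = entries + 1                 # start-pointer offsets: leading zero
allocated = 1
while allocated < needed:
    allocated *= 2                   # power-of-two rounding
wasted_bytes = (allocated - needed) * 4   # 4-byte int offsets, ~256 KB

# End-pointer semantics need exactly `entries` offsets: no rounding at all.
```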
All this points out that the JSON fix is not trivial; that’s why the original
PR didn’t make progress. We have to fix some fundamentals first to lay the
groundwork.
> JSON with complex nested data produces incorrect output with missing fields
> ---------------------------------------------------------------------------
>
> Key: DRILL-4824
> URL: https://issues.apache.org/jira/browse/DRILL-4824
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - JSON
> Affects Versions: 1.0.0
> Reporter: Roman
> Assignee: Volodymyr Vysotskyi
>
> There is incorrect output in case of JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
> "Field1" : {
> }
> }
> {
> "Field1" : {
> "InnerField1": {"key1":"value1"},
> "InnerField2": {"key2":"value2"}
> }
> }
> {
> "Field1" : {
> "InnerField3" : ["value3", "value4"],
> "InnerField4" : ["value5", "value6"]
> }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> | Field1 |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +---------------------------+
> {code}
> There is no need to output missing fields. In the case of a deeply nested
> structure, the result will be unreadable for the user.
> _Correct result:_
> {code:none}
> +--------------------------+
> | Field1 |
> +--------------------------+
> {}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}