[
https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017792#comment-16017792
]
Paul Rogers commented on DRILL-4824:
------------------------------------
Thanks for the explanation! Let’s take a step back and extract the
requirements/goals from the implementation outline:
* Allow maps to be nullable.
* Allow evolving the type of a column based on data observed.
Let’s talk a bit more about each one. For the map vector, I agree that we’d
need to add a “bit vector” to track the nullability of the entire map. This
will be tricky, as it must be coordinated with every entry in the map: if the
map is null, then every vector in the map must also be null (so that we
maintain proper row indexing and keep the offset vectors up to date). So, from
the perspective of existing code, a null map and a map of nulls are equivalent.
For output, however, a null map would be different, at least for JSON.
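To make the coordination concrete, here is a minimal sketch (plain Java
collections standing in for Drill’s actual vector classes; all names are
hypothetical) of how a nullable map could keep its null bit in sync with its
members:
{code:java}
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch only: a nullable map tracks its own "bit vector" (isSet) and
// coordinates it with every member, so row indexes stay aligned.
class NullableMapSketch {
  private final BitSet isSet = new BitSet();   // the proposed map-level bit vector
  private final List<List<Object>> members = new ArrayList<>();
  private int rowCount;

  void addMember() { members.add(new ArrayList<>()); }

  // A null map: clear the bit, but still write a null into every member
  // vector so row indexing and offset vectors stay up to date.
  void writeNullRow() {
    isSet.clear(rowCount);
    for (List<Object> member : members) { member.add(null); }
    rowCount++;
  }

  // A non-null map: set the bit and write one value per member.
  void writeRow(Object... values) {
    isSet.set(rowCount);
    for (int i = 0; i < members.size(); i++) { members.get(i).add(values[i]); }
    rowCount++;
  }

  // Existing readers see a null map and a map of nulls identically;
  // only output formats such as JSON need isSet to tell them apart.
  boolean isNull(int row) { return !isSet.get(row); }
}
{code}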
We’d have to do the same for arrays: a field “foo” might be an array in JSON,
or null. So we’d either need an “isNull” vector for repeated-type members, or
use a bitmap trick to fold the null flag into the array offset vector.
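One hypothetical shape for that bitmap trick, purely illustrative (today’s
offset vectors carry no null flag), is to steal the sign bit of each offset
entry:
{code:java}
// Sketch only: encode "this array is null" in the high bit of the
// 32-bit offset entry, avoiding a separate isNull vector.
class NullableOffsetSketch {
  private static final int NULL_BIT = 0x80000000;   // high bit marks a null array

  static int encode(int offset, boolean isNull) {
    return isNull ? (offset | NULL_BIT) : offset;
  }

  static boolean isNull(int encoded) { return (encoded & NULL_BIT) != 0; }

  static int offset(int encoded)     { return encoded & ~NULL_BIT; }
}
{code}
The trade-off is giving up the top bit of the offset range, which may argue
for the separate “isNull” vector instead.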
Let's think about changing the data type. We can only play the “revise the type
based on new info” game on the first batch. Once the JSON reader sends a batch
downstream, changing the type becomes a schema change, which would be fine
except that few Drill operators, and no JDBC/ODBC clients, handle schema
changes.
Still, the idea is good for data that varies frequently within the first
batch (the first 10K–60K records). For example, seeing “10, 20, 30, 40.5”
means the value could start as an integer, then evolve to a double.
Perhaps we can do this in the new mutator created for DRILL-5211. Just start
writing data as one type and silently replace the original vector with a new
one of the new type. We’d define a “promotion” matrix: vector x can be promoted
to y (int to long to double to decimal, say).
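A minimal sketch of such a matrix, assuming a simple linear promotion chain
and hypothetical type names:
{code:java}
import java.util.EnumMap;
import java.util.Map;

enum TypeSketch { INT, BIGINT, FLOAT8, DECIMAL }

// Sketch only: the promotion matrix as a chain, int -> long -> double ->
// decimal, as suggested above. A real matrix could be sparser or denser.
class PromotionMatrix {
  private static final Map<TypeSketch, TypeSketch> NEXT = new EnumMap<>(TypeSketch.class);
  static {
    NEXT.put(TypeSketch.INT, TypeSketch.BIGINT);
    NEXT.put(TypeSketch.BIGINT, TypeSketch.FLOAT8);
    NEXT.put(TypeSketch.FLOAT8, TypeSketch.DECIMAL);
  }

  // Vector type "from" can be promoted to "to" if "to" appears later in the chain.
  static boolean canPromote(TypeSketch from, TypeSketch to) {
    for (TypeSketch t = NEXT.get(from); t != null; t = NEXT.get(t)) {
      if (t == to) { return true; }
    }
    return false;
  }
}
{code}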
If we do this in the mutator, then every reader will have the ability to do the
same trick. That is, perhaps only JSON needs the set/not-set flag and nullable
arrays and maps. But, all readers can benefit from the ability to evolve type
selection based on observed data.
Changing the data type may require copying: copying the first 5K ints, say,
when we discover that the type is really double. I'd suggest that the cost of
copying is acceptable. We copy the data anyway as we grow vectors. In general,
a 16 MB vector (the new max size) will get that way by doubling from, say,
256K: 256K, 512K, 1M, 2M, 4M, 8M, 16M. (This is something I hope to improve,
but that is another topic.)
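For scale, a quick check of that doubling arithmetic (sizes assumed as above):
{code:java}
// Sketch only: counting the copies already implied by growth-by-doubling.
// One extra copy for a type promotion is in the same ballpark.
class VectorGrowthSketch {
  public static void main(String[] args) {
    int size = 256 * 1024;                 // starting allocation, per above
    final int max = 16 * 1024 * 1024;      // the new 16 MB maximum
    int doublings = 0;
    while (size < max) {
      size *= 2;                           // each doubling copies the old data
      doublings++;
    }
    System.out.println("doublings to reach 16 MB: " + doublings);  // prints 6
  }
}
{code}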
The new mutator (vector writers) works by having a single column writer type
with methods such as setInt(), setLong(), setDouble(), and so on. Internally,
each “column writer” turns around and calls a generated, type-specific writer.
So setInt() calls setInt() on the version generated for IntVector. For an int,
all other methods (setLong(), setDouble()) throw an exception.
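A rough sketch of that dispatch scheme, with illustrative names rather than
the actual DRILL-5211 classes:
{code:java}
// Sketch only: the single column writer type. The base class rejects
// every set method; each generated, type-specific writer overrides
// exactly one of them.
abstract class ColumnWriterSketch {
  public void setInt(int v)       { throw new UnsupportedOperationException("INT"); }
  public void setLong(long v)     { throw new UnsupportedOperationException("BIGINT"); }
  public void setDouble(double v) { throw new UnsupportedOperationException("FLOAT8"); }
}

// The version "generated" for an int vector: setInt() works, the rest throw.
class IntColumnWriterSketch extends ColumnWriterSketch {
  private final int[] vector = new int[4096];   // stand-in for an IntVector
  private int index;

  @Override public void setInt(int v) { vector[index++] = v; }

  int valueCount() { return index; }
  int[] values()   { return vector; }
}

// The version "generated" for a double vector.
class DoubleColumnWriterSketch extends ColumnWriterSketch {
  private final double[] vector = new double[4096];
  private int index;

  @Override public void setDouble(double v) { vector[index++] = v; }
}
{code}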
To allow type promotion, we’d create a second implementation that would:
* Promote the vector and change writers as needed (calling setDouble on an int
vector, say).
* Convert compatible types (calling setInt on a double vector, say).
The result is that the work is completely transparent to the record reader. The
record reader just calls setFoo() for some type Foo, and the mutator does the
rest.
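Building on the writer sketch above, a hypothetical promoting implementation
might look like this:
{code:java}
// Sketch only: a promoting wrapper around the type-specific writers.
// The record reader holds only this object and calls setFoo() blindly.
class PromotingColumnWriterSketch extends ColumnWriterSketch {
  private ColumnWriterSketch delegate = new IntColumnWriterSketch();

  // Compatible conversion: setInt() on a double vector widens the value.
  @Override public void setInt(int v) {
    try {
      delegate.setInt(v);
    } catch (UnsupportedOperationException e) {
      delegate.setDouble(v);                      // int fits in a double
    }
  }

  // Promotion: setDouble() on an int vector copies the ints written so
  // far into a new double vector, then swaps writers.
  @Override public void setDouble(double v) {
    if (delegate instanceof IntColumnWriterSketch) {
      IntColumnWriterSketch old = (IntColumnWriterSketch) delegate;
      DoubleColumnWriterSketch wider = new DoubleColumnWriterSketch();
      for (int i = 0; i < old.valueCount(); i++) {
        wider.setDouble(old.values()[i]);         // the copy discussed above
      }
      delegate = wider;
    }
    delegate.setDouble(v);
  }
}
{code}
Seeing “10, 20, 30, 40.5” would then write three ints, hit setDouble() on the
fourth value, and promote, with the reader none the wiser.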
If we go that route, we can divide up the work into a number of JIRAs and work
out who does which parts.
> JSON with complex nested data produces incorrect output with missing fields
> ---------------------------------------------------------------------------
>
> Key: DRILL-4824
> URL: https://issues.apache.org/jira/browse/DRILL-4824
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - JSON
> Affects Versions: 1.0.0
> Reporter: Roman
> Assignee: Volodymyr Vysotskyi
>
> There is incorrect output in the case of a JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
>   "Field1" : {
>   }
> }
> {
>   "Field1" : {
>     "InnerField1": {"key1":"value1"},
>     "InnerField2": {"key2":"value2"}
>   }
> }
> {
>   "Field1" : {
>     "InnerField3" : ["value3", "value4"],
>     "InnerField4" : ["value5", "value6"]
>   }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> |          Field1           |
> +---------------------------+
> | {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]} |
> | {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]} |
> | {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} |
> +---------------------------+
> {code}
> There is no need to output the missing fields. In the case of a deeply
> nested structure, the user will get an unreadable result.
> _Correct result:_
> {code:none}
> +---------------------------+
> |          Field1           |
> +---------------------------+
> | {} |
> | {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}} |
> | {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]} |
> +---------------------------+
> {code}