[ 
https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017792#comment-16017792
 ] 

Paul Rogers commented on DRILL-4824:
------------------------------------

Thanks for the explanation! Let’s take a step back and extract the 
requirements/goals from the implementation outline:

* Allow maps to be nullable.
* Allow evolving the type of a column based on data observed.

Let’s talk a bit more about each one. For the map vector, I agree that we’d 
need to add a “bit vector” to track the nullability of the entire map. This 
will be tricky because it must be coordinated with each entry in the map: if 
the map is null, then every vector in the map must also be null (so that we 
maintain proper row indexing and keep the offset vectors up to date). So, from 
the perspective of existing code, a null map and a map of nulls are 
equivalent. For output, however, a null map would be different, at least for 
JSON.
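
Roughly, something like this (a toy sketch with invented names; these are not 
the actual Drill vector classes):

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a "nullable map" column: a per-row null flag plus one
// child column per map member. All names are invented for this sketch.
class NullableMapColumn {
  private final boolean[] isSet = new boolean[1024];   // the extra "bit vector"
  private final Map<String, String[]> children = new LinkedHashMap<>();

  void setNull(int row) {
    isSet[row] = false;
    // Keep children coordinated: a null map implies a null entry in
    // every member vector, so row indexes and offsets stay aligned.
    for (String[] child : children.values()) {
      child[row] = null;
    }
  }

  void set(int row, String member, String value) {
    isSet[row] = true;
    children.computeIfAbsent(member, k -> new String[1024])[row] = value;
  }

  // Existing code sees a null map and a map of nulls identically; only
  // output (JSON, say) would render "null" instead of a map of nulls.
  boolean isNull(int row) { return !isSet[row]; }
}
{code}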

We’d have to do the same for arrays. A field foo might be an array in JSON, or 
null. So, we’d either need an “isNull” vector for repeated-type members, or 
use a bitmap trick to fold this info into the array offset vector, as in the 
sketch below.
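
The offset-vector variant might look like this (again, purely illustrative; 
Drill’s real offset vectors carry no null bit today):

{code:java}
// Sketch of the bitmap trick: steal the sign bit of each end offset to
// mark a null array, avoiding a separate "isNull" vector entirely.
class NullableRepeatedOffsets {
  private static final int NULL_BIT = 0x80000000;
  private final int[] offsets = new int[1025];  // end(i) - start(i) = length of row i

  void markNull(int row) {
    // A null array is stored as an empty one (end == start), with the
    // high bit recording "null" rather than merely "[]".
    offsets[row + 1] = (offsets[row] & ~NULL_BIT) | NULL_BIT;
  }

  boolean isNull(int row) { return (offsets[row + 1] & NULL_BIT) != 0; }
  int start(int row)      { return offsets[row] & ~NULL_BIT; }
  int end(int row)        { return offsets[row + 1] & ~NULL_BIT; }
}
{code}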

Let's think about changing the data type. We can only play the “revise the 
type based on new info” game on the first batch. Once the JSON reader sends a 
batch downstream, changing the type becomes a schema change, which would be 
fine except that many Drill operators do not handle schema changes, and no 
JDBC/ODBC clients do.

Still, the idea is good for data that has frequent variation within the first 
batch (the first 10K-60K records). For example, seeing “10, 20, 30, 40.5” 
would mean that the value could start as an integer, then evolve to a double.

Perhaps we can do this in the new mutator created for DRILL-5211. Just start 
writing data as one type and silently replace the original vector with a new 
one of the new type. We’d define a “promotion” matrix: vector x can be 
promoted to vector y (int to long to double to decimal, say).
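
A minimal version of that matrix might be (the strict widening order is my 
assumption, taken from the int-to-long-to-double-to-decimal chain above):

{code:java}
// Sketch of a promotion matrix: may vector type X be silently
// replaced by vector type Y? Widening only; never demote.
enum PromotableType {
  INT, LONG, DOUBLE, DECIMAL;

  boolean canPromoteTo(PromotableType target) {
    return target.ordinal() >= this.ordinal();
  }
}

// PromotableType.INT.canPromoteTo(PromotableType.DOUBLE)  => true
// PromotableType.DOUBLE.canPromoteTo(PromotableType.INT)  => false
{code}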

If we do this in the mutator, then every reader will have the ability to do 
the same trick. That is, perhaps only JSON needs the set/not-set flag and 
nullable arrays and maps. But all readers can benefit from the ability to 
evolve type selection based on observed data.

Changing the data type may require copying: the first 5K ints, say, when we 
discover that the type is really double. I'd suggest that the cost of copying 
is acceptable: we copy the data anyway as we grow vectors. In general, a 16 MB 
vector (the new max size) will get that way by doubling from, say, 256K: 256K, 
512K, 1M, 2M, 4M, 8M, 16M. (This is something I hope to improve, but that is 
another topic.)
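
To put a number on it, here is the back-of-the-envelope math for the doubling 
case:

{code:java}
// Total bytes copied while doubling a vector from 256 KB to the 16 MB
// cap: each doubling copies the old buffer into the new one.
public class DoublingCost {
  public static void main(String[] args) {
    long copied = 0;
    for (long size = 256 * 1024; size < 16 * 1024 * 1024; size *= 2) {
      copied += size;               // old contents copied on each grow
    }
    // Prints 16,515,072 (~15.75 MB): less than one full copy of the
    // final 16 MB vector, so a one-time promotion copy is in the same
    // ballpark as the growth copies we already pay for.
    System.out.printf("total copied: %,d bytes%n", copied);
  }
}
{code}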

The new mutator (vector writers) works by having a single column writer type 
with methods like setInt, setLong, setDouble, etc. Internally, each “column 
writer” turns around and calls a generated, type-specific writer. So, setInt() 
calls setInt() on the version generated for IntVector. For an int vector, all 
other methods (setLong, setDouble) throw an exception.
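
In outline (invented names; the actual DRILL-5211 classes may differ):

{code:java}
// The generic writer exposes every setter; each generated,
// type-specific writer accepts only its own type and rejects the rest.
interface ColumnWriter {
  void setInt(int v);
  void setLong(long v);
  void setDouble(double v);
}

class IntColumnWriter implements ColumnWriter {
  @Override public void setInt(int v) { /* write v into the IntVector */ }
  @Override public void setLong(long v) {
    throw new UnsupportedOperationException("setLong on an int column");
  }
  @Override public void setDouble(double v) {
    throw new UnsupportedOperationException("setDouble on an int column");
  }
}
{code}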

To allow type promotion, we’d create a second implementation (sketched below) 
that would:

* Promote the vector and change writers as needed (calling setDouble on an int 
vector, say).
* Convert compatible types (calling setInt on a double vector, say).
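
Continuing the sketch above (same caveat: all names are invented), the 
promoting implementation might look like:

{code:java}
import java.util.ArrayList;
import java.util.List;

class DoubleColumnWriter implements ColumnWriter {
  @Override public void setInt(int v) { setDouble(v); }    // compatible: convert
  @Override public void setLong(long v) { setDouble(v); }
  @Override public void setDouble(double v) { /* write v into the DoubleVector */ }
}

class PromotingColumnWriter implements ColumnWriter {
  private ColumnWriter delegate = new IntColumnWriter();
  private final List<Integer> intsSoFar = new ArrayList<>();

  @Override public void setInt(int v) {
    if (delegate instanceof IntColumnWriter) {
      intsSoFar.add(v);           // remember values in case we promote later
    }
    delegate.setInt(v);           // a double delegate converts the int itself
  }

  @Override public void setDouble(double v) {
    if (delegate instanceof IntColumnWriter) {
      // Promote: swap in a double writer and re-copy the ints written
      // so far -- the one-time copying cost discussed above.
      ColumnWriter promoted = new DoubleColumnWriter();
      for (int old : intsSoFar) {
        promoted.setInt(old);
      }
      delegate = promoted;
    }
    delegate.setDouble(v);
  }

  @Override public void setLong(long v) { setDouble(v); }  // promote via the same path
}
{code}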

The result is that the work is completely transparent to the record reader. The 
record reader just calls setFoo() for some type Foo, and the mutator does the 
rest.

If we go that route, we can divide up the work into a number of JIRAs and work 
out who does which parts.

> JSON with complex nested data produces incorrect output with missing fields
> ---------------------------------------------------------------------------
>
>                 Key: DRILL-4824
>                 URL: https://issues.apache.org/jira/browse/DRILL-4824
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - JSON
>    Affects Versions: 1.0.0
>            Reporter: Roman
>            Assignee: Volodymyr Vysotskyi
>
> There is incorrect output in case of JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
>         "Field1" : {
>         }
> }
> {
>         "Field1" : {
>                 "InnerField1": {"key1":"value1"},
>                 "InnerField2": {"key2":"value2"}
>         }
> }
> {
>         "Field1" : {
>                 "InnerField3" : ["value3", "value4"],
>                 "InnerField4" : ["value5", "value6"]
>         }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> |          Field1           |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +---------------------------+
> {code}
> There is no need to output missing fields. In the case of a deeply nested 
> structure, the result becomes unreadable for the user.
> _Correct result:_
> {code:none}
> +--------------------------+
> |         Field1           |
> +--------------------------+
> {}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
