Paul Rogers created DRILL-6953:
----------------------------------

             Summary: Merge row set-based JSON reader
                 Key: DRILL-6953
                 URL: https://issues.apache.org/jira/browse/DRILL-6953
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.15.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers
             Fix For: 1.16.0


The final step in the ongoing "result set loader" saga is to merge the revised 
JSON reader into master. This reader does three key things:

* Demonstrates the prototypical "late schema" style of data reading (discover 
schema while reading).
* Implements many tricks and hacks to handle schema changes while loading.
* Shows that, even with all these tricks, the only true solution is to actually 
have a schema.

The new JSON reader:

* Uses an expanded state machine when parsing rather than the complex set of 
if-statements in the current version.
* Handles reading a run of nulls before seeing the first data value (as long as 
the data value shows up in the first record batch).
* Uses the result-set loader to generate fixed-size batches regardless of the 
complexity, depth of structure, or width of variable-length fields.
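To illustrate the run-of-nulls case above, here is a minimal sketch of deferring a column's type until the first non-null value arrives. It is not the reader's actual code: the class DeferredTypeColumn, its Type enum, and the endBatch() fallback are hypothetical names invented for this example, and the choice of fallback type when a batch ends with only nulls is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Drill's API: a column whose type stays
// UNKNOWN until the first non-null value shows up.
class DeferredTypeColumn {
  enum Type { UNKNOWN, BIGINT, DOUBLE, VARCHAR }

  private Type type = Type.UNKNOWN;
  private int pendingNulls = 0;  // nulls seen while the type was unknown
  private final List<Object> values = new ArrayList<>();

  // Accepts one parsed JSON scalar (or null) for this column.
  // Later type changes are not handled in this sketch.
  void write(Object jsonValue) {
    if (jsonValue == null) {
      if (type == Type.UNKNOWN) {
        pendingNulls++;          // cannot materialize yet, just count
      } else {
        values.add(null);
      }
      return;
    }
    if (type == Type.UNKNOWN) {
      resolve(infer(jsonValue)); // first real value fixes the type
    }
    values.add(jsonValue);
  }

  // Assumption: if the batch ends with only nulls, the caller must
  // supply some type, since a batch needs a concrete schema.
  void endBatch(Type fallback) {
    if (type == Type.UNKNOWN) {
      resolve(fallback);
    }
  }

  // Fixes the type and back-fills the nulls that were only counted.
  private void resolve(Type resolved) {
    type = resolved;
    for (int i = 0; i < pendingNulls; i++) {
      values.add(null);
    }
    pendingNulls = 0;
  }

  private static Type infer(Object v) {
    if (v instanceof Long || v instanceof Integer) return Type.BIGINT;
    if (v instanceof Double || v instanceof Float) return Type.DOUBLE;
    if (v instanceof String) return Type.VARCHAR;
    throw new IllegalArgumentException("Unsupported JSON scalar: " + v);
  }

  Type type() { return type; }
  List<Object> values() { return values; }
}
```

The fallback in endBatch() is where a guess becomes unavoidable, which is the weakness that a real schema removes.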

While the JSON reader itself is helpful, the key contribution is that it shows 
how to use the entire kit of parts: result set loader, projection framework, 
and so on. Since the projection framework can handle an external schema, it is 
also a handy foundation for the ongoing schema project.

Key work to complete after this merger will be to reconcile actual data with 
the external schema. For example, if we know a column is supposed to be a 
VarChar, then read the column as a VarChar regardless of the type JSON itself 
picks. Or, if a column is supposed to be a Double, then convert Int and String 
JSON values into Doubles.
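As a rough sketch of that reconciliation, the conversion rules in the two examples could look like the following. The names SchemaCoercion, DeclaredType, and coerce() are hypothetical and not part of Drill; a real implementation would write into value vectors rather than return boxed objects.

```java
// Hypothetical sketch, not Drill code: coerce a parsed JSON scalar to
// the type declared in an external schema.
class SchemaCoercion {
  enum DeclaredType { VARCHAR, DOUBLE }

  static Object coerce(Object jsonValue, DeclaredType declared) {
    if (jsonValue == null) {
      return null;  // nulls pass through unchanged
    }
    switch (declared) {
      case VARCHAR:
        // Read the column as text regardless of the type JSON picked.
        return String.valueOf(jsonValue);
      case DOUBLE:
        if (jsonValue instanceof Number) {
          return ((Number) jsonValue).doubleValue();
        }
        if (jsonValue instanceof String) {
          // Throws NumberFormatException for non-numeric text.
          return Double.parseDouble(((String) jsonValue).trim());
        }
        throw new IllegalArgumentException(
            "Cannot convert to DOUBLE: " + jsonValue);
      default:
        throw new IllegalStateException("Unhandled type: " + declared);
    }
  }
}
```

For example, coerce(3, DeclaredType.DOUBLE) and coerce("3", DeclaredType.DOUBLE) both yield the Double 3.0, so an Int or String JSON value no longer triggers a schema change.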

The Row Set framework was designed to allow inserting custom column writers, 
and this would be a great opportunity to do the work needed to create them. 
The new JSON framework could then use those writers to parse a JSON field as a 
specified Drill type.
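One way to picture a custom column writer is as a shim placed in front of the column's normal writer. The sketch below is only an illustration of that idea: SketchColumnWriter, LongColumn, and StringToLongShim are hypothetical names, and the Row Set framework's real writer interfaces differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical writer interface for this sketch only.
interface SketchColumnWriter {
  void setString(String value);
  void setLong(long value);
}

// Base writer for a column declared as an integer type: stores longs.
class LongColumn implements SketchColumnWriter {
  final List<Long> values = new ArrayList<>();

  public void setLong(long value) {
    values.add(value);
  }

  public void setString(String value) {
    throw new UnsupportedOperationException("column stores longs only");
  }
}

// Custom writer inserted in front of the base writer. The JSON parser
// calls setString() for a quoted value; the shim converts it to the
// column's declared type before passing it on.
class StringToLongShim implements SketchColumnWriter {
  private final SketchColumnWriter target;

  StringToLongShim(SketchColumnWriter target) {
    this.target = target;
  }

  public void setString(String value) {
    // Throws NumberFormatException if the text is not an integer.
    target.setLong(Long.parseLong(value.trim()));
  }

  public void setLong(long value) {
    target.setLong(value);
  }
}
```

With this arrangement the parser stays unaware of the declared type: it writes whatever JSON gives it, and the shim chosen from the schema does the conversion.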




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
