[ 
https://issues.apache.org/jira/browse/DRILL-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985453#comment-16985453
 ] 

ASF GitHub Bot commented on DRILL-6953:
---------------------------------------

paul-rogers commented on issue #1913: DRILL-6953: EVF-based version of the JSON 
reader
URL: https://github.com/apache/drill/pull/1913#issuecomment-560030850
 
 
   Some background: this PR includes work completed about two years ago as part 
of the "row set" (EVF) project. We had to first get the EVF itself reviewed and 
merged, then we added provided schema support. The first attempt to merge the 
JSON reader uncovered many issues with batch, record and vector counts. Those 
have been fixed over the last couple of months. This time, the unit tests pass 
with the new JSON reader.
   
   This PR leaves the old "V1" reader enabled by default. More testing is 
required before we enable the "V2" reader by default.
   
   Because this work pre-dated the "provided schema" work, it does not yet 
support the provided schema. Let's get this version merged, then we can add the 
additional work needed to support a provided schema.
   
   Also, any work done in the "V1" JSON reader in the last two years is not yet 
reflected in the "V2" version. We will make any such changes after this PR.
   
   JSON is a surprisingly complex and tricky format. Suggestions for further 
tests or improvements are welcome.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Merge row set-based JSON reader
> -------------------------------
>
>                 Key: DRILL-6953
>                 URL: https://issues.apache.org/jira/browse/DRILL-6953
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.15.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>             Fix For: Future
>
>
> The final step in the ongoing "result set loader" saga is to merge the 
> revised JSON reader into master. This reader does three key things:
> * Demonstrates the prototypical "late schema" style of data reading (discover 
> schema while reading).
> * Implements many tricks and hacks to handle schema changes while loading.
> * Shows that, even with all these tricks, the only true solution is to 
> actually have a schema.
> The new JSON reader:
> * Uses an expanded state machine when parsing rather than the complex set of 
> if-statements in the current version.
> * Handles reading a run of nulls before seeing the first data value (as long 
> as the data value shows up in the first record batch).
> * Uses the result-set loader to generate fixed-size batches regardless of the 
> complexity, depth of structure, or width of variable-length fields.
> While the JSON reader itself is helpful, the key contribution is that it 
> shows how to use the entire kit of parts: result set loader, projection 
> framework, and so on. Since the projection framework can handle an external 
> schema, it is also a handy foundation for the ongoing schema project.
> Key work to complete after this merge will be to reconcile actual data with 
> the external schema. For example, if we know a column is supposed to be a 
> VarChar, then read the column as a VarChar regardless of the type JSON itself 
> picks. Or, if a column is supposed to be a Double, then convert Int and 
> String JSON values into Doubles.
> The Row Set framework was designed to allow inserting custom column writers. 
> This would be a great opportunity to do the work needed to create them. Then, 
> use the new JSON framework to allow parsing a JSON field as a specified Drill 
> type.
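
To make the "reconcile actual data with the external schema" idea concrete, here is a minimal sketch of coercing a parsed JSON scalar to a declared column type. It is only an illustration under stated assumptions: `ProvidedTypeSketch`, `TargetType`, and `coerce` are hypothetical names, not part of Drill's EVF or column-writer API, and parsed JSON scalars are assumed to arrive as `Long`, `Double`, `String`, `Boolean`, or `null`.

```java
// Hypothetical sketch; these names are NOT Drill's actual API.
class ProvidedTypeSketch {

    // Assumed subset of declared column types, for illustration only.
    enum TargetType { VARCHAR, DOUBLE }

    /**
     * Coerce a parsed JSON scalar to the type declared in the provided schema,
     * regardless of the type JSON itself picked for the value.
     */
    static Object coerce(Object jsonValue, TargetType target) {
        if (jsonValue == null) {
            return null; // nulls pass through unchanged
        }
        switch (target) {
            case VARCHAR:
                // Any scalar can be rendered as text.
                return String.valueOf(jsonValue);
            case DOUBLE:
                if (jsonValue instanceof Number) {
                    // Covers Int-like (Long) and floating-point JSON numbers.
                    return ((Number) jsonValue).doubleValue();
                }
                if (jsonValue instanceof String) {
                    // Throws NumberFormatException if the text is not numeric.
                    return Double.parseDouble(((String) jsonValue).trim());
                }
                throw new IllegalArgumentException(
                    "Cannot convert " + jsonValue.getClass().getSimpleName() + " to DOUBLE");
            default:
                throw new IllegalStateException("Unhandled target type: " + target);
        }
    }
}
```

In a real reader this logic would presumably live in a custom column writer chosen once per column from the provided schema, rather than in a per-value switch.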


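The "run of nulls before the first data value" case in the description can be sketched as deferring the column type: count leading nulls, pick a type when the first non-null value arrives, then back-fill the nulls. This is a rough illustration, not the reader's actual implementation; `DeferredNullColumnSketch`, `ColType`, and the in-memory list standing in for a value vector are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these names are NOT Drill's actual API.
class DeferredNullColumnSketch {

    // Assumed subset of column types, for illustration only.
    enum ColType { UNKNOWN, BIGINT, DOUBLE, VARCHAR, BIT }

    private ColType type = ColType.UNKNOWN;
    private int pendingNulls = 0;                           // nulls seen before the type was known
    private final List<Object> values = new ArrayList<>();  // stand-in for a value vector

    /** Write one parsed JSON scalar (Long, Double, String, Boolean, or null). */
    void write(Object jsonValue) {
        if (jsonValue == null) {
            if (type == ColType.UNKNOWN) {
                pendingNulls++;        // type not yet known: just count the null
            } else {
                values.add(null);
            }
            return;
        }
        if (type == ColType.UNKNOWN) {
            type = inferType(jsonValue);
            // Back-fill the deferred nulls now that a type has been chosen.
            for (int i = 0; i < pendingNulls; i++) {
                values.add(null);
            }
            pendingNulls = 0;
        }
        values.add(jsonValue);
    }

    private static ColType inferType(Object value) {
        if (value instanceof Long || value instanceof Integer) return ColType.BIGINT;
        if (value instanceof Number) return ColType.DOUBLE;
        if (value instanceof Boolean) return ColType.BIT;
        return ColType.VARCHAR;
    }

    ColType type() { return type; }
    int pendingNulls() { return pendingNulls; }
    List<Object> values() { return values; }
}
```

If the batch ends while the type is still `UNKNOWN`, the sketch has nothing to go on, which mirrors the caveat in the description that the data value must show up in the first record batch.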

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
