paul-rogers opened a new pull request #2023: DRILL-7640: EVF-based JSON Loader
URL: https://github.com/apache/drill/pull/2023
 
 
   # [DRILL-7640](https://issues.apache.org/jira/browse/DRILL-7640): EVF-based 
JSON Loader
   
   ## Description
   
   Builds on the JSON structure parser and several other PRs to provide an 
enhanced, robust mechanism to read JSON data into value vectors via the EVF. 
This is not the JSON reader, rather it is the "V2" version of the JsonProcessor 
which does the actual JSON parsing/loading work.
   
   This PR is a partial do-over of an earlier PR, DRILL-6953. This PR contains 
only the lower-level JSON loader. A new PR for DRILL-6953 will follow which 
will add the JSON reader and fix any compatibility issues. Doing the PR 
separately reduces the size of each PR, making them easier to review and manage.
   
   The concept is that we already have the JSON structure parser (thanks to 
those that reviewed those PRs!). The structure parser emits *events* to 
*listeners*. This PR implements the listeners which use EVF to write values to 
value vectors.
   
   Some new features provided by the JSON loader include:
   
   * Built in conversion from (almost) any JSON type to (almost) any other 
type. If a field starts a String, and shifts to Number, Drill will continue to 
write the value as `VARCHAR`.
   * Allow runs of nulls (`null`) and/or empty arrays (`[ ]`). Defer type 
selection until the first actual value appears. (Or, force selection of 
`Nullable VARCHAR` if the batch ends without seeing any type.)
   * An array with null entries `[ null ]` forces type selection to `Nullable 
VARCHAR` since we must count the null entries.
   * Support for a provided schema. If the input is ambiguous, or inconsistent, 
the provided schema states the desired column type; the JSON loader converts 
data to that type.
   * The provided schema allows "text mode" selection per-column. The JSON 
loader supports the "all text mode" setting from before, but now allows "text 
mode" per column for only those columns which are a problem.
   * The provided schema also allows a new "JSON mode": the field, and all its 
children, are read as JSON text. That is, if the JSON is `{a: {b: ["foo", 10, 
true]}}`, and column `a` is read as JSON, then the value of `a` is `"{b: 
["foo", 10, true]}"`.
   
   We can expect a number of tweaks and adjustments to be needed in a later PR 
when we get the existing tests to pass with the new JSON format plugin. The 
goal here is to simply get the bulk of the work reviewed separately.
   
   To be clear, in this PR, nothing other than unit tests uses this new code.
   
   This PR contains a few changes which duplicate those in the PR for 
DRILL-7633. Once DRILL-7633 is merged, this PR will be rebased and the 
overlapping changes will disappear.
   
   ## Documentation
   
   Once the JSON format plugin PR is submitted, users can use the provided 
column properties mentioned above. This PR does not yet expose them, so no 
documentation is needed yet.
   
   ## Testing
   
   Added unit tests for all of the cases and features described above. Also 
tests failure cases (such as JSON with inconsistent types which the JSON loader 
cannot handle.)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to