[jira] [Updated] (DRILL-6062) Simplify JSON input format

Paul Rogers (JIRA) Thu, 28 Dec 2017 13:18:07 -0800

     [ 
https://issues.apache.org/jira/browse/DRILL-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul Rogers updated DRILL-6062:
-------------------------------
    Summary: Simplify JSON input format  (was: Simplify, Document JSON input 
format)

> Simplify JSON input format
> --------------------------
>
>                 Key: DRILL-6062
>                 URL: https://issues.apache.org/jira/browse/DRILL-6062
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Paul Rogers
>
> DRILL-6035 defines the limitations with Drill's 1.12 and 1.13 JSON readers. 
> Many of these limitations are due to the difficulty of mapping arbitrary JSON 
> documents into a relational model. Drill has many ad-hoc, partial solutions, 
> but those do not provide complete, production-quality solutions.
> Solutions for full JSON schema mapping are likely beyond what Drill can (or 
> should) achieve. This ticket suggests we take a different, more realistic 
> approach and simply acknowledge that Parquet is the best format for Drill, 
> while providing minimal (but solid) JSON support.
> h4. Redefine Drill's Target Data Model
> Change the Drill web site to explain that Parquet is Drill's target data 
> model. Drill supports other formats to the degree that they mimic (a subset 
> of) Parquet.
> More specifically:
> * Drill is a relational, columnar engine.
> * Each Drill column must have a single, known data type.
> * Drill arrays cannot contain null values.
> * Drill supports maps (Parquet structs) and repeated maps.
> * Drill assumes that the file schema is the same across all files in a data 
> set.
> As it turns out, this is exactly the Parquet model.
> h4. Redefine Drill's JSON Support
> Given the above, redefine the JSON that Drill supports as that which follows 
> the Parquet model. Drill provides no external schema. Instead, the JSON must 
> be structured to provide a single, clear mapping from the JSON to Drill's 
> internal Parquet format, with no ambiguities:
> * Every file consists of a fixed set of objects.
> * Lists contain only scalars (without nulls) or objects.
> * Each name/value pair has a single, consistent type.
> * No null values. (For key/value pairs, omit the pair if the value is null.)
> * No empty files.
> Of particular concern are files with high "null density": many nulls without 
> declaring a type. Drill cannot effectively support such files.
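The rules above can be sketched as a small pre-flight check. This is only an illustration, not part of Drill: the function names (`json_type`, `check_value`, `find_problems`) are hypothetical, the input is assumed to be one JSON object per line, and treating all JSON numbers as a single type is a simplifying assumption.

```python
import json


def json_type(value):
    """Return a coarse type name for a parsed JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"  # assumption: integers and floats count as one type
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "list"
    return "object"


def check_value(path, value, seen, problems):
    """Record rule violations for one value; `seen` maps path -> first type."""
    kind = json_type(value)
    if kind == "null":
        problems.append(f"{path}: null value")
        return
    previous = seen.setdefault(path, kind)
    if previous != kind:
        problems.append(f"{path}: {previous} vs {kind}")
        return
    if kind == "object":
        for key, child in value.items():
            check_value(f"{path}.{key}", child, seen, problems)
    elif kind == "list":
        if path.endswith("[]"):
            problems.append(f"{path}: nested list")
            return
        for item in value:
            check_value(f"{path}[]", item, seen, problems)


def find_problems(lines):
    """Check an iterable of JSON-object lines; return a list of messages."""
    seen, problems, count = {}, [], 0
    for line in lines:
        if not line.strip():
            continue
        count += 1
        check_value("$", json.loads(line), seen, problems)
    if count == 0:
        problems.append("empty file")
    return problems


if __name__ == "__main__":
    sample = ['{"a": 1}', '{"a": "one", "b": null, "c": [1, null]}']
    for message in find_problems(sample):
        print(message)
```

A file that yields an empty list satisfies this sketch of the rules; anything else would be a candidate for the ETL path.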
> h4. External ETL for Non-Compliant JSON
> Rather than either a) invest in JSON mapping, or b) allow queries to fail, 
> Drill should encourage the use of external ETL tools to convert non-compliant 
> JSON into Parquet files. Since most JSON is ad-hoc, created by and for 
> specific applications, this means most JSON should pass through an ETL layer 
> into Parquet before being used with Drill.
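As a rough sketch of one step such an ETL layer might perform, the snippet below omits name/value pairs whose value is null, as the rules for compliant JSON suggest. The helper names (`drop_null_pairs`, `clean_lines`) are hypothetical, the input is assumed to be one JSON object per line, and writing the result to Parquet is left out.

```python
import json


def drop_null_pairs(value):
    """Recursively omit object members whose value is null.

    Nulls inside lists are left untouched: removing them would shift
    element positions, so that decision is left to the ETL author.
    """
    if isinstance(value, dict):
        return {
            key: drop_null_pairs(child)
            for key, child in value.items()
            if child is not None
        }
    if isinstance(value, list):
        return [drop_null_pairs(item) for item in value]
    return value


def clean_lines(lines):
    """Yield cleaned JSON text for each non-blank input line."""
    for line in lines:
        if line.strip():
            yield json.dumps(drop_null_pairs(json.loads(line)), sort_keys=True)


if __name__ == "__main__":
    raw = ['{"a": 1, "b": null, "c": {"d": null, "e": [1, null]}}']
    for cleaned in clean_lines(raw):
        print(cleaned)
```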
> h4. Simplify the JSON Reader
> The JSON reader today attempts to use many partial, ad-hoc fixes to work 
> around some JSON ambiguity. These hacks are hard to test and maintain, 
> requiring effort that would be better invested elsewhere. Once we adopt 
> Parquet as the reference format, and define the small, simpler form of JSON, 
> we can remove the hacks:
> * Drop support for unions. (Unions are poorly supported and very complex.)
> * Drop support for the {{ListVector}}. (It is, essentially, a list of 
> unions and does not even work.)
> * Drop support for multi-dimensional lists. (These do not have any 
> well-defined mapping to relational tables.)
> * Drop support for leading nulls that span batches. (That is, the type of 
> every value must be revealed within the first batch.)
> * Drop support for empty files. (Drill needs a schema internally. Drill 
> invents a fake schema today, but that just causes a schema change later. If 
> desired, simply ignore such files rather than failing the query.)
> h4. Implications for Drill 1.13 "Result Set Loader"
> Much work was done to try to extend the result set loader to handle JSON 
> ambiguities. The List Vector, Repeated List Vector and Union Vectors were all 
> implemented, leading to a vast increase in complexity. If we adopt the above, 
> this work can be backed out, resulting in a smaller, more efficient, 
> streamlined core. In short, remove the poorly-supported components used only 
> by JSON, keeping the types and mechanisms needed for Parquet (and Drill's 
> internal operators).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
