[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305682#comment-16305682 ]
Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:35 PM: -------------------------------------------------------------- h4. Background The above sections focus on the actual implementation of JSON in Drill. This section takes a step back to provide broader industry context. h4. The JSON Data Model [JSON is built on … universal data structures|https://www.json.org]. JSON is truly schema-free: it supports any arbitrary (DAG) data model. As shown on the [JSON site|https://www.json.org]. JSON is based on three simple concepts: objects (lists of name/value pairs), lists, and scalar values. JSON is a newer format, but the concept is classic. XML was based on a similar model (but was much more verbose and complex.) XML itself was a rehash of an even older model, the [CODASYL|https://en.wikipedia.org/wiki/CODASYL] model that described hierarchical databases in the 1960s. The power of a universal representation comes at a cost: queries are complex. CODASYL defined a model, as did XML with [XQuery|https://en.wikipedia.org/wiki/XQuery]. In browsers, the [DOM|https://en.wikipedia.org/wiki/Document_Object_Model] provides a programmatic query language for XML trees that define web pages. Each of these query systems is very complex; far too complex for everyday business use. h4. The Relational Data Model The relational model was invented by Edgar Codd as a [reaction to the complexity|http://history-computer.com/ModernComputer/Software/Codd.html] of the prior CODASYL model: bq. Don Chamberlin, … coinventor of SQL, \[said of] Codd's ideas: "...since I'd been studying CODASYL (the language used to query navigational databases), I could imagine how those queries would have been represented in CODASYL by programs that were five pages long, that would navigate through this labyrinth of pointers and stuff. Codd would sort of write them down as one-liners. ... They weren't complicated at all. I said, 'Wow.' This was kind of a conversion experience for me. I understood what the relational thing was about after that." h4. The Drill Data Model Drill is a relational engine based on the SQL language (and hence the relational data model). Drill uses a columnar representation to realize the [basic relational contracts|https://en.wikibooks.org/wiki/Relational_Database_Design/Basic_Concepts]: * Domain (set of values, often expressed as a specific data type) * Column: a (name, domain) “attribute that describe an entity in the database model.” * Row: a complete set of columns * Table: a collection of rows The key point is that relational theory is based on relations (tables) that consist of a set of rows, all of which have the same set of columns (the same schema). In fact, the restriction to a consistent schema is what allows Drill to store data as columns rather than rows. (The columnar model would make little sense for, say, an HTML document in which each element has a different set of attributes.) The relational model is decidedly *not* universal: it is a highly restricted data model chosen because of the expressive power that relational theory provides when data is restricted to a tabular presentation. The complexity of Drill's JSON implementation arises from the idea that Drill can automatically map from JSON's universal format to the relational tabular format. That is an assumption that turns out to be naive. was (Author: paul.rogers): h4. Background The above sections focus on the actual implementation of JSON in Drill. This section takes a step back to provide broader industry context. h4. The JSON Data Model [JSON is built on … universal data structures|https://www.json.org]. JSON is truly schema-free: it supports any arbitrary (DAG) data model. As shown on the [JSON site|https://www.json.org]. JSON is based on three simple concepts: objects (lists of name/value pairs), lists, and scalar values. JSON is a newer format, but the concept is classic. XML was based on a similar model (but was much more verbose and complex.) XML itself was a rehash of an even older model, the [CODASYL|https://en.wikipedia.org/wiki/CODASYL] model that described hierarchical databases in the 1960s. The power of a universal representation comes at a cost: queries are complex. CODASYL defined a model, as did XML with [XQuery|https://en.wikipedia.org/wiki/XQuery]. In browsers, the [DOM|https://en.wikipedia.org/wiki/Document_Object_Model] provides a programmatic query language for XML trees that define web pages. Each of these query systems is very complex; far too complex for everyday business use. h4. The Relational Data Model The relational model was invented by Edgar Codd as a [reaction to the complexity|http://history-computer.com/ModernComputer/Software/Codd.html] of the prior CODASYL model: bq. Don Chamberlin, … coinventor of SQL, \[said of] Codd's ideas: "...since I'd been studying CODASYL (the language used to query navigational databases), I could imagine how those queries would have been represented in CODASYL by programs that were five pages long, that would navigate through this labyrinth of pointers and stuff. Codd would sort of write them down as one-liners. ... They weren't complicated at all. I said, 'Wow.' This was kind of a conversion experience for me. I understood what the relational thing was about after that." Drill is a relational engine based on the SQL language (and hence the relational data model). Drill uses a columnar representation to realize the [basic relational contracts|https://en.wikibooks.org/wiki/Relational_Database_Design/Basic_Concepts]: * Domain (set of values, often expressed as a specific data type) * Column: a (name, domain) “attribute that describe an entity in the database model.” * Row: a complete set of columns * Table: a collection of rows The key point is that relational theory is based on relations (tables) that consist of a set of rows, all of which have the same set of columns (the same schema). In fact, the restriction to a consistent schema is what allows Drill to store data as columns rather than rows. (The columnar model would make little sense for, say, an HTML document in which each element has a different set of attributes.) The relational model is decidedly *not* universal: it is a highly restricted data model chosen because of the expressive power that relational theory provides when data is restricted to a tabular presentation. The complexity of Drill's JSON implementation arises from the idea that Drill can automatically map from JSON's universal format to the relational tabular format. That is an assumption that turns out to be naive. > Specify Drill's JSON behavior > ----------------------------- > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement > Affects Versions: 1.13.0 > Reporter: Paul Rogers > Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)