[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

Paul Rogers (JIRA) Thu, 28 Dec 2017 11:36:50 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305682#comment-16305682
 ]


Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:35 PM:
--------------------------------------------------------------

h4. Background

The above sections focus on the actual implementation of JSON in Drill. This 
section takes a step back to provide broader industry context.

h4. The JSON Data Model

[JSON is built on … universal data structures|https://www.json.org]. JSON is 
truly schema-free: it supports any arbitrary (DAG) data model. As shown on the 
[JSON site|https://www.json.org], JSON is based on three simple concepts: 
objects (lists of name/value pairs), lists, and scalar values.
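
As a minimal sketch of those three concepts (the {{classify}} helper and the sample document are hypothetical, made up for illustration), Python's standard {{json}} module parses every JSON value into a dict, a list, or a scalar, and nothing constrains how they nest:

```python
import json

def classify(value):
    """Name which of the three JSON concepts a parsed value belongs to."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "list"
    return "scalar"  # string, number, true/false, or null

# Hypothetical document: objects, lists, and scalars nest freely.
doc = json.loads(
    '{"name": "fred", "tags": ["a", 10, null], "address": {"city": "Berkeley"}}'
)
kinds = {key: classify(val) for key, val in doc.items()}
print(kinds)
```

Note that the {{tags}} list mixes a string, a number, and a null: the format itself places no type restriction on list members.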

JSON is a newer format, but the concept is classic. XML was based on a similar 
model (but was much more verbose and complex). XML itself was a rehash of an 
even older model, the [CODASYL|https://en.wikipedia.org/wiki/CODASYL] model 
that described hierarchical databases in the 1960s.

The power of a universal representation comes at a cost: queries are complex. 
CODASYL defined its own query model, as did XML with 
[XQuery|https://en.wikipedia.org/wiki/XQuery]. In browsers, the 
[DOM|https://en.wikipedia.org/wiki/Document_Object_Model] provides a 
programmatic API for navigating the element trees that define web pages. Each 
of these query systems is very complex, far too complex for everyday business 
use.

h4. The Relational Data Model

The relational model was invented by Edgar Codd as a [reaction to the 
complexity|http://history-computer.com/ModernComputer/Software/Codd.html] of 
the prior CODASYL model:

bq. Don Chamberlin, … coinventor of SQL, \[said of] Codd's ideas: "...since I'd 
been studying CODASYL (the language used to query navigational databases), I 
could imagine how those queries would have been represented in CODASYL by 
programs that were five pages long, that would navigate through this labyrinth 
of pointers and stuff. Codd would sort of write them down as one-liners. ... 
They weren't complicated at all. I said, 'Wow.' This was kind of a conversion 
experience for me. I understood what the relational thing was about after that."

h4. The Drill Data Model

Drill is a relational engine based on the SQL language (and hence the 
relational data model). Drill uses a columnar representation to realize the 
[basic relational 
contracts|https://en.wikibooks.org/wiki/Relational_Database_Design/Basic_Concepts]:

* Domain: a set of values, often expressed as a specific data type
* Column: a (name, domain) “attribute that describe an entity in the database 
model.”
* Row: a complete set of columns
* Table: a collection of rows

The key point is that relational theory is based on relations (tables) that 
consist of a set of rows, all of which have the same set of columns (the same 
schema). In fact, the restriction to a consistent schema is what allows Drill 
to store data as columns rather than rows. (The columnar model would make 
little sense for, say, an HTML document in which each element has a different 
set of attributes.)
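
To illustrate why a consistent schema matters for columnar storage, here is a small sketch. It is not Drill's actual vector implementation; the {{to_columns}} function and the sample rows are assumptions made up for this example. Pivoting rows into columns is only well-defined when every row has the same set of columns:

```python
def to_columns(rows):
    """Pivot rows (dicts) into one list per column.

    Raises ValueError if any row's column set differs from the first row's,
    because the pivot is only well-defined when all rows share one schema.
    """
    if not rows:
        return {}
    schema = list(rows[0].keys())
    columns = {name: [] for name in schema}
    for i, row in enumerate(rows):
        if set(row.keys()) != set(schema):
            raise ValueError(f"row {i} does not match schema {schema}")
        for name in schema:
            columns[name].append(row[name])
    return columns

# Hypothetical table: every row has exactly the columns (id, name).
rows = [
    {"id": 1, "name": "fred"},
    {"id": 2, "name": "wilma"},
]
columns = to_columns(rows)
print(columns)
```

A row with an extra or missing key is rejected here. A real engine must instead choose some policy (nulls, schema change, error), and that choice is exactly where the specification questions begin.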

The relational model is decidedly *not* universal: it is a highly restricted 
data model chosen because of the expressive power that relational theory 
provides when data is restricted to a tabular presentation.

The complexity of Drill's JSON implementation arises from the idea that Drill 
can automatically map from JSON's universal format to the relational tabular 
format. That is an assumption that turns out to be naive.
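
A rough sketch of why the automatic mapping is hard follows. It says nothing about how Drill itself behaves; the {{infer_columns}} and {{conflicts}} helpers and the three sample records are hypothetical. Perfectly valid JSON records can disagree on a field's type or omit the field entirely, so no single (name, domain) column describes them:

```python
import json

def infer_columns(json_lines):
    """Collect, per field name, the set of Python type names seen.

    A field is recorded as "missing" for any record that omits it.
    """
    records = [json.loads(line) for line in json_lines]
    names = []
    for rec in records:
        for name in rec:
            if name not in names:
                names.append(name)
    seen = {name: set() for name in names}
    for rec in records:
        for name in names:
            if name in rec:
                seen[name].add(type(rec[name]).__name__)
            else:
                seen[name].add("missing")
    return seen

def conflicts(seen):
    """Fields that do not reduce to exactly one type across all records."""
    return sorted(name for name, kinds in seen.items() if len(kinds) != 1)

# Hypothetical input: "amount" is a number, then a string, then absent.
lines = [
    '{"id": 1, "amount": 10}',
    '{"id": 2, "amount": "10.50"}',
    '{"id": 3}',
]
seen = infer_columns(lines)
print(conflicts(seen))
```

Here {{id}} maps cleanly to one column, while {{amount}} does not: a relational engine has to pick a domain for it, and any choice is a policy decision rather than something the JSON dictates.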



> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specification that clarifies what Drill does and does not support (or what 
> it should and should not support).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
