[
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305685#comment-16305685
]
Paul Rogers commented on DRILL-6035:
------------------------------------
h4. JSON as Drill’s Reference Data Model
As [described in the video on the Apache Drill home
page|http://drill.apache.org], Drill takes JSON as its primary data model since
it is a superset of the relational model, Parquet, AVRO and other input formats.
Drill is a “schema-free” query engine because it is based on this schema-free
data model. Yet Drill is also based on the relational model, which very much
requires a schema.
The challenge, then, is how Drill represents a universal, non-relational data
model within a relational implementation. This is not a trivial question. In
fact, there is no good answer. (Many projects faced the same issue with XML;
few invented good solutions.)
At present, the concept of using JSON as the reference data model for a
relational engine is more of an aspiration than a working reality. Drill has no
specification of the theory (or rules or implementation) by which it maps
JSON to relations (that is, to value vectors). Instead, each data source
works out an implementation as best it can. This leaves the holes that we
explore here.
The fundamental problem is that JSON is universal: all structures are legal.
Relational theory is based on tables (or, with extensions, on a set of nested
tables). [SQL++|https://arxiv.org/abs/1405.3631] is one attempt to extend SQL
to “semi-structured” data:
bq. The SQL++ semi-structured data model is a superset of both JSON and the SQL
data model. SQL++ offers powerful computational capabilities for processing
semi-structured data akin to prior non-relational query languages, notably OQL
and XQuery.
Our goal here is not to debate the merits of one system vs. another. Rather, we
simply wish to note that the JSON data model is a superset of the SQL data model and that
importing JSON into Drill is therefore not a trivial exercise.
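To make the gap between "all structures are legal" and "tables" concrete, here is a minimal Python sketch. It is not Drill code and does not reflect how any Drill reader works. The sample records, the helper names {{infer_columns}} and {{is_relational}}, and the rule "exactly one scalar type per column" are all assumptions made for illustration. A missing key is treated here as a SQL NULL and is therefore allowed.

```python
import json

# Hypothetical sample input: one JSON value per line. Not taken from Drill.
RECORDS = [
    '{"a": 1, "b": "x"}',
    '{"a": 2}',                        # key "b" is missing
    '{"a": "three", "b": "y"}',        # key "a" changes type
    '{"a": 4, "b": {"c": [1, "z"]}}',  # nested map holding a mixed-type list
    '[1, 2, 3]',                       # legal JSON, but not a record
]

# Python type names of the JSON scalars accepted as column types (assumption).
SCALAR_TYPES = {"int", "float", "str", "bool"}


def infer_columns(lines):
    """Return (columns, non_records).

    columns maps each top-level key to the set of Python type names seen
    for it; non_records counts values that are not JSON objects.
    """
    columns = {}
    non_records = 0
    for line in lines:
        value = json.loads(line)
        if not isinstance(value, dict):
            non_records += 1
            continue
        for key, item in value.items():
            columns.setdefault(key, set()).add(type(item).__name__)
    return columns, non_records


def is_relational(lines):
    """True if every value is a record and every key has one scalar type."""
    columns, non_records = infer_columns(lines)
    return non_records == 0 and all(
        len(types) == 1 and types <= SCALAR_TYPES
        for types in columns.values()
    )


if __name__ == "__main__":
    print(infer_columns(RECORDS))
    print(is_relational(RECORDS[:2]))  # first two records only
    print(is_relational(RECORDS))      # all five values
```

Under these assumptions the first two records form a table, while the full set does not: every line is valid JSON, yet only a subset maps cleanly onto typed columns.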
> Specify Drill's JSON behavior
> -----------------------------
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.13.0
> Reporter: Paul Rogers
> Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests
> that Drill may have limitations in the JSON that Drill supports. This ticket
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed
> specification that clarifies what Drill does and does not support (or what
> it should and should not support).