[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305689#comment-16305689
 ] 

Paul Rogers commented on DRILL-6035:
------------------------------------

h4. JSON Translation in Drill

The goal of the JSON reader is to load JSON data into Drill vectors. The 
challenge, as shown above, is that the two data models are different. The 
problems are two-fold:

* There exist many perfectly valid ways to translate relational data into JSON.
* There exist no universal ways to translate arbitrary JSON into a relational 
model.

That is, JSON can represent any relational table, and do so in a variety of 
ways. There is a one (table) to many (JSON formats) relationship. Further, 
there are many JSON structures that do not correspond to tables.

h4. Lack of JSON Support in JDBC and ODBC

The issue is further complicated by the fact that ODBC (Tableau) and JDBC are 
Drill’s primary client interfaces. These formats do not readily support 
non-tabular data. Thus, not only just Drill successfully consume JSON DAGs, 
Drill must then convert these structures into simple tables to be consumed by 
BI tools. (That is, Drill is not a document-oriented query engine, it is a 
classic tabular query engine.)

h4. Many-to-Many Mappings from JSON to Tables

The challenge of the JSON reader is to convert arbitrary JSON into a relational 
model, and to do so with no (or very little) information beyond a list of 
projected columns and the JSON file itself.

As we will see, the problem is fundamentally not solvable: Drill has too little 
information to correctly map from the many possible (and often conflicting) 
JSON formats into the proper relational format.

Consider a trivial example.  The following is a perfectly legal representation 
of a Customer in JSON:

{code}
[101, “Fred”, “Bedrock”, 123.45, “10-11-12”]
{code}

Applications sometimes use the above format to conserve space. It works because 
the writer and reader agree on the meaning of each array entry.

How is Drill to interpret the above. More to the point, how is Drill to 
interpret the above *without a schema*? Said another way, the above format 
works because the writer and reader agree on a format, but Drill is designed to 
work without that information. Clearly, without a schema, it is impossible to 
understand that the above is a terse representation of a row.

Without a schema, all Drill knows is that the above is a heterogeneous array. 
How is Drill to know that this array has a one-to-one correspondence to columns 
in a customer record vs. say, an arbitrary array? It can’t. All Drill can do is 
ask the user to enable “all-text mode”, read the array as text, then use SQL to 
project the array entries correctly:

{code}
SELECT CAST(cust[0] AS INTEGER) AS cust_id,
       cust[1] AS cust_name, cust[2] AS city,
       CAST(cust[3] AS FLOAT8) AS balance,
       TO_DATE(cust[4], ‘yy-mm-dd’) as start_date FROM …
{code}

Although this example is contrived, “JSON in the wild” has a wide variety of 
formats since JSON is a universal format and places no constraints on an 
application’s data model.

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to