[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305699#comment-16305699 ]
Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:57 PM:
--------------------------------------------------------------

h4. Solutions

Drill's JSON support is mature; it is used successfully by a wide range of users. Presumably, the issues discussed above are only theoretical; practical users do not encounter them. (If they did, we'd see many more open bugs in this area.) Further, though JSON started as Drill's reference data source, real-world experience has shown that Parquet has assumed that role; JSON is now used only lightly. Hence, there may be little interest or value in addressing the above issues. Still, it is worth presenting some options in case it ever becomes necessary to improve Drill's JSON support.

h4. Futures: Stats

Drill can address its inability to predict the future by converting the problem into one of learning from the past. Specifically, Drill could correctly pick the type of each column if it were to scan the entire collection of JSON files before executing the first query, recording the inferred type of each column. Drill already has (a prototype of) a statistics (stats) collection mechanism. Drill could extend this mechanism to scan the full set of JSON files and, for each column, apply inference rules such as:

* Record the data types seen for each column. If a column is mostly nulls (or missing), even a single non-null occurrence, even within terabytes of files, would be enough to tell Drill the type.
* When a column has conflicting scalar types (both string and number, say), record the “most permissive” type. That is, apply an automatic per-column “text mode” for scalar conflicts. (Or, for numbers, simply pick FLOAT8 if even a single value has a decimal point.)
* For numbers, use the value of the largest number to infer the type (INT, BIGINT, FLOAT8 or DECIMAL).
* If a column has both primitive and structured values (both a string and a map, say), mark the column as needing an “enhanced” text mode: read the complex type as a JSON string.
The above simply extends existing schema inference beyond the first row of each file to the entire collection of files. Clearly there is a cost (the need to scan the entire set of JSON files, and to keep the stats up to date via repeated scans). The solution is not perfect, but may reduce problems by some percentage, say 80-90%. This solution has the advantage of living up to Drill's core promise of being “schema-free”: the user does not provide a schema; instead, Drill infers a schema from actual data.

h4. Futures: Schema Metadata

The stats solution is still not complete: it cannot address the issue of intent: what does the JSON actually represent? It can only deal with the entire body of JSON at the syntax level. The next level of solution allows the user to specify their intent: that is, to specify rules for mapping their application-specific JSON encoding into the relational model used by Drill. Many of the difficulties in JSON come from ambiguities that limit Drill's ability to map JSON into a fixed set of tabular columns. One straightforward solution is to supply a schema (which may include mapping rules) that resolves the ambiguities for each application. Advantages of a defined schema:

* Code generation instead of interpretation: better performance
* Fewer ambiguities: fewer failures

The metadata might include simple information such as:

* The declared type of each column. (Drill uses that type even if values are null or missing, eliminating the need to guess.)
* Columns to be ignored in a `SELECT *` query. (Such as internal metadata columns.)

The metadata layer would be much more powerful if it also included mapping rules for cases such as:

* Identify that an array represents a tuple, and provide column names and types for each array element.
* Identify that a map structure is used to encode the name and type of a value, along with rules to map that value to a Drill column.
An example of the first case above:

{noformat}
Metadata: Array as a row: [cust_id INT, name VARCHAR, ...]
Data:     [101, "fred", ...]
{noformat}

An example of the second case:

{noformat}
Mapping rule: for key {"$type": type, "$value": value},
create a column named <key> of type <type> with value <value>.
Data: cust_id: {"$type": "INT", "$value": 101}
{noformat}

The need for the mapping (and the required set of features) would be made clearer by gathering many samples of JSON “in the wild” to learn how application developers actually encode their data into JSON. Possible sources of a schema:

* Hive
* A schema file stored with the data
* A metadata system, as in Impala
* Custom mapping code in the form of a "data format" plugin on top of the JSON format plugin

Drill uses views for this role today. But views introduce a chicken-and-egg problem and cannot be the (whole) solution for JSON. Views are a layer within Drill: views assume that the data has already been translated into Drill vectors. Thus, views cannot be the mechanism for specifying how that translation should occur. (This is a subtle point. Drill claims that views are a data translation solution, but clearly they cannot be the complete solution for JSON.)

h4. Alternative: ETL to Parquet

Clearly, there are far more JSON formats in the wild than Drill can support well, even with its many tricks and work-arounds. It may be that the best solution for production users is simply to convert JSON data into something that Drill can understand: perform a conversion of each application's unique JSON format into a common standard via an ETL transform. Since the data is being transformed anyway, this is also an opportunity to gain performance by storing the converted data in a more performant format such as Parquet. Although this choice conflicts with the message that Drill is schema-free and uses JSON as its native format, the approach does have the advantage of ensuring that actual queries work reliably in practice.
Also, the ETL solution is consistent with the idea that Parquet is Drill's preferred format. It also acknowledges that many fine ETL tools exist (Spark, Python, commercial solutions) and that Drill need not expand into the ETL space.
h4. Conclusion

The net-net conclusion from all of the above is:

* If JSON is used with Drill, it must be very simple and must follow Drill's JSON format rules as explained above.
* Otherwise, use a purpose-built ETL tool to convert JSON to Parquet and point Drill at the Parquet files instead of the JSON.

> Specify Drill's JSON behavior
> -----------------------------
>
>            Key: DRILL-6035
>            URL: https://issues.apache.org/jira/browse/DRILL-6035
>        Project: Apache Drill
>     Issue Type: Improvement
> Affects Versions: 1.13.0
>       Reporter: Paul Rogers
>       Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests
> that Drill may have limitations in the JSON that Drill supports. This ticket
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed
> specification that clarifies what Drill does and does not support (or what
> it should and should not support).
> As noted below, the "net-net" conclusion for users is to use an ETL tool to
> convert JSON to Parquet, then allow Drill to query the Parquet.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)