[
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305699#comment-16305699
]
Paul Rogers commented on DRILL-6035:
------------------------------------
h4. Solutions
Drill's JSON support is mature; it is used successfully by a wide range of
users. Presumably, the issues discussed above are only theoretical; practical
users do not encounter them. (If they did, we'd see many more open bugs in this
area.)
Further, though JSON started as Drill's reference data source, real-world
experience has shown that Parquet has assumed that role; JSON is now used only
lightly. Hence, there may be little interest or value in addressing the above
issues.
Still, it is worth presenting some options if it ever becomes necessary to
improve Drill's JSON support.
h4. Futures: Stats
Drill can address its inability to predict the future by converting the problem
into one of learning from the past. Specifically, Drill could correctly pick the
type of each column if it were to scan the entire collection of JSON files
before executing the first query, recording the inferred type of each column.
Drill already has (a prototype of) a statistics (stats) collection mechanism.
Drill could extend this mechanism to scan the full set of JSON files. For each
column, implement inference rules such as:
* Record the data types seen for each column. If a column is mostly nulls (or
missing), even a single non-null occurrence within terabytes of files is enough
to tell Drill the type.
* When a column has a conflicting scalar type (both string and number, say),
record the “most permissive” type. That is, do an automatic per-column “text
mode” for scalar conflicts. (Or, for numbers, simply pick FLOAT8 if even a
single value has a decimal point.)
* For numbers, use the largest value seen to infer the type (INT, BIGINT,
FLOAT8, or DECIMAL).
* If a column has both primitive and structured values (both a string and a
map, say), then mark the column as needing an “enhanced” text mode: read the
complex type as a JSON string.
The above simply extends existing schema inference beyond the first row of each
file to the entire collection of files. Clearly there is a cost: the scan must
cover the entire set of JSON files, and the stats must be kept up to date via
repeated scans.
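As a purely illustrative sketch, the per-column merge rule might look like the
Java fragment below. The type names and the TEXT fallback are invented for
illustration; this is not Drill's actual type system.
{noformat}
// Hypothetical sketch of the per-column merge rule used by the stats scan.
// Type names are invented for illustration; this is not Drill code.
public class ColumnTypeMerge {

  enum InferredType { NULL, BIGINT, FLOAT8, VARCHAR, MAP, TEXT }

  static InferredType merge(InferredType current, InferredType seen) {
    if (current == InferredType.NULL) {
      return seen;                         // a single non-null occurrence fixes the type
    }
    if (seen == InferredType.NULL || seen == current) {
      return current;
    }
    boolean curScalar = current != InferredType.MAP;
    boolean seenScalar = seen != InferredType.MAP;
    if (curScalar != seenScalar) {
      return InferredType.TEXT;            // scalar vs. map: "enhanced" text mode
    }
    if ((current == InferredType.BIGINT && seen == InferredType.FLOAT8) ||
        (current == InferredType.FLOAT8 && seen == InferredType.BIGINT)) {
      return InferredType.FLOAT8;          // a single decimal point widens the column
    }
    return InferredType.VARCHAR;           // other scalar conflicts: per-column "text mode"
  }
}
{noformat}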
The solution is not perfect, but may reduce problems substantially, say by
80-90%. This solution has the advantage of living up to Drill’s core promise of
being “schema-free”: the user does not provide a schema; instead, Drill infers a
schema from actual data.
h4. Futures: Schema Metadata
The stats solution is still not complete: it cannot address the issue of
intent (what does the JSON actually represent?). It can deal with the body of
JSON only at the syntax level.
The next level of solution allows the user to specify their intent. That is, to
specify rules for mapping their application-specific JSON encoding into the
relational model used by Drill.
Many of the difficulties in JSON come from ambiguities that limit Drill's
ability to map JSON into a fixed set of tabular columns. One straightforward
solution is to supply a schema (which may include mapping rules) that resolves
the ambiguities for each application.
Advantages of a defined schema:
* Code generation vs. interpretation — performance
* Avoid ambiguities — fewer failures
The metadata might include simple information such as:
* The declared type of each column. (Drill uses that type even if values are
null or missing, eliminating the need to guess.)
* Columns to be ignored in a `SELECT *` query. (Such as internal metadata
columns.)
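For illustration only, such metadata might live in a small schema file stored
alongside the data. The file layout and all names below are hypothetical;
"hideFromStar" lists internal columns to omit from a `SELECT *` query:
{noformat}
{
  "columns": [
    { "name": "cust_id", "type": "BIGINT" },
    { "name": "name",    "type": "VARCHAR" }
  ],
  "hideFromStar": [ "_meta", "_version" ]
}
{noformat}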
The metadata layer would be much more powerful if it also included mapping
rules for cases such as:
* Identify that an array represents a tuple, and provide column names and
types for each array element.
* Identify that a map structure is used to encode the name and type of a value,
along with rules to map that value to a Drill column.
An example of the first case above:
{noformat}
Metadata: Array as a row: [cust_id INT, name VARCHAR, …]
Data: [101, “fred”, …]
{noformat}
Example of the second case:
{noformat}
Mapping rule: for key {“$type”: type, “$value”: value},
create a column named <key> of type <type> with value <value>.
Data: cust_id: {“$type”: “INT”, “$value”: 101}
{noformat}
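A minimal sketch of applying the second rule, using Jackson (the JSON parser
Drill already uses). The class and the hard-coded key handling are illustrative
only, not Drill's reader code:
{noformat}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TypedValueRule {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // One row of application JSON using the {"$type", "$value"} encoding.
    JsonNode row = mapper.readTree(
        "{\"cust_id\": {\"$type\": \"INT\", \"$value\": 101}}");
    JsonNode wrapper = row.get("cust_id");
    String declaredType = wrapper.get("$type").asText();   // "INT"
    JsonNode value = wrapper.get("$value");                 // 101
    // A real reader would write into a typed value vector; here we just
    // show that the rule recovers a typed column from the wrapper object.
    System.out.println("cust_id " + declaredType + " = " + value.asInt());
  }
}
{noformat}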
The need for the mapping (and the required set of features) will be made
clearer by gathering many samples of JSON “in the wild” to learn how
application developers actually encode their data into JSON.
Possible sources of a schema:
* Hive
* Schema file stored with the data
* Metadata system as in Impala
* Custom mapping code that is a "data format" plugin on top of the JSON format
plugin
Drill uses views for this role today. But views introduce a chicken-and-egg
problem and cannot be the (whole) solution for JSON. Views are a layer within
Drill: views assume that the data has already been translated into Drill
vectors. Thus, views cannot be the means of specifying how that translation
should occur. (This is a subtle point. Drill claims that views are a data
translation solution, but clearly they cannot be the complete solution for
JSON.)
h4. Alternative: ETL to Parquet
Clearly, there are far more JSON formats in the wild than Drill can support
well, even with its many tricks and workarounds. It may be that the best
solution for production users is simply to convert JSON data into something
that Drill can understand: transform each application’s unique JSON format
into a common standard via an ETL step.
Since the data is being transformed, this is also an opportunity to gain
performance by storing the converted data in a more performant format such as
Parquet.
Although this choice conflicts with the message that Drill is schema-free and
uses JSON as its native format, the approach does have the advantage of
ensuring that actual queries work reliably in practice. Also, the ETL solution
is consistent with the idea that Parquet is Drill's preferred format. It also
acknowledges that many fine ETL tools exist (Spark, Python, commercial
solutions) and that Drill need not expand into the ETL space.
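As an indication of how small the ETL step can be, here is a minimal Spark job
(one of the tools just mentioned) that reads the application's JSON once and
rewrites it as Parquet; the paths and application name are placeholders:
{noformat}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("json-to-parquet")
        .getOrCreate();
    // Spark scans the JSON once and infers a single, consistent schema.
    Dataset<Row> json = spark.read().json("/data/app/json/");
    // Point Drill at this directory instead of at the raw JSON.
    json.write().parquet("/data/app/parquet/");
    spark.stop();
  }
}
{noformat}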
h4. Conclusion
The net-net conclusion from all of the above is:
* If JSON is used with Drill, it must be very simple and follow Drill's JSON
format rules as explained above.
* Otherwise, use a purpose-built ETL tool to convert JSON to Parquet and point
Drill at the Parquet file instead of JSON.
> Specify Drill's JSON behavior
> -----------------------------
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.13.0
> Reporter: Paul Rogers
> Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests
> that Drill may have limitations in the JSON that Drill supports. This ticket
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed
> specification that clarifies what Drill does and does not support (or what
> it should and should not support).