[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305699#comment-16305699 ]
Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:57 PM:
--------------------------------------------------------------

h4. Solutions

Drill's JSON support is mature; it is used successfully by a wide range of users. Presumably, the issues discussed above are only theoretical; practical users do not encounter them. (If they did, we'd see many more open bugs in this area.) Further, though JSON started as Drill's reference data source, real-world experience has shown that Parquet has assumed that role; JSON is now used only lightly. Hence, there may be little interest or value in addressing the above issues. Still, it is worth presenting some options in case it ever becomes necessary to improve Drill's JSON support.

h4. Futures: Stats

Drill can address its inability to predict the future by converting the problem into one of learning from the past. Specifically, Drill could correctly pick the type of each column if it were to scan the entire collection of JSON files before executing the first query, recording the inferred type of each column. Drill already has (a prototype of) a statistics (stats) collection mechanism. Drill could extend this mechanism to scan the full set of JSON files and, for each column, apply inference rules such as:

* Record the data types seen for each column. If a column is mostly nulls (or missing), even a single non-null occurrence, even within terabytes of files, would be enough to tell Drill the type.
* When a column has conflicting scalar types (both string and number, say), record the “most permissive” type. That is, apply an automatic per-column “text mode” for scalar conflicts. (Or, for numbers, simply pick FLOAT8 if even a single value has a decimal point.)
* For numbers, use the value of the largest number to infer the type (INT, BIGINT, FLOAT8 or DECIMAL).
* If a column has both primitive and structured values (both a string and a map, say), mark the column as needing an “enhanced” text mode: read the complex type as a JSON string.
The above simply extends existing schema inference beyond the first row of each file to the entire collection of files. Clearly there is a cost (the need to scan the entire set of JSON files, and to keep the stats up to date via repeated scans). The solution is not perfect, but may reduce problems by some percentage, say 80-90%. This solution has the advantage of living up to Drill's core promise of being “schema-free”: the user does not provide a schema; instead, Drill infers a schema from actual data.

h4. Futures: Schema Metadata

The stats solution is still not complete: it cannot address the issue of intent: what does the JSON actually represent? It can only deal with the entire body of JSON at the syntax level. The next level of solution allows the user to specify their intent: that is, to specify rules for mapping their application-specific JSON encoding into the relational model used by Drill. Many of the difficulties in JSON come from ambiguities that limit Drill's ability to map JSON into a fixed set of tabular columns. One straightforward solution is to supply a schema (which may include mapping rules) that resolves the ambiguities for each application. Advantages of a defined schema:

* Code generation instead of interpretation: better performance
* Fewer ambiguities: fewer failures

The metadata might include simple information such as:

* The declared type of each column. (Drill uses that type even if values are null or missing, eliminating the need to guess.)
* Columns to be ignored in a `SELECT *` query. (Such as internal metadata columns.)

The metadata layer would be much more powerful if it also included mapping rules for cases such as:

* Identify that an array represents a tuple, and provide column names and types for each array element.
* Identify that a map structure is used to encode the name and type of a value, along with rules to map that value to a Drill column.
An example of the first case above:

{noformat}
Metadata: Array as a row: [cust_id INT, name VARCHAR, ...]
Data:     [101, "fred", ...]
{noformat}

An example of the second case:

{noformat}
Mapping rule: for key {"$type": type, "$value": value},
create a column named <key> of type <type> with value <value>.
Data: cust_id: {"$type": "INT", "$value": 101}
{noformat}

The need for the mapping (and the required set of features) would be made clearer by gathering many samples of JSON “in the wild” to learn how application developers actually encode their data into JSON. Possible sources of a schema:

* Hive
* A schema file stored with the data
* A metadata system, as in Impala
* Custom mapping code in the form of a "data format" plugin on top of the JSON format plugin

Drill uses views for this role today. But views introduce a chicken-and-egg problem and cannot be the (whole) solution for JSON. Views are a layer within Drill: views assume that the data has already been translated into Drill vectors. Thus, views cannot be the mechanism for specifying how that translation should occur. (This is a subtle point. Drill claims that views are a data translation solution, but clearly they cannot be the complete solution for JSON.)

h4. Alternative: ETL to Parquet

Clearly, there are far more JSON formats in the wild than Drill can support well, even with its many tricks and work-arounds. It may be that the best solution for production users is simply to convert JSON data into something that Drill can understand: perform a conversion of each application's unique JSON format into a common standard via an ETL transform. Since the data is being transformed anyway, this is also an opportunity to gain performance by storing the converted data in a more performant format such as Parquet. Although this choice conflicts with the message that Drill is schema-free and uses JSON as its native format, the approach does have the advantage of ensuring that actual queries work reliably in practice.
Also, the ETL solution is consistent with the idea that Parquet is Drill's preferred format. It also acknowledges that many fine ETL tools exist (Spark, Python, commercial solutions) and that Drill need not expand into the ETL space.
h4. Conclusion

The net-net conclusion from all of the above is:

* If JSON is used with Drill, it must be very simple and must follow Drill's JSON format rules as explained above.
* Otherwise, use a purpose-built ETL tool to convert JSON to Parquet and point Drill at the Parquet files instead of the JSON.

> Specify Drill's JSON behavior
> -----------------------------
>
>            Key: DRILL-6035
>            URL: https://issues.apache.org/jira/browse/DRILL-6035
>        Project: Apache Drill
>     Issue Type: Improvement
> Affects Versions: 1.13.0
>       Reporter: Paul Rogers
>       Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests
> that Drill may have limitations in the JSON that Drill supports. This ticket
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed
> specification that clarifies what Drill does and does not support (or what
> it should and should not support).
> As noted below, the "net-net" conclusion for users is to use an ETL tool to
> convert JSON to Parquet, then allow Drill to query the Parquet.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)