[jira] [Commented] (DRILL-8204) Allow Provided Schema for HTTP Plugin in JSON Mode

ASF GitHub Bot (Jira) Mon, 02 May 2022 17:31:05 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530989#comment-17530989
 ]


ASF GitHub Bot commented on DRILL-8204:
---------------------------------------

cgivre commented on code in PR #2526:
URL: https://github.com/apache/drill/pull/2526#discussion_r863277815


##########
contrib/storage-http/JSON_Options.md:
##########
@@ -0,0 +1,125 @@
+# JSON Options and Configuration 
+
+Drill has a collection of JSON configuration options to allow you to configure 
how Drill interprets JSON files.  These are set at the global level, however 
the HTTP plugin
+allows you to configure these options individually per connection and override 
the Drill defaults.  The options are:
+
+* `allowNanInf`:  Configures the connection to interpret `NaN` and `Inf` values
+* `allTextMode`:  By default, Drill attempts to infer data types from JSON 
data. If the data is malformed, Drill may throw schema change exceptions. If 
your data is
+  inconsistent, you can enable `allTextMode` which when true, Drill will read 
all JSON values as strings, rather than try to infer the data type.
+* `readNumbersAsDouble`:  By default Drill will attempt to interpret integers, 
floating point number types and strings.  One challenge is when data is 
consistent, Drill may
+  throw schema change exceptions. In addition to `allTextMode`, you can make 
Drill less sensitive by setting the `readNumbersAsDouble` to `true` which 
causes Drill to read all
+  numeric fields in JSON data as `double` data type rather than trying to 
distinguish between ints and doubles.
+* `enableEscapeAnyChar`:  Allows a user to escape any character with a \
+* `skipMalformedRecords`:  Allows Drill to skip malformed records and recover 
without throwing exceptions.
+* `skipMalformedDocument`:  Allows Drill to skip entire malformed documents 
without throwing errors.
+
+All of these can be set by adding the `jsonOptions` to your connection 
configuration as shown below:
+
+```json
+
+"jsonOptions": {
+  "allTextMode": true, 
+  "readNumbersAsDouble": true
+}
+
+```
+
+## Schema Provisioning
+One of the challenges of querying APIs is inconsistent data.  Drill allows you 
to provide a schema for individual endpoints.  You can do this in one of three 
ways: 
+
+1. By providing a schema inline [See: Specifying Schema as Table Function 
Parameter](https://drill.apache.org/docs/plugin-configuration-basics/#specifying-the-schema-as-table-function-parameter)
+2. By providing a schema in the configuration for the endpoint.
+3. By providing a serialized TupleMetadata of the desired schema.  This is an 
advanced functionality and should only be used by advanced Drill users.
+
+The schema provisioning currently supports complex types of Arrays and Maps at 
any nesting level.
+
+### Example Schema Provisioning:
+```json
+"jsonOptions": {
+  "providedSchema": [

Review Comment:
   One more thing to mention.  The way I implemented it only supports the data 
types that are supported by the JSON reader. 





> Allow Provided Schema for HTTP Plugin in JSON Mode
> --------------------------------------------------
>
>                 Key: DRILL-8204
>                 URL: https://issues.apache.org/jira/browse/DRILL-8204
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Other
>    Affects Versions: 1.20.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 2.0.0
>
>
> One of the challenges of querying APIs is inconsistent data. Drill allows you 
> to provide a schema for individual endpoints. You can do this in one of two 
> ways: either by 
> providing a serialized TupleMetadata of the desired schema. This is an 
> advanced functionality and should only be used by advanced Drill users.
> The schema provisioning currently supports complex types of Arrays and Maps 
> at any nesting level.
> ### Example Schema Provisioning:
> ```json
> "jsonOptions": {
> "providedSchema": [
> {
> "fieldName": "int_field",
> "fieldType": "bigint"
> }, {
> "fieldName": "jsonField",
> "fieldType": "varchar",
> "properties": {
> "drill.json-mode":"json"
> }
> },{
> // Array field
> "fieldName": "stringField",
> "fieldType": "varchar",
> "isArray": true
> }, {
> // Map field
> "fieldName": "mapField",
> "fieldType": "map",
> "fields": [
> {
> "fieldName": "nestedField",
> "fieldType": "int"
> },{
> "fieldName": "nestedField2",
> "fieldType": "varchar"
> }
> ]
> }
> ]
> }
> ```
> ### Example Provisioning the Schema with a JSON String
> ```json
> "jsonOptions": {
> "jsonSchema": 
> "\{\"type\":\"tuple_schema\",\"columns\":[{\"name\":\"outer_map\",\"type\":\"STRUCT<`int_field`
>  BIGINT, `int_array` ARRAY<BIGINT>>\",\"mode\":\"REQUIRED\"}]}"
> }
> ```
> You can print out a JSON string of a schema with the Java code below. 
> ```java
> TupleMetadata schema = new SchemaBuilder()
> .addNullable("a", MinorType.BIGINT)
> .addNullable("m", MinorType.VARCHAR)
> .build();
> ColumnMetadata m = schema.metadata("m");
> m.setProperty(JsonLoader.JSON_MODE, JsonLoader.JSON_LITERAL_MODE);
> System.out.println(schema.jsonString());
> ```
> This will generate something like the JSON string below:
> ```json
> {
> "type":"tuple_schema",
> "columns":[
> {"name":"a","type":"BIGINT","mode":"OPTIONAL"},
> {"name":"m","type":"VARCHAR","mode":"OPTIONAL","properties":\{"drill.json-mode":"json"}
> }
> ]
> }
> ```
> ## Dealing With Inconsistent Schemas
> One of the major challenges of interacting with JSON data is when the schema 
> is inconsistent. Drill has a `UNION` data type which is marked as 
> experimental. At the time of
> writing, the HTTP plugin does not support the `UNION`, however supplying a 
> schema can solve a lot of those issues.
> ### Json Mode
> Drill offers the option of reading all JSON values as a string. While this 
> can complicate downstream analytics, it can also be a more memory-efficient 
> way of reading data with 
> inconsistent schema. Unfortunately, at the time of writing, JSON-mode is only 
> available with a provided schema. However, future work will allow this mode 
> to be enabled for 
> any JSON data.
> #### Enabling JSON Mode:
> You can enable JSON mode simply by adding the `drill.json-mode` property with 
> a value of `json` to a field, as shown below:
> ```json
> {
> "fieldName": "jsonField",
> "fieldType": "varchar",
> "properties": {
> "drill.json-mode": "json"
> }
> }
> ```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (DRILL-8204) Allow Provided Schema for HTTP Plugin in JSON Mode

Reply via email to