Charles Givre created DRILL-8204:
------------------------------------

             Summary: Allow Provided Schema for HTTP Plugin in JSON Mode
                 Key: DRILL-8204
                 URL: https://issues.apache.org/jira/browse/DRILL-8204
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - Other
    Affects Versions: 1.20.0
            Reporter: Charles Givre
            Assignee: Charles Givre
             Fix For: 2.0.0


One of the challenges of querying APIs is inconsistent data. Drill allows you 
to provide a schema for individual endpoints. You can do this in one of two 
ways: either by listing the fields in a `providedSchema` array, or by 
providing a serialized `TupleMetadata` of the desired schema in `jsonSchema`. 
This is advanced functionality and should only be used by advanced Drill users.

The schema provisioning currently supports complex types of Arrays and Maps at 
any nesting level.

### Example Schema Provisioning:
```json
"jsonOptions": {
  "providedSchema": [
    {
      "fieldName": "int_field",
      "fieldType": "bigint"
    }, {
      "fieldName": "jsonField",
      "fieldType": "varchar",
      "properties": {
        "drill.json-mode": "json"
      }
    }, {
      // Array field
      "fieldName": "stringField",
      "fieldType": "varchar",
      "isArray": true
    }, {
      // Map field
      "fieldName": "mapField",
      "fieldType": "map",
      "fields": [
        {
          "fieldName": "nestedField",
          "fieldType": "int"
        }, {
          "fieldName": "nestedField2",
          "fieldType": "varchar"
        }
      ]
    }
  ]
}
```
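
If the same field list is needed for many endpoints, it may be easier to generate it than to write it by hand. The following is only a sketch in plain Java: `FieldDef` is a hypothetical helper, not a Drill class, and it assumes field names and types contain no characters that need JSON escaping. It shows how nesting falls out of a simple recursive structure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for generating providedSchema entries; not part of Drill.
class FieldDef {
  final String name;
  final String type;
  boolean isArray;
  final List<FieldDef> children = new ArrayList<>();

  FieldDef(String name, String type) {
    this.name = name;
    this.type = type;
  }

  // Marks the field as an array ("isArray": true).
  FieldDef array() {
    isArray = true;
    return this;
  }

  // Adds a nested field; only meaningful for fields of type "map".
  FieldDef child(FieldDef f) {
    children.add(f);
    return this;
  }

  // Renders this field as a JSON object, recursing into nested fields.
  String toJson() {
    StringBuilder sb = new StringBuilder();
    sb.append("{\"fieldName\": \"").append(name)
      .append("\", \"fieldType\": \"").append(type).append("\"");
    if (isArray) {
      sb.append(", \"isArray\": true");
    }
    if (!children.isEmpty()) {
      sb.append(", \"fields\": [");
      for (int i = 0; i < children.size(); i++) {
        if (i > 0) {
          sb.append(", ");
        }
        sb.append(children.get(i).toJson());
      }
      sb.append("]");
    }
    return sb.append("}").toString();
  }

  public static void main(String[] args) {
    FieldDef map = new FieldDef("mapField", "map")
        .child(new FieldDef("nestedField", "int"))
        .child(new FieldDef("nestedField2", "varchar"));
    System.out.println(map.toJson());
  }
}
```

The printed object can be pasted as one entry of the `providedSchema` array.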

### Example Provisioning the Schema with a JSON String
```json
"jsonOptions": {
  "jsonSchema": "{\"type\":\"tuple_schema\",\"columns\":[{\"name\":\"outer_map\",\"type\":\"STRUCT<`int_field` BIGINT, `int_array` ARRAY<BIGINT>>\",\"mode\":\"REQUIRED\"}]}"
}
```

You can print out a JSON string of a schema with the Java code below. 

```java
// Build a two-column schema: a nullable BIGINT and a nullable VARCHAR.
TupleMetadata schema = new SchemaBuilder()
    .addNullable("a", MinorType.BIGINT)
    .addNullable("m", MinorType.VARCHAR)
    .build();

// Mark column "m" so that it is read in JSON mode.
ColumnMetadata m = schema.metadata("m");
m.setProperty(JsonLoader.JSON_MODE, JsonLoader.JSON_LITERAL_MODE);

System.out.println(schema.jsonString());
```

This will generate something like the JSON string below:

```json
{
  "type":"tuple_schema",
  "columns":[
    {"name":"a","type":"BIGINT","mode":"OPTIONAL"},
    {"name":"m","type":"VARCHAR","mode":"OPTIONAL","properties":{"drill.json-mode":"json"}}
  ]
}
```
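
To use such a string as the value of `jsonSchema`, it has to be embedded as a JSON string, which means escaping its quotes as in the `jsonSchema` example earlier. A minimal sketch of that step in plain Java follows. `JsonSchemaEscaper` and `toJsonStringLiteral` are hypothetical names, not part of Drill, and any JSON library's string serializer would do the same job.

```java
// Hypothetical helper, not part of Drill: turns a schema JSON document
// into a quoted, escaped JSON string value suitable for "jsonSchema".
class JsonSchemaEscaper {

  // Escapes backslashes, double quotes and control characters,
  // then wraps the result in double quotes.
  static String toJsonStringLiteral(String raw) {
    StringBuilder sb = new StringBuilder("\"");
    for (int i = 0; i < raw.length(); i++) {
      char c = raw.charAt(i);
      switch (c) {
        case '\\': sb.append("\\\\"); break;
        case '"':  sb.append("\\\""); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            // Remaining control characters become a four-digit hex escape.
            sb.append('\\').append('u').append(String.format("%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.append('"').toString();
  }

  public static void main(String[] args) {
    // Stand-in for the string returned by schema.jsonString().
    String schemaJson =
        "{\"type\":\"tuple_schema\",\"columns\":"
        + "[{\"name\":\"a\",\"type\":\"BIGINT\",\"mode\":\"OPTIONAL\"}]}";
    System.out.println("\"jsonSchema\": " + toJsonStringLiteral(schemaJson));
  }
}
```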

## Dealing With Inconsistent Schemas
One of the major challenges of interacting with JSON data is an inconsistent 
schema. Drill has a `UNION` data type, which is marked as experimental. At 
the time of writing, the HTTP plugin does not support the `UNION` type; 
however, supplying a schema can solve many of those issues.

### JSON Mode
Drill offers the option of reading all JSON values as a string. While this can 
complicate downstream analytics, it can also be a more memory-efficient way of 
reading data with an inconsistent schema. Unfortunately, at the time of 
writing, JSON mode is only available with a provided schema. However, future 
work will allow this mode to be enabled for any JSON data.
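
To make the idea concrete, the sketch below shows what "reading a value as a string" means when the same field is a number in one record and an object in another. It is plain Java written for illustration only, not Drill's implementation. `JsonModeSketch` and `rawValue` are hypothetical names, the scanner assumes well-formed input, and the exact text Drill produces for a given value may differ.

```java
// Illustration only: extracts the raw JSON text of a top-level field,
// which is roughly what a column read in JSON mode would carry.
class JsonModeSketch {

  static int skipWhitespace(String s, int i) {
    while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
      i++;
    }
    return i;
  }

  // Returns the index just past the JSON value that starts at 'start'.
  static int endOfValue(String json, int start) {
    int depth = 0;
    boolean inString = false;
    for (int i = start; i < json.length(); i++) {
      char c = json.charAt(i);
      if (inString) {
        if (c == '\\') {
          i++;                              // skip the escaped character
        } else if (c == '"') {
          inString = false;
          if (depth == 0) return i + 1;     // end of a string value
        }
      } else if (c == '"') {
        inString = true;
      } else if (c == '{' || c == '[') {
        depth++;
      } else if (c == '}' || c == ']') {
        if (depth == 0) return i;           // enclosing object closed: scalar ends
        depth--;
        if (depth == 0) return i + 1;       // end of an object or array value
      } else if (c == ',' && depth == 0) {
        return i;                           // end of a scalar value
      }
    }
    return json.length();
  }

  // Returns the raw text of a top-level field, or null if it is absent.
  static String rawValue(String json, String field) {
    String key = "\"" + field + "\"";
    int depth = 0;
    boolean inString = false;
    for (int i = 0; i < json.length(); i++) {
      char c = json.charAt(i);
      if (inString) {
        if (c == '\\') i++;
        else if (c == '"') inString = false;
        continue;
      }
      if (c == '"') {
        if (depth == 1 && json.startsWith(key, i)) {
          int colon = skipWhitespace(json, i + key.length());
          if (colon < json.length() && json.charAt(colon) == ':') {
            int start = skipWhitespace(json, colon + 1);
            return json.substring(start, endOfValue(json, start)).trim();
          }
        }
        inString = true;
      } else if (c == '{' || c == '[') {
        depth++;
      } else if (c == '}' || c == ']') {
        depth--;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // The "value" field changes type between records.
    String[] records = {
      "{\"id\": 1, \"value\": 42}",
      "{\"id\": 2, \"value\": {\"a\": [1, 2]}}"
    };
    for (String record : records) {
      System.out.println(rawValue(record, "value"));
    }
  }
}
```

Both records yield a plain string for `value`, so a single `varchar` column can hold them; parsing that string is left to downstream queries.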

#### Enabling JSON Mode:
You can enable JSON mode simply by adding the `drill.json-mode` property with a 
value of `json` to a field, as shown below:

```json
{
  "fieldName": "jsonField",
  "fieldType": "varchar",
  "properties": {
    "drill.json-mode": "json"
  }
}
```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
