[
https://issues.apache.org/jira/browse/DRILL-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530746#comment-17530746
]
ASF GitHub Bot commented on DRILL-8204:
---------------------------------------
jnturton commented on PR #2526:
URL: https://github.com/apache/drill/pull/2526#issuecomment-1114915565
> For now, drill metastore supports only easy file formats and parquet, but
in the future, it could handle the HTTP plugin.
@vvysotskyi I think that the HTTP plugin makes use of the same readers as
the easy format plugins (CSV, JSON, XML)? Does that mean that metastore might
already work with HTTP, or are there likely to be pieces missing?
> Allow Provided Schema for HTTP Plugin in JSON Mode
> --------------------------------------------------
>
> Key: DRILL-8204
> URL: https://issues.apache.org/jira/browse/DRILL-8204
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.20.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 2.0.0
>
>
> One of the challenges of querying APIs is inconsistent data. Drill allows you
> to provide a schema for individual endpoints. You can do this in one of two
> ways: either by listing the fields in `providedSchema`, or by
> providing a serialized TupleMetadata of the desired schema in `jsonSchema`.
> The latter is advanced functionality and should only be used by advanced
> Drill users.
> Schema provisioning currently supports the complex types Array and Map
> at any nesting level.
> ### Example Schema Provisioning:
> ```json
> "jsonOptions": {
> "providedSchema": [
> {
> "fieldName": "int_field",
> "fieldType": "bigint"
> }, {
> "fieldName": "jsonField",
> "fieldType": "varchar",
> "properties": {
> "drill.json-mode":"json"
> }
> },{
> // Array field
> "fieldName": "stringField",
> "fieldType": "varchar",
> "isArray": true
> }, {
> // Map field
> "fieldName": "mapField",
> "fieldType": "map",
> "fields": [
> {
> "fieldName": "nestedField",
> "fieldType": "int"
> },{
> "fieldName": "nestedField2",
> "fieldType": "varchar"
> }
> ]
> }
> ]
> }
> ```
> ### Example Provisioning the Schema with a JSON String
> ```json
> "jsonOptions": {
> "jsonSchema":
> "{\"type\":\"tuple_schema\",\"columns\":[{\"name\":\"outer_map\",\"type\":\"STRUCT<`int_field` BIGINT, `int_array` ARRAY<BIGINT>>\",\"mode\":\"REQUIRED\"}]}"
> }
> ```
> You can print out a JSON string of a schema with the Java code below.
> ```java
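> import org.apache.drill.common.types.TypeProtos.MinorType;
> import org.apache.drill.exec.record.metadata.ColumnMetadata;
> import org.apache.drill.exec.record.metadata.SchemaBuilder;
> import org.apache.drill.exec.record.metadata.TupleMetadata;
> import org.apache.drill.exec.store.easy.json.loader.JsonLoader;
>
> TupleMetadata schema = new SchemaBuilder()
> .addNullable("a", MinorType.BIGINT)
> .addNullable("m", MinorType.VARCHAR)
> .build();
> // Mark column "m" so that its value is read in JSON mode (as a string).
> ColumnMetadata m = schema.metadata("m");
> m.setProperty(JsonLoader.JSON_MODE, JsonLoader.JSON_LITERAL_MODE);
> System.out.println(schema.jsonString());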
> ```
> This will generate something like the JSON string below:
> ```json
> {
> "type":"tuple_schema",
> "columns":[
> {"name":"a","type":"BIGINT","mode":"OPTIONAL"},
> {"name":"m","type":"VARCHAR","mode":"OPTIONAL","properties":{"drill.json-mode":"json"}}
> ]
> }
> ```
> ## Dealing With Inconsistent Schemas
> One of the major challenges of interacting with JSON data is handling an
> inconsistent schema. Drill has a `UNION` data type, which is marked as
> experimental. At the time of writing, the HTTP plugin does not support
> `UNION`; however, supplying a schema can solve many of these issues.
> ### JSON Mode
> Drill offers the option of reading all JSON values as strings. While this
> can complicate downstream analytics, it can also be a more memory-efficient
> way of reading data with an inconsistent schema. Unfortunately, at the time
> of writing, JSON mode is only available with a provided schema. However,
> future work will allow this mode to be enabled for any JSON data.
> #### Enabling JSON Mode:
> You can enable JSON mode simply by adding the `drill.json-mode` property with
> a value of `json` to a field, as shown below:
> ```json
> {
> "fieldName": "jsonField",
> "fieldType": "varchar",
> "properties": {
> "drill.json-mode": "json"
> }
> }
> ```
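The `jsonSchema` value in the issue description above is the schema's JSON text embedded as an escaped string. As a rough sketch of that escaping step in plain Java, the snippet below wraps raw JSON text in quotes and escapes it. The class `JsonSchemaEscaper` and the method `toJsonStringLiteral` are hypothetical names and not part of Drill, and the sample string only stands in for the output of `schema.jsonString()`.

```java
public class JsonSchemaEscaper {

  // Wraps raw text in double quotes and escapes it so that it can be
  // used as a JSON string value (hypothetical helper, not a Drill API).
  public static String toJsonStringLiteral(String raw) {
    StringBuilder sb = new StringBuilder("\"");
    for (int i = 0; i < raw.length(); i++) {
      char c = raw.charAt(i);
      switch (c) {
        case '"':  sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            // Remaining control characters use the \\uXXXX form.
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.append('"').toString();
  }

  public static void main(String[] args) {
    // Assumed stand-in for the output of schema.jsonString().
    String schemaJson = "{\"type\":\"tuple_schema\",\"columns\":["
        + "{\"name\":\"a\",\"type\":\"BIGINT\",\"mode\":\"OPTIONAL\"}]}";
    System.out.println("\"jsonSchema\": " + toJsonStringLiteral(schemaJson));
  }
}
```

The printed line can then be pasted into the `jsonOptions` block of the endpoint configuration. In practice a JSON library would normally do this escaping; the hand-written loop is only there to keep the sketch free of dependencies.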
--
This message was sent by Atlassian Jira
(v8.20.7#820007)