jihoonson commented on a change in pull request #9171: Doc update for the new
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367659050
##########
File path: docs/ingestion/data-formats.md
##########
@@ -63,155 +65,968 @@ _TSV (Delimited)_
Note that the CSV and TSV data do not contain column headers. This becomes
important when you specify the data for ingestion.
+Besides text formats, Druid also supports binary formats such as [ORC](#orc) and [Parquet](#parquet).
+
## Custom Formats
Druid supports custom data formats and can use the `Regex` parser or the
`JavaScript` parser to parse these formats. Please note that using either of
these parsers to
parse data will not be as efficient as writing a native Java parser or using
an external stream processor. We welcome contributions of new parsers.
-## Configuration
+## Input Format
+
+> The Input Format is a new way to specify the data format of your input data, introduced in 0.17.0.
+Unfortunately, the Input Format doesn't yet support all data formats or ingestion methods supported by Druid.
+In particular, if you want to use Hadoop ingestion, you still need to use the [Parser](#parser-deprecated).
+If your data is in a format not listed in this section, please consider using the Parser instead.
-All forms of Druid ingestion require some form of schema object. The format of
the data to be ingested is specified using the`parseSpec` entry in your
`dataSchema`.
+All forms of Druid ingestion require some form of schema object. The format of
the data to be ingested is specified using the `inputFormat` entry in your
[`ioConfig`](index.md#ioconfig).
### JSON
+The `inputFormat` to load data of the JSON format. An example is:
+
+```json
+"ioConfig": {
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+}
+```
+
+The JSON `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `json`. | yes |
+| flattenSpec | JSON Object | Specifies flattening configuration for nested
JSON data. See [`flattenSpec`](#flattenspec) for more info. | no |
+| featureSpec | JSON Object | [JSON parser
features](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features)
supported by Jackson library. Those features will be applied when parsing the
input JSON data. | no |
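+
+For illustration, a `json` `inputFormat` that also enables Jackson parser features might look like the following. This is a sketch, not a configuration from the original document; the feature names (`ALLOW_COMMENTS`, `ALLOW_SINGLE_QUOTES`) are taken from the Jackson `JsonParser` feature list linked above:
+
+```json
+"ioConfig": {
+  "inputFormat": {
+    "type": "json",
+    "featureSpec": {
+      "ALLOW_COMMENTS": true,
+      "ALLOW_SINGLE_QUOTES": true
+    }
+  },
+  ...
+}
+```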
+
+### CSV
+
+The `inputFormat` to load data of the CSV format. An example is:
+
+```json
+"ioConfig": {
+ "inputFormat": {
+ "type": "csv",
+ "columns" :
["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"]
+ },
+ ...
+}
+```
+
+The CSV `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `csv`. | yes |
+| listDelimiter | String | A custom delimiter for multi-value dimensions. | no
(default == ctrl+A) |
+| columns | JSON array | Specifies the columns of the data. The columns should be in the same order as the columns of your data. | yes if `findColumnsFromHeader` is false or missing |
+| findColumnsFromHeader | Boolean | If this is set, the task will find the
column names from the header row. Note that `skipHeaderRows` will be applied
before finding column names from the header. For example, if you set
`skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip
the first two lines and then extract column information from the third line.
`columns` will be ignored if this is set to true. | no (default = false if
`columns` is set; otherwise null) |
+| skipHeaderRows | Integer | If this is set, the task will skip the first
`skipHeaderRows` rows. | no (default = 0) |
+
+### TSV (Delimited)
+
+The `inputFormat` to load data of a delimited format. An example is:
+
+```json
+"ioConfig": {
+  "inputFormat": {
+    "type": "tsv",
+    "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
+    "delimiter":"|"
+  },
+  ...
+}
+```
+
+The TSV `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `tsv`. | yes |
+| delimiter | String | A custom delimiter for data values. | no (default ==
`\t`) |
+| listDelimiter | String | A custom delimiter for multi-value dimensions. | no
(default == ctrl+A) |
+| columns | JSON array | Specifies the columns of the data. The columns should be in the same order as the columns of your data. | yes if `findColumnsFromHeader` is false or missing |
+| findColumnsFromHeader | Boolean | If this is set, the task will find the
column names from the header row. Note that `skipHeaderRows` will be applied
before finding column names from the header. For example, if you set
`skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip
the first two lines and then extract column information from the third line.
`columns` will be ignored if this is set to true. | no (default = false if
`columns` is set; otherwise null) |
+| skipHeaderRows | Integer | If this is set, the task will skip the first
`skipHeaderRows` rows. | no (default = 0) |
+
+Be sure to change the `delimiter` to the appropriate delimiter for your data.
Like CSV, you must specify the columns and which subset of the columns you want
indexed.
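+
+For example, if multi-value dimensions in your data use a different separator, you can override `listDelimiter` as well. This is an illustrative sketch with hypothetical columns, not a configuration from the original document:
+
+```json
+"ioConfig": {
+  "inputFormat": {
+    "type": "tsv",
+    "delimiter": "|",
+    "listDelimiter": ",",
+    "columns" : ["timestamp","page","added","deleted"]
+  },
+  ...
+}
+```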
+
+### ORC
+
+> You need to include the
[`druid-orc-extensions`](../development/extensions-core/orc.md) as an extension
to use the ORC input format.
+
+> If you are considering upgrading from earlier than 0.15.0 to 0.15.0 or a
higher version,
+> please read [Migration from 'contrib'
extension](../development/extensions-core/orc.md#migration-from-contrib-extension)
carefully.
+
+The `inputFormat` to load data of the ORC format. An example is:
+
+```json
+"ioConfig": {
+ "inputFormat": {
+ "type": "orc",
+ "flattenSpec": {
+ "useFieldDiscovery": true,
+ "fields": [
+ {
+ "type": "path",
+ "name": "nested",
+ "expr": "$.path.to.nested"
+ }
+ ]
+    },
+    "binaryAsString": false
+ },
+ ...
+}
+```
+
+The ORC `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `orc`. | yes |
+| flattenSpec | JSON Object | Specifies flattening configuration for nested ORC data. See [`flattenSpec`](#flattenspec) for more info. | no |
+| binaryAsString | Boolean | Specifies if the binary ORC column which is not logically marked as a string should be treated as a UTF-8 encoded string. | no (default == false) |
+
+### Parquet
+
+> You need to include the
[`druid-parquet-extensions`](../development/extensions-core/parquet.md) as an
extension to use the Parquet input format.
+
+The `inputFormat` to load data of the Parquet format. An example is:
+
+```json
+"ioConfig": {
+ "inputFormat": {
+ "type": "parquet",
+ "flattenSpec": {
+ "useFieldDiscovery": true,
+ "fields": [
+ {
+ "type": "path",
+ "name": "nested",
+ "expr": "$.path.to.nested"
+ }
+ ]
+    },
+    "binaryAsString": false
+ },
+ ...
+}
+```
+
+The Parquet `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `parquet`. | yes |
+| flattenSpec | JSON Object | Defines a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expressions are supported ('jq' is unavailable). | no (default will auto-discover 'root' level properties) |
+| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default == false) |
+
+### FlattenSpec
+
+The `flattenSpec` is located in `inputFormat` → `flattenSpec` and is
responsible for
+bridging the gap between potentially nested input data (such as JSON, Avro,
etc) and Druid's flat data model.
+An example `flattenSpec` is:
+
+```json
+"flattenSpec": {
+ "useFieldDiscovery": true,
+ "fields": [
+ { "name": "baz", "type": "root" },
+ { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" },
+ { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" }
+ ]
+}
+```
+> Conceptually, after input data records are read, the `flattenSpec` is
applied first before
+> any other specs such as [`timestampSpec`](./index.md#timestampspec),
[`transformSpec`](./index.md#transformspec),
+> [`dimensionsSpec`](./index.md#dimensionsspec), or
[`metricsSpec`](./index.md#metricsspec). Keep this in mind when writing
+> your ingestion spec.
+
+Flattening is only supported for [data formats](data-formats.md) that support
nesting, including `avro`, `json`, `orc`,
+and `parquet`.
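+
+To make the example above concrete, consider a hypothetical input record such as:
+
+```json
+{
+  "baz": 1,
+  "foo": { "bar": "abc" },
+  "thing": { "food": ["apple", "banana"] }
+}
+```
+
+With the example `flattenSpec`, this record would yield three flat fields: `baz` (value `1`), `foo_bar` (value `"abc"`), and `first_food` (value `"banana"`, since the `jq` expression `.thing.food[1]` selects the second array element).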
+
+A `flattenSpec` can have the following components:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| useFieldDiscovery | If true, interpret all root-level fields as available
fields for usage by [`timestampSpec`](./index.md#timestampspec),
[`transformSpec`](./index.md#transformspec),
[`dimensionsSpec`](./index.md#dimensionsspec), and
[`metricsSpec`](./index.md#metricsspec).<br><br>If false, only explicitly
specified fields (see `fields`) will be available for use. | `true` |
+| fields | Specifies the fields of interest and how they are accessed. [See
below for details.](#field-flattening-specifications) | `[]` |
+
+#### Field flattening specifications
+
+Each entry in the `fields` list can have the following components:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| type | Options are as follows:<br><br><ul><li>`root`, referring to a field
at the root level of the record. Only really useful if `useFieldDiscovery` is
false.</li><li>`path`, referring to a field using
[JsonPath](https://github.com/jayway/JsonPath) notation. Supported by most data
formats that offer nesting, including `avro`, `json`, `orc`, and
`parquet`.</li><li>`jq`, referring to a field using
[jackson-jq](https://github.com/eiiches/jackson-jq) notation. Only supported
for the `json` format.</li></ul> | none (required) |
+| name | Name of the field after flattening. This name can be referred to by
the [`timestampSpec`](./index.md#timestampspec),
[`transformSpec`](./index.md#transformspec),
[`dimensionsSpec`](./index.md#dimensionsspec), and
[`metricsSpec`](./index.md#metricsspec).| none (required) |
+| expr | Expression for accessing the field while flattening. For type `path`,
this should be [JsonPath](https://github.com/jayway/JsonPath). For type `jq`,
this should be [jackson-jq](https://github.com/eiiches/jackson-jq) notation.
For other types, this parameter is ignored. | none (required for types `path`
and `jq`) |
+
+#### Notes on flattening
+
+* For convenience, when defining a root-level field, it is possible to define
only the field name, as a string, instead of a JSON object. For example,
`{"name": "baz", "type": "root"}` is equivalent to `"baz"`.
+* Enabling `useFieldDiscovery` will only automatically detect "simple" fields
at the root level that correspond to data types that Druid supports. This
includes strings, numbers, and lists of strings or numbers. Other types will
not be automatically detected, and must be specified explicitly in the `fields`
list.
+* Duplicate field `name`s are not allowed. An exception will be thrown.
+* If `useFieldDiscovery` is enabled, any discovered field with the same name
as one already defined in the `fields` list will be skipped, rather than added
twice.
+* [http://jsonpath.herokuapp.com/](http://jsonpath.herokuapp.com/) is useful
for testing `path`-type expressions.
+* jackson-jq supports a subset of the full
[jq](https://stedolan.github.io/jq/) syntax. Please refer to the [jackson-jq
documentation](https://github.com/eiiches/jackson-jq) for details.
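+
+For instance, the first note above means the following two `fields` lists are equivalent:
+
+```json
+"fields": [ { "name": "baz", "type": "root" }, { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" } ]
+```
+
+```json
+"fields": [ "baz", { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" } ]
+```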
+
+## Parser (Deprecated)
Review comment:
Good point. I changed as below:
```
> The Parser is deprecated for [native batch tasks](./native-batch.md),
[Kafka indexing service](../development/extensions-core/kafka-ingestion.md),
and [Kinesis indexing
service](../development/extensions-core/kinesis-ingestion.md).
Consider using the [input format](#input-format) instead for these types of
ingestion.
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]