jihoonson commented on a change in pull request #9171: Doc update for the new
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367659050
##########
File path: docs/ingestion/data-formats.md
##########
@@ -63,155 +65,968 @@ _TSV (Delimited)_
Note that the CSV and TSV data do not contain column headers. This becomes
important when you specify the data for ingestion.
+Besides text formats, Druid also supports binary formats such as [ORC](#orc) and [Parquet](#parquet).
+
## Custom Formats
Druid supports custom data formats and can use the `Regex` parser or the
`JavaScript` parser to parse these formats. Please note that using either of
these parsers to
parse data will not be as efficient as writing a native Java parser or using
an external stream processor. We welcome contributions of new parsers.
-## Configuration
+## Input Format
+
+> The Input Format is a new way to specify the data format of your input data, introduced in 0.17.0.
+Unfortunately, the Input Format doesn't yet support all data formats or ingestion methods supported by Druid.
+In particular, if you want to use Hadoop ingestion, you still need to use the [Parser](#parser-deprecated).
+If your data is in a format not listed in this section, please consider using the Parser instead.
-All forms of Druid ingestion require some form of schema object. The format of
the data to be ingested is specified using the`parseSpec` entry in your
`dataSchema`.
+All forms of Druid ingestion require some form of schema object. The format of
the data to be ingested is specified using the `inputFormat` entry in your
[`ioConfig`](index.md#ioconfig).
### JSON
+The `inputFormat` to load data of the JSON format. An example is:
+
+```json
+"ioConfig": {
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+}
+```
+
+The JSON `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `json`. | yes |
+| flattenSpec | JSON Object | Specifies flattening configuration for nested
JSON data. See [`flattenSpec`](#flattenspec) for more info. | no |
+| featureSpec | JSON Object | [JSON parser
features](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features)
supported by Jackson library. Those features will be applied when parsing the
input JSON data. | no |
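+
+For illustration, a `json` `inputFormat` that also enables Jackson parser features might look like the following. This is a sketch, not a configuration from the original document; the feature names (`ALLOW_COMMENTS`, `ALLOW_SINGLE_QUOTES`) are taken from the Jackson `JsonParser` feature list linked above:
+
+```json
+"ioConfig": {
+  "inputFormat": {
+    "type": "json",
+    "featureSpec": {
+      "ALLOW_COMMENTS": true,
+      "ALLOW_SINGLE_QUOTES": true
+    }
+  },
+  ...
+}
+```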
+
+### CSV
+
+The `inputFormat` to load data of the CSV format. An example is:
+
+```json
+"ioConfig": {
+ "inputFormat": {
+ "type": "csv",
+ "columns" :
["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"]
+ },
+ ...
+}
+```
+
+The CSV `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `csv`. | yes |
+| listDelimiter | String | A custom delimiter for multi-value dimensions. | no
(default == ctrl+A) |
+| columns | JSON array | Specifies the columns of the data. The columns should be in the same order as the columns of your data. | yes if `findColumnsFromHeader` is false or missing |
+| findColumnsFromHeader | Boolean | If this is set, the task will find the
column names from the header row. Note that `skipHeaderRows` will be applied
before finding column names from the header. For example, if you set
`skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip
the first two lines and then extract column information from the third line.
`columns` will be ignored if this is set to true. | no (default = false if
`columns` is set; otherwise null) |
+| skipHeaderRows | Integer | If this is set, the task will skip the first
`skipHeaderRows` rows. | no (default = 0) |
+
+### TSV (Delimited)
+
+The `inputFormat` to load data of a delimited format. An example is:
+
+```json
+"ioConfig": {
+  "inputFormat": {
+    "type": "tsv",
+    "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
+    "delimiter":"|"
+  },
+  ...
+}
+```
+
+The TSV `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `tsv`. | yes |
+| delimiter | String | A custom delimiter for data values. | no (default ==
`\t`) |
+| listDelimiter | String | A custom delimiter for multi-value dimensions. | no
(default == ctrl+A) |
+| columns | JSON array | Specifies the columns of the data. The columns should be in the same order as the columns of your data. | yes if `findColumnsFromHeader` is false or missing |
+| findColumnsFromHeader | Boolean | If this is set, the task will find the
column names from the header row. Note that `skipHeaderRows` will be applied
before finding column names from the header. For example, if you set
`skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip
the first two lines and then extract column information from the third line.
`columns` will be ignored if this is set to true. | no (default = false if
`columns` is set; otherwise null) |
+| skipHeaderRows | Integer | If this is set, the task will skip the first
`skipHeaderRows` rows. | no (default = 0) |
+
+Be sure to change the `delimiter` to the appropriate delimiter for your data.
Like CSV, you must specify the columns and which subset of the columns you want
indexed.
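+
+For example, if multi-value dimensions in your data use a different separator, you can override `listDelimiter` as well. This is an illustrative sketch with hypothetical columns, not a configuration from the original document:
+
+```json
+"ioConfig": {
+  "inputFormat": {
+    "type": "tsv",
+    "delimiter": "|",
+    "listDelimiter": ",",
+    "columns" : ["timestamp","page","added","deleted"]
+  },
+  ...
+}
+```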
+
+### ORC
+
+> You need to include the
[`druid-orc-extensions`](../development/extensions-core/orc.md) as an extension
to use the ORC input format.
+
+> If you are considering upgrading from earlier than 0.15.0 to 0.15.0 or a
higher version,
+> please read [Migration from 'contrib'
extension](../development/extensions-core/orc.md#migration-from-contrib-extension)
carefully.
+
+The `inputFormat` to load data of the ORC format. An example is:
+
+```json
+"ioConfig": {
+ "inputFormat": {
+ "type": "orc",
+ "flattenSpec": {
+ "useFieldDiscovery": true,
+ "fields": [
+ {
+ "type": "path",
+ "name": "nested",
+ "expr": "$.path.to.nested"
+ }
+ ]
+    },
+    "binaryAsString": false
+ },
+ ...
+}
+```
+
+The ORC `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `orc`. | yes |
+| flattenSpec | JSON Object | Specifies flattening configuration for nested ORC data. See [`flattenSpec`](#flattenspec) for more info. | no |
+| binaryAsString | Boolean | Specifies if the binary ORC column which is not logically marked as a string should be treated as a UTF-8 encoded string. | no (default == false) |
+
+### Parquet
+
+> You need to include the
[`druid-parquet-extensions`](../development/extensions-core/parquet.md) as an
extension to use the Parquet input format.
+
+The `inputFormat` to load data of the Parquet format. An example is:
+
+```json
+"ioConfig": {
+ "inputFormat": {
+ "type": "parquet",
+ "flattenSpec": {
+ "useFieldDiscovery": true,
+ "fields": [
+ {
+ "type": "path",
+ "name": "nested",
+ "expr": "$.path.to.nested"
+ }
+ ]
+    },
+    "binaryAsString": false
+ },
+ ...
+}
+```
+
+The Parquet `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | This should say `parquet`. | yes |
+| flattenSpec | JSON Object | Defines a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expressions are supported ('jq' is unavailable). | no (default will auto-discover 'root' level properties) |
+| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default == false) |
+
+### FlattenSpec
+
+The `flattenSpec` is located in `inputFormat` → `flattenSpec` and is
responsible for
+bridging the gap between potentially nested input data (such as JSON, Avro,
etc) and Druid's flat data model.
+An example `flattenSpec` is:
+
+```json
+"flattenSpec": {
+ "useFieldDiscovery": true,
+ "fields": [
+ { "name": "baz", "type": "root" },
+ { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" },
+ { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" }
+ ]
+}
+```
+> Conceptually, after input data records are read, the `flattenSpec` is
applied first before
+> any other specs such as [`timestampSpec`](./index.md#timestampspec),
[`transformSpec`](./index.md#transformspec),
+> [`dimensionsSpec`](./index.md#dimensionsspec), or
[`metricsSpec`](./index.md#metricsspec). Keep this in mind when writing
+> your ingestion spec.
+
+Flattening is only supported for [data formats](data-formats.md) that support
nesting, including `avro`, `json`, `orc`,
+and `parquet`.
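+
+To make the example above concrete, consider a hypothetical input record such as:
+
+```json
+{
+  "baz": 1,
+  "foo": { "bar": "abc" },
+  "thing": { "food": ["apple", "banana"] }
+}
+```
+
+With the example `flattenSpec`, this record would yield three flat fields: `baz` (value `1`), `foo_bar` (value `"abc"`), and `first_food` (value `"banana"`, since the `jq` expression `.thing.food[1]` selects the second array element).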
+
+A `flattenSpec` can have the following components:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| useFieldDiscovery | If true, interpret all root-level fields as available
fields for usage by [`timestampSpec`](./index.md#timestampspec),
[`transformSpec`](./index.md#transformspec),
[`dimensionsSpec`](./index.md#dimensionsspec), and
[`metricsSpec`](./index.md#metricsspec).<br><br>If false, only explicitly
specified fields (see `fields`) will be available for use. | `true` |
+| fields | Specifies the fields of interest and how they are accessed. [See
below for details.](#field-flattening-specifications) | `[]` |
+
+#### Field flattening specifications
+
+Each entry in the `fields` list can have the following components:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| type | Options are as follows:<br><br><ul><li>`root`, referring to a field
at the root level of the record. Only really useful if `useFieldDiscovery` is
false.</li><li>`path`, referring to a field using
[JsonPath](https://github.com/jayway/JsonPath) notation. Supported by most data
formats that offer nesting, including `avro`, `json`, `orc`, and
`parquet`.</li><li>`jq`, referring to a field using
[jackson-jq](https://github.com/eiiches/jackson-jq) notation. Only supported
for the `json` format.</li></ul> | none (required) |
+| name | Name of the field after flattening. This name can be referred to by
the [`timestampSpec`](./index.md#timestampspec),
[`transformSpec`](./index.md#transformspec),
[`dimensionsSpec`](./index.md#dimensionsspec), and
[`metricsSpec`](./index.md#metricsspec).| none (required) |
+| expr | Expression for accessing the field while flattening. For type `path`,
this should be [JsonPath](https://github.com/jayway/JsonPath). For type `jq`,
this should be [jackson-jq](https://github.com/eiiches/jackson-jq) notation.
For other types, this parameter is ignored. | none (required for types `path`
and `jq`) |
+
+#### Notes on flattening
+
+* For convenience, when defining a root-level field, it is possible to define
only the field name, as a string, instead of a JSON object. For example,
`{"name": "baz", "type": "root"}` is equivalent to `"baz"`.
+* Enabling `useFieldDiscovery` will only automatically detect "simple" fields
at the root level that correspond to data types that Druid supports. This
includes strings, numbers, and lists of strings or numbers. Other types will
not be automatically detected, and must be specified explicitly in the `fields`
list.
+* Duplicate field `name`s are not allowed. An exception will be thrown.
+* If `useFieldDiscovery` is enabled, any discovered field with the same name
as one already defined in the `fields` list will be skipped, rather than added
twice.
+* [http://jsonpath.herokuapp.com/](http://jsonpath.herokuapp.com/) is useful
for testing `path`-type expressions.
+* jackson-jq supports a subset of the full
[jq](https://stedolan.github.io/jq/) syntax. Please refer to the [jackson-jq
documentation](https://github.com/eiiches/jackson-jq) for details.
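+
+For instance, the first note above means the following two `fields` lists are equivalent:
+
+```json
+"fields": [ { "name": "baz", "type": "root" }, { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" } ]
+```
+
+```json
+"fields": [ "baz", { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" } ]
+```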
+
+## Parser (Deprecated)
Review comment:
Good point. I changed as below:
```
> The Parser is deprecated for [native batch tasks](./native-batch.md),
[Kafka indexing service](../development/extensions-core/kafka-ingestion.md),
and [Kinesis indexing
service](../development/extensions-core/kinesis-ingestion.md).
Consider using the [input format](#input-format) instead for these types of
ingestion.
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]