ektravel commented on a change in pull request #11796:
URL: https://github.com/apache/druid/pull/11796#discussion_r729879169
##########
File path: docs/ingestion/data-formats.md
##########
@@ -67,12 +67,12 @@ Note that the CSV and TSV data do not contain column heads.
This becomes importa
Besides text formats, Druid also supports binary formats such as [Orc](#orc)
and [Parquet](#parquet) formats.
-## Custom Formats
+## Custom formats
Druid supports custom data formats and can use the `Regex` parser or the
`JavaScript` parsers to parse these formats. Please note that using any of
these parsers for
Review comment:
```suggestion
Druid supports custom data formats and can use the Regex parser or the
JavaScript parsers to parse these formats. Using any of these parsers for
```
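For readers of this section, a Regex parser spec might look like the following sketch; the pattern, column names, and specs are made up for illustration and are not from the PR:

```json
{
  "type": "string",
  "parseSpec": {
    "format": "regex",
    "pattern": "^(\\d+)\\t(\\w+)$",
    "columns": ["ts", "page"],
    "timestampSpec": { "column": "ts", "format": "millis" },
    "dimensionsSpec": { "dimensions": ["page"] }
  }
}
```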
##########
File path: docs/ingestion/data-formats.md
##########
@@ -67,12 +67,12 @@ Note that the CSV and TSV data do not contain column heads.
This becomes importa
Besides text formats, Druid also supports binary formats such as [Orc](#orc)
and [Parquet](#parquet) formats.
-## Custom Formats
+## Custom formats
Druid supports custom data formats and can use the `Regex` parser or the
`JavaScript` parsers to parse these formats. Please note that using any of
these parsers for
parsing data will not be as efficient as writing a native Java parser or using
an external stream processor. We welcome contributions of new Parsers.
Review comment:
```suggestion
parsing data is less efficient than writing a native Java parser or using an
external stream processor. We welcome contributions of new parsers.
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -87,7 +87,7 @@ Configure the JSON `inputFormat` to load JSON data as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `json`. | yes |
+| type | String | `json`| yes |
Review comment:
```suggestion
| type | String | JSON Object | yes |
```
Changed to JSON Object for consistency with the other lines.
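For context, the minimal JSON `inputFormat` this table row describes is just:

```json
{
  "inputFormat": {
    "type": "json"
  }
}
```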
##########
File path: docs/ingestion/data-formats.md
##########
@@ -107,7 +107,7 @@ Configure the CSV `inputFormat` to load CSV data as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `csv`. | yes |
+| type | String | `csv` | yes |
Review comment:
```suggestion
| type | String | CSV | yes |
```
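For context, a minimal CSV `inputFormat` for this row might look like the sketch below; the column names are illustrative, not from the PR:

```json
{
  "inputFormat": {
    "type": "csv",
    "findColumnsFromHeader": false,
    "columns": ["timestamp", "page", "count"]
  }
}
```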
##########
File path: docs/ingestion/data-formats.md
##########
@@ -130,7 +130,7 @@ Configure the TSV `inputFormat` to load TSV data as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `tsv`. | yes |
+| type | String | `tsv`| yes |
Review comment:
```suggestion
| type | String | TSV | yes |
```
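Likewise, a minimal TSV `inputFormat` sketch for this row (column names illustrative):

```json
{
  "inputFormat": {
    "type": "tsv",
    "delimiter": "\t",
    "columns": ["timestamp", "page", "count"]
  }
}
```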
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and
value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including
header, key, and value.
-```json
+> The Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns.
| no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's
timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no
(default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka
headers. Supports "string" types. Kafka header values are bytes, therefore the
parser decodes it as a UTF-8 encoded string. To change this
behavior,implementing your own parser based on the encoding style. You must
change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom
implementation. | no |
Review comment:
```suggestion
| headerFormat | Object | `headerFormat` specifies how to parse the kafka
headers. Supports "string" types. Because kafka header values are bytes, the
parser decodes them as UTF-8 encoded strings. To change this behavior,
implement your own parser based on the encoding style. Change the 'encoding'
type in `KafkaStringHeaderFormat` to match your custom implementation. | no |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and
value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including
header, key, and value.
-```json
+> The Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns.
| no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's
timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no
(default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka
headers. Supports "string" types. Kafka header values are bytes, therefore the
parser decodes it as a UTF-8 encoded string. To change this
behavior,implementing your own parser based on the encoding style. You must
change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom
implementation. | no |
+| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used
to parse the kafka key. It only process the first entry of the input format.
See [Specifying data
format](../development/extensions-core/kafka-ingestion.md#specifying-data-format)
for details. | no |
Review comment:
```suggestion
| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used
to parse the kafka key. It only processes the first entry of the input format.
For details, see [Specifying data
format](../development/extensions-core/kafka-ingestion.md#specifying-data-format).
| no |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and
value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including
header, key, and value.
-```json
+> The Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns.
| no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's
timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no
(default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka
headers. Supports "string" types. Kafka header values are bytes, therefore the
parser decodes it as a UTF-8 encoded string. To change this
behavior,implementing your own parser based on the encoding style. You must
change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom
implementation. | no |
+| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used
to parse the kafka key. It only process the first entry of the input format.
See [Specifying data
format](../development/extensions-core/kafka-ingestion.md#specifying-data-format)
for details. | no |
+| valueFormat | [InputFormat](#input-format) | valueFormat can be any existing
inputFormat to parse the kafka value payload. See [Specifying data
format](../development/extensions-core/kafka-ingestion.md#specifying-data-format)
for details about specifying the input format. | yes |
Review comment:
```suggestion
| valueFormat | [InputFormat](#input-format) | `valueFormat` can be any
existing `inputFormat` to parse the kafka value payload. For details about
specifying the input format, see [Specifying data
format](../development/extensions-core/kafka-ingestion.md#specifying-data-format).
| yes |
```
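Pulling the field table above together, a Kafka `inputFormat` spec using only the documented fields might look like this sketch (the `json` key/value formats are an illustrative choice):

```json
{
  "inputFormat": {
    "type": "kafka",
    "headerLabelPrefix": "kafka.header.",
    "timestampColumnName": "kafka.timestamp",
    "keyColumnName": "kafka.key",
    "headerFormat": { "type": "string" },
    "keyFormat": { "type": "json" },
    "valueFormat": { "type": "json" }
  }
}
```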
##########
File path: docs/ingestion/data-formats.md
##########
@@ -179,47 +192,28 @@ The `inputFormat` to load complete kafka record including
header, key and value.
}
```
-The KAFKA `inputFormat` has the following components:
-
-> Note that KAFKA inputFormat is currently designated as experimental.
-
-| Field | Type | Description | Required |
-|-------|------|-------------|----------|
-| type | String | This should say `kafka`. | yes |
-| headerLabelPrefix | String | A custom label prefix for all the header
columns. | no (default = "kafka.header.") |
-| timestampColumnName | String | Specifies the name of the column for the
kafka record's timestamp.| no (default = "kafka.timestamp") |
-| keyColumnName | String | Specifies the name of the column for the kafka
record's key.| no (default = "kafka.key") |
-| headerFormat | Object | headerFormat specifies how to parse the kafka
headers. Current supported type is "string". Since header values are bytes, the
current parser by defaults reads it as UTF-8 encoded strings. There is
flexibility to change this behavior by implementing your very own parser based
on the encoding style. The 'encoding' type in KafkaStringHeaderFormat class
needs to change with the custom implementation. | no |
-| keyFormat | [InputFormat](#input-format) | keyFormat can be any existing
inputFormat to parse the kafka key. The current behavior is to only process the
first entry of the input format. See [the below
section](../development/extensions-core/kafka-ingestion.md#specifying-data-format)
for details about specifying the input format. | no |
-| valueFormat | [InputFormat](#input-format) | valueFormat can be any existing
inputFormat to parse the kafka value payload. See [the below
section](../development/extensions-core/kafka-ingestion.md#specifying-data-format)
for details about specifying the input format. | yes |
+Note the following behaviors:
+- If there are conflicts between column names, Druid uses the column names
from the payload and ignores the column name from the header or key. This
behavior makes it easier to migrate to the Kafka `inputFormat` from another
Kafka ingestion spec without losing data.
+- The Kafka input format fundamentally blends information from the header,
key, and value objects from a Kafka record to create a row in Druid. It
extracts individual records from the value. Then it augments each value with
the corresponding key or header columns.
+- The Kafka input format by default exposes Kafka timestamp
`timestampColumnName` to make it available for use as the primary timestamp
column. Alternatively you can choose timestamp column from either the key or
value payload.
+For example, the following `timestampSpec` uses the default Kafka timestamp
from the Kafka record:
```
-> For any conflicts in dimension/metric names, this inputFormat will prefer
kafka value's column names.
-> This will enable seemless porting of existing kafka ingestion inputFormat to
this new format, with additional columns from kafka header and key.
-
-> Kafka input format fundamentally blends information from header, key and
value portions of a kafka record to create a druid row. It does this by
-> exploding individual records from the value and augmenting each of these
values with the selected key/header columns.
-
-> Kafka input format also by default exposes kafka timestamp
(timestampColumnName), which can be used as the primary timestamp column.
-> One can also choose timestamp column from either key or value payload, if
there is no timestamp available then the default kafka timestamp is our savior.
-> eg.,
-
- // Below timestampSpec chooses kafka's default timestamp that is available
in kafka record
"timestampSpec":
{
"column": "kafka.timestamp",
"format": "millis"
}
+```
- // Assuming there is a timestamp field in the header and we have
"kafka.header." as a desired prefix for header columns,
- // below example chooses header's timestamp as a primary timestamp column
+If you are using "kafka.header." as the prefix for Kafka header columns and
there is a timestamp field in the header, the following example uses the header
timestamp as the primary timestamp column:
Review comment:
```suggestion
If you are using "kafka.header." as the prefix for Kafka header columns and
there is a timestamp field in the header, the header timestamp serves as the
primary timestamp column. For example:
```
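Following the pattern of the `kafka.timestamp` example in the diff, the header-based `timestampSpec` described here might look like this; the header field name `timestamp` and the `millis` format are assumptions for illustration:

```json
"timestampSpec": {
  "column": "kafka.header.timestamp",
  "format": "millis"
}
```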
##########
File path: docs/ingestion/data-formats.md
##########
@@ -262,7 +256,7 @@ Configure the Parquet `inputFormat` to load Parquet data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
+|type| String| `parquet`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Parquet file. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object | Defines a [`flattenSpec`](#flattenspec) to
extract nested values from a Parquet file. Note that only 'path' expressions
are supported ('jq' is unavailable).| no (default will auto-discover 'root'
level properties) |
```
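Since this row keeps referring to 'path' expressions, a `flattenSpec` sketch may help readers; the field name and path are made up for illustration:

```json
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "nested_value", "expr": "$.path.to.value" }
  ]
}
```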
##########
File path: docs/ingestion/data-formats.md
##########
@@ -262,7 +256,7 @@ Configure the Parquet `inputFormat` to load Parquet data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
+|type| String| `parquet`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Parquet file. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object | Defines a [`flattenSpec`](#flattenspec) to
extract nested values from a Parquet file. Only 'path' expressions are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -262,7 +256,7 @@ Configure the Parquet `inputFormat` to load Parquet data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
+|type| String| `parquet`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Parquet file. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object | Define a [`flattenSpec`](#flattenspec) to
extract nested values from a Parquet file. Only 'path' expressions are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -297,7 +291,7 @@ Configure the Avro `inputFormat` to load Avro data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_stream` to read Avro serialized
data| yes |
+|type| String| `avro_stream`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Avro record. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from an Avro record. Only 'path' expressions are supported ('jq'
is unavailable).| no (default will auto-discover 'root' level properties) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -412,7 +406,7 @@ This Avro bytes decoder first extracts `subject` and `id`
from the input message
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_repo`. | no |
+| type | String | `schema_repo` | no |
| subjectAndIdConverter | JSON Object | Specifies how to extract the subject
and id from message bytes. | yes |
Review comment:
```suggestion
| subjectAndIdConverter | JSON Object | Specifies how to extract the subject
and ID from message bytes. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -442,7 +436,7 @@ For details, see the Schema Registry
[documentation](http://docs.confluent.io/cu
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_registry`. | no |
+| type | String | `schema_registry` | no |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
| capacity | Integer | Specifies the max size of the cache (default =
Integer.MAX_VALUE). | no |
| urls | Array<String> | Specifies the url endpoints of the multiple Schema
Registry instances. | yes(if `url` is not provided) |
Review comment:
```suggestion
| urls | Array<String> | Specifies the url endpoints of the multiple Schema
Registry instances. | yes (if `url` is not provided) |
```
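A sketch of the multi-instance Schema Registry decoder this table describes, using only the documented fields (`urls`, `capacity`); hostnames are illustrative:

```json
"avroBytesDecoder": {
  "type": "schema_registry",
  "urls": ["http://registry-1:8081", "http://registry-2:8081"],
  "capacity": 100
}
```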
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,7 +498,7 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
+|type| String| `avro_ocf`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Avro records. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from Avro records. Note that only 'path' expressions are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,7 +498,7 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
+|type| String| `avro_ocf`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Avro records. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from Avro records. Only 'path' expressions are supported ('jq' is
unavailable).| no (default will auto-discover 'root' level properties) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,7 +498,7 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
+|type| String| `avro_ocf`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Avro records. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
|schema| JSON Object |Define a reader schema to be used when parsing Avro
records, this is useful when parsing multiple versions of Avro OCF file data |
no (default will decode using the writer schema contained in the OCF file) |
Review comment:
```suggestion
|schema| JSON Object |Define a reader schema to be used when parsing Avro
records. This is useful when parsing multiple versions of Avro OCF file data. |
no (default will decode using the writer schema contained in the OCF file) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -645,7 +639,7 @@ Each line can be further parsed using
[`parseSpec`](#parsespec).
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `string` in general, or `hadoopyString` when
used in a Hadoop indexing job. | yes |
+| type | String | `string` for most cases. `hadoopyString` for Hadoop indexing
| yes |
Review comment:
```suggestion
| type | String | `string` for most cases. `hadoopyString` for Hadoop
indexing. | yes |
```
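A sketch of the Hadoop variant this row distinguishes; the nested `parseSpec` values are illustrative, not from the PR:

```json
{
  "type": "hadoopyString",
  "parseSpec": {
    "format": "json",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```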
##########
File path: docs/ingestion/data-formats.md
##########
@@ -959,7 +953,7 @@ JSON path expressions for all supported types.
|Field | Type | Description
| Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-| type | String | This should say `parquet`.| yes |
+| type | String | `parquet`| yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims`
and `parquet` | yes |
Review comment:
```suggestion
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims`
and `parquet`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -718,7 +712,7 @@ The `inputFormat` of `inputSpec` in `ioConfig` must be set
to `"org.apache.orc.m
|Field | Type | Description
| Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-|type | String | This should say `orc`
| yes|
+|type | String|`orc`| yes|
|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data
(`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format) | yes|
Review comment:
```suggestion
|parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data (`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format). | yes|
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1222,7 +1216,7 @@ This parser is for [stream
ingestion](./index.md#streaming) and reads Protocol b
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `protobuf`. | yes |
+| type | String | `protobuf` | yes |
| `protoBytesDecoder` | JSON Object | Specifies how to decode bytes to
Protobuf record. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data. The format must be JSON. See [JSON ParseSpec](#json-parsespec) for more
configuration options. Note that timeAndDims parseSpec is no longer supported.
| yes |
Review comment:
```suggestion
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data. The format must be JSON. See [JSON ParseSpec](#json-parsespec) for more
configuration options. `timeAndDims` `parseSpec` is no longer supported. | yes
|
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1222,7 +1216,7 @@ This parser is for [stream
ingestion](./index.md#streaming) and reads Protocol b
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `protobuf`. | yes |
+| type | String | `protobuf` | yes |
| `protoBytesDecoder` | JSON Object | Specifies how to decode bytes to
Protobuf record. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data. The format must be JSON. See [JSON ParseSpec](#json-parsespec) for more
configuration options. Note that timeAndDims parseSpec is no longer supported.
| yes |
Review comment:
```suggestion
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data. The format must be JSON. See [JSON ParseSpec](#json-parsespec) for more
configuration options. Note that `timeAndDims` `parseSpec` is no longer
supported. | yes |
```
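Combining this table with the `file` decoder table further down, a Protobuf parser spec might look like the sketch below; the descriptor path and message type are made up for illustration:

```json
{
  "type": "protobuf",
  "protoBytesDecoder": {
    "type": "file",
    "descriptor": "file:///tmp/metrics.desc",
    "protoMessageType": "Metrics"
  },
  "parseSpec": {
    "format": "json",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```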
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1273,7 +1267,7 @@ This Protobuf bytes decoder first read a descriptor file,
and then parse it to g
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `file`. | yes |
+| type | String | `file` | yes |
| descriptor | String | Protobuf descriptor file name in the classpath or URL.
| yes |
| protoMessageType | String | Protobuf message type in the descriptor. Both
short name and fully qualified name are accepted. The parser uses the first
message type found in the descriptor if not specified. | no |
Review comment:
```suggestion
| protoMessageType | String | Protobuf message type in the descriptor. Both
short name and fully qualified name are accepted. The parser uses the first
message type found in the descriptor if not specified. | no |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1294,7 +1288,7 @@ For details, see the Schema Registry
[documentation](http://docs.confluent.io/cu
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_registry`. | yes |
+| type | String | `schema_registry`| yes |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
Review comment:
```suggestion
| url | String | Specifies the URL endpoint of the Schema Registry. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1294,7 +1288,7 @@ For details, see the Schema Registry
[documentation](http://docs.confluent.io/cu
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_registry`. | yes |
+| type | String | `schema_registry`| yes |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
| capacity | Integer | Specifies the max size of the cache (default =
Integer.MAX_VALUE). | no |
| urls | Array<String> | Specifies the url endpoints of the multiple Schema
Registry instances. | yes(if `url` is not provided) |
Review comment:
```suggestion
| urls | Array<String> | Specifies the URL endpoints of the multiple Schema
Registry instances. | yes (if `url` is not provided) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -432,7 +426,7 @@ This section describes the format of the `schemaRepository`
object for the `sche
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `avro_1124_rest_client`. | no |
+| type | String | `avro_1124_rest_client`| no |
| url | String | Specifies the endpoint url of your Avro-1124 schema
repository. | yes |
Review comment:
```suggestion
| url | String | Specifies the endpoint URL of your Avro-1124 schema
repository. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -442,7 +436,7 @@ For details, see the Schema Registry
[documentation](http://docs.confluent.io/cu
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_registry`. | no |
+| type | String | `schema_registry` | no |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
Review comment:
```suggestion
| url | String | Specifies the URL endpoint of the Schema Registry. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -442,7 +436,7 @@ For details, see the Schema Registry
[documentation](http://docs.confluent.io/cu
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_registry`. | no |
+| type | String | `schema_registry` | no |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
| capacity | Integer | Specifies the max size of the cache (default =
Integer.MAX_VALUE). | no |
| urls | Array<String> | Specifies the url endpoints of the multiple Schema
Registry instances. | yes(if `url` is not provided) |
Review comment:
```suggestion
| urls | Array<String> | Specifies the URL endpoints of the multiple Schema
Registry instances. | yes (if `url` is not provided) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,7 +498,7 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
+|type| String| `avro_ocf`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Avro records. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
|schema| JSON Object |Define a reader schema to be used when parsing Avro
records, this is useful when parsing multiple versions of Avro OCF file data |
no (default will decode using the writer schema contained in the OCF file) |
Review comment:
```suggestion
|schema| JSON Object |Define a reader schema to be used when parsing Avro
records. This is useful when parsing multiple versions of Avro OCF file data. |
no (default will decode using the writer schema contained in the OCF file) |
```
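To show what "reader schema" means here, a sketch of an `avro_ocf` `inputFormat` with an inline reader schema; the record name and fields are illustrative:

```json
{
  "inputFormat": {
    "type": "avro_ocf",
    "schema": {
      "type": "record",
      "name": "Event",
      "fields": [
        { "name": "timestamp", "type": "long" },
        { "name": "page", "type": "string" }
      ]
    }
  }
}
```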
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and
value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including
header, key, and value.
-```json
+> The Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns.
| no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's
timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no
(default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka
headers. Supports "string" types. Kafka header values are bytes, therefore the
parser decodes it as a UTF-8 encoded string. To change this
behavior,implementing your own parser based on the encoding style. You must
change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom
implementation. | no |
Review comment:
```suggestion
| headerFormat | Object | `headerFormat` specifies how to parse the Kafka
headers. Supports String types. Because Kafka header values are bytes, the
parser decodes them as UTF-8 encoded strings. To change this behavior,
implement your own parser based on the encoding style. Change the 'encoding'
type in `KafkaStringHeaderFormat` to match your custom implementation. | no |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including header, key, and value.
-```json
+> That Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns. | no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no (default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka headers. Supports "string" types. Kafka header values are bytes, therefore the parser decodes it as a UTF-8 encoded string. To change this behavior,implementing your own parser based on the encoding style. You must change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom implementation. | no |
+| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used to parse the kafka key. It only process the first entry of the input format. See [Specifying data format](../development/extensions-core/kafka-ingestion.md#specifying-data-format) for details. | no |
+| valueFormat | [InputFormat](#input-format) | valueFormat can be any existing inputFormat to parse the kafka value payload. See [Specifying data format](../development/extensions-core/kafka-ingestion.md#specifying-data-format) for details about specifying the input format. | yes |
Review comment:
```suggestion
| valueFormat | [InputFormat](#input-format) | `valueFormat` can be any existing `inputFormat` to parse the Kafka value payload. For details about specifying the input format, see [Specifying data format](../development/extensions-core/kafka-ingestion.md#specifying-data-format). | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including header, key, and value.
-```json
+> That Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns. | no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no (default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka headers. Supports "string" types. Kafka header values are bytes, therefore the parser decodes it as a UTF-8 encoded string. To change this behavior,implementing your own parser based on the encoding style. You must change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom implementation. | no |
+| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used to parse the kafka key. It only process the first entry of the input format. See [Specifying data format](../development/extensions-core/kafka-ingestion.md#specifying-data-format) for details. | no |
Review comment:
```suggestion
| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used to parse the Kafka key. It only processes the first entry of the input format. For details, see [Specifying data format](../development/extensions-core/kafka-ingestion.md#specifying-data-format). | no |
```
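For orientation, the Kafka `inputFormat` fields reviewed in the comments above could be assembled into a spec along these lines. This is a sketch using the documented defaults; the nested `valueFormat` of `json` and the `headerFormat` value are illustrative choices for the example, not content from this PR:

```json
{
  "type": "kafka",
  "headerLabelPrefix": "kafka.header.",
  "timestampColumnName": "kafka.timestamp",
  "keyColumnName": "kafka.key",
  "headerFormat": {
    "type": "string"
  },
  "valueFormat": {
    "type": "json"
  }
}
```

Here `headerFormat` follows the "string" type named in the table, and `valueFormat` nests an ordinary `inputFormat` for the record payload, as the `keyFormat`/`valueFormat` rows describe.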
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,9 +498,9 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro records. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-|schema| JSON Object |Define a reader schema to be used when parsing Avro records, this is useful when parsing multiple versions of Avro OCF file data | no (default will decode using the writer schema contained in the OCF file) |
+|type| String| `avro_ocf`| yes |
Review comment:
```suggestion
|type| String| Set value to `avro_ocf`| yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -553,7 +547,7 @@ Configure the Protobuf `inputFormat` to load Protobuf data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes |
+|type| String| `protobuf` | yes |
Review comment:
```suggestion
|type| String| Set value to `protobuf` | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -718,8 +712,8 @@ The `inputFormat` of `inputSpec` in `ioConfig` must be set
to `"org.apache.orc.m
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-|type | String | This should say `orc` | yes|
-|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data (`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format) | yes|
+|type | String|`orc`| yes|
Review comment:
```suggestion
| type | String | Set value to `orc` | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -959,8 +953,8 @@ JSON path expressions for all supported types.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-| type | String | This should say `parquet`.| yes |
-| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims` and `parquet` | yes |
+| type | String | `parquet`| yes |
Review comment:
```suggestion
| type | String | Set value to `parquet`| yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1109,7 +1103,7 @@ Note that the `int96` Parquet value type is not supported
with this parser.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-| type | String | This should say `parquet-avro`. | yes |
+| type | String | `parquet-avro` | yes |
Review comment:
```suggestion
| type | String | Set value to `parquet-avro` | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -297,8 +291,8 @@ Configure the Avro `inputFormat` to load Avro data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_stream` to read Avro serialized data| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
+|type| String| `avro_stream`| yes |
Review comment:
```suggestion
|type| String| Set value to `avro_stream`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,9 +498,9 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro records. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-|schema| JSON Object |Define a reader schema to be used when parsing Avro records, this is useful when parsing multiple versions of Avro OCF file data | no (default will decode using the writer schema contained in the OCF file) |
+|type| String| `avro_ocf`| yes |
Review comment:
```suggestion
|type| String| Set value to `avro_ocf`. | yes |
```
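As context for the `avro_ocf` rows quoted above, a minimal spec combining the `flattenSpec` and reader `schema` fields might look like the following sketch. The field name and `path` expression are hypothetical placeholders, and the `schema` body is elided rather than invented:

```json
{
  "type": "avro_ocf",
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      {
        "type": "path",
        "name": "someLeaf",
        "expr": "$.someRecord.someLeaf"
      }
    ]
  }
}
```

Omitting `schema`, as here, falls back to the writer schema contained in the OCF file, per the table's default.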
##########
File path: docs/ingestion/data-formats.md
##########
@@ -553,7 +547,7 @@ Configure the Protobuf `inputFormat` to load Protobuf data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes |
+|type| String| `protobuf` | yes |
Review comment:
```suggestion
|type| String| Set value to `protobuf`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -718,8 +712,8 @@ The `inputFormat` of `inputSpec` in `ioConfig` must be set
to `"org.apache.orc.m
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-|type | String | This should say `orc` | yes|
-|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data (`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format) | yes|
+|type | String|`orc`| yes|
Review comment:
```suggestion
| type | String | Set value to `orc`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -959,8 +953,8 @@ JSON path expressions for all supported types.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-| type | String | This should say `parquet`.| yes |
-| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims` and `parquet` | yes |
+| type | String | `parquet`| yes |
Review comment:
```suggestion
| type | String | Set value to `parquet`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1109,7 +1103,7 @@ Note that the `int96` Parquet value type is not supported
with this parser.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-| type | String | This should say `parquet-avro`. | yes |
+| type | String | `parquet-avro` | yes |
Review comment:
```suggestion
| type | String | Set value to `parquet-avro`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1182,7 +1176,7 @@ This parser is for [stream
ingestion](./index.md#streaming) and reads Avro data
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `avro_stream`. | no |
+| type | String | Set value to`avro_stream`. | no |
Review comment:
```suggestion
| type | String | Set value to `avro_stream`. | no |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -262,8 +256,8 @@ Configure the Parquet `inputFormat` to load Parquet data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
+|type| String| `parquet`| yes |
Review comment:
```suggestion
|type| String| Set value to `parquet`.| yes |
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]