ektravel commented on a change in pull request #11796:
URL: https://github.com/apache/druid/pull/11796#discussion_r729879169
##########
File path: docs/ingestion/data-formats.md
##########
@@ -67,12 +67,12 @@ Note that the CSV and TSV data do not contain column heads.
This becomes importa
Besides text formats, Druid also supports binary formats such as [Orc](#orc)
and [Parquet](#parquet) formats.
-## Custom Formats
+## Custom formats
Druid supports custom data formats and can use the `Regex` parser or the
`JavaScript` parsers to parse these formats. Please note that using any of
these parsers for
Review comment:
```suggestion
Druid supports custom data formats and can use the Regex parser or the
JavaScript parsers to parse these formats. Using any of these parsers for
```
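For readers of this section, a Regex parser spec might look like the following sketch; the pattern, column names, and specs are made up for illustration and are not from the PR:

```json
{
  "type": "string",
  "parseSpec": {
    "format": "regex",
    "pattern": "^(\\d+)\\t(\\w+)$",
    "columns": ["ts", "page"],
    "timestampSpec": { "column": "ts", "format": "millis" },
    "dimensionsSpec": { "dimensions": ["page"] }
  }
}
```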
##########
File path: docs/ingestion/data-formats.md
##########
@@ -67,12 +67,12 @@ Note that the CSV and TSV data do not contain column heads.
This becomes importa
Besides text formats, Druid also supports binary formats such as [Orc](#orc)
and [Parquet](#parquet) formats.
-## Custom Formats
+## Custom formats
Druid supports custom data formats and can use the `Regex` parser or the
`JavaScript` parsers to parse these formats. Please note that using any of
these parsers for
parsing data will not be as efficient as writing a native Java parser or using
an external stream processor. We welcome contributions of new Parsers.
Review comment:
```suggestion
parsing data is less efficient than writing a native Java parser or using an
external stream processor. We welcome contributions of new parsers.
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -87,7 +87,7 @@ Configure the JSON `inputFormat` to load JSON data as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `json`. | yes |
+| type | String | `json`| yes |
Review comment:
```suggestion
| type | String | JSON Object | yes |
```
Changed to JSON Object for consistency with the other lines.
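For context, the minimal JSON `inputFormat` this table row describes is just:

```json
{
  "inputFormat": {
    "type": "json"
  }
}
```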
##########
File path: docs/ingestion/data-formats.md
##########
@@ -107,7 +107,7 @@ Configure the CSV `inputFormat` to load CSV data as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `csv`. | yes |
+| type | String | `csv` | yes |
Review comment:
```suggestion
| type | String | CSV | yes |
```
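For context, a minimal CSV `inputFormat` for this row might look like the sketch below; the column names are illustrative, not from the PR:

```json
{
  "inputFormat": {
    "type": "csv",
    "findColumnsFromHeader": false,
    "columns": ["timestamp", "page", "count"]
  }
}
```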
##########
File path: docs/ingestion/data-formats.md
##########
@@ -130,7 +130,7 @@ Configure the TSV `inputFormat` to load TSV data as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `tsv`. | yes |
+| type | String | `tsv`| yes |
Review comment:
```suggestion
| type | String | TSV | yes |
```
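Likewise, a minimal TSV `inputFormat` sketch for this row (column names illustrative):

```json
{
  "inputFormat": {
    "type": "tsv",
    "delimiter": "\t",
    "columns": ["timestamp", "page", "count"]
  }
}
```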
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and
value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including
header, key, and value.
-```json
+> The Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns.
| no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's
timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no
(default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka
headers. Supports "string" types. Kafka header values are bytes, therefore the
parser decodes it as a UTF-8 encoded string. To change this
behavior,implementing your own parser based on the encoding style. You must
change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom
implementation. | no |
Review comment:
```suggestion
| headerFormat | Object | `headerFormat` specifies how to parse the kafka
headers. Supports "string" types. Because kafka header values are bytes, the
parser decodes them as UTF-8 encoded strings. To change this behavior,
implement your own parser based on the encoding style. Change the 'encoding'
type in `KafkaStringHeaderFormat` to match your custom implementation. | no |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and
value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including
header, key, and value.
-```json
+> The Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns.
| no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's
timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no
(default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka
headers. Supports "string" types. Kafka header values are bytes, therefore the
parser decodes it as a UTF-8 encoded string. To change this
behavior,implementing your own parser based on the encoding style. You must
change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom
implementation. | no |
+| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used
to parse the kafka key. It only process the first entry of the input format.
See [Specifying data
format](../development/extensions-core/kafka-ingestion.md#specifying-data-format)
for details. | no |
Review comment:
```suggestion
| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used
to parse the kafka key. It only processes the first entry of the input format.
For details, see [Specifying data
format](../development/extensions-core/kafka-ingestion.md#specifying-data-format).
| no |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and
value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including
header, key, and value.
-```json
+> The Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns.
| no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's
timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no
(default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka
headers. Supports "string" types. Kafka header values are bytes, therefore the
parser decodes it as a UTF-8 encoded string. To change this
behavior,implementing your own parser based on the encoding style. You must
change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom
implementation. | no |
+| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used
to parse the kafka key. It only process the first entry of the input format.
See [Specifying data
format](../development/extensions-core/kafka-ingestion.md#specifying-data-format)
for details. | no |
+| valueFormat | [InputFormat](#input-format) | valueFormat can be any existing
inputFormat to parse the kafka value payload. See [Specifying data
format](../development/extensions-core/kafka-ingestion.md#specifying-data-format)
for details about specifying the input format. | yes |
Review comment:
```suggestion
| valueFormat | [InputFormat](#input-format) | `valueFormat` can be any
existing `inputFormat` to parse the kafka value payload. For details about
specifying the input format, see [Specifying data
format](../development/extensions-core/kafka-ingestion.md#specifying-data-format).
| yes |
```
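Pulling the field table above together, a Kafka `inputFormat` spec using only the documented fields might look like this sketch (the `json` key/value formats are an illustrative choice):

```json
{
  "inputFormat": {
    "type": "kafka",
    "headerLabelPrefix": "kafka.header.",
    "timestampColumnName": "kafka.timestamp",
    "keyColumnName": "kafka.key",
    "headerFormat": { "type": "string" },
    "keyFormat": { "type": "json" },
    "valueFormat": { "type": "json" }
  }
}
```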
##########
File path: docs/ingestion/data-formats.md
##########
@@ -179,47 +192,28 @@ The `inputFormat` to load complete kafka record including
header, key and value.
}
```
-The KAFKA `inputFormat` has the following components:
-
-> Note that KAFKA inputFormat is currently designated as experimental.
-
-| Field | Type | Description | Required |
-|-------|------|-------------|----------|
-| type | String | This should say `kafka`. | yes |
-| headerLabelPrefix | String | A custom label prefix for all the header
columns. | no (default = "kafka.header.") |
-| timestampColumnName | String | Specifies the name of the column for the
kafka record's timestamp.| no (default = "kafka.timestamp") |
-| keyColumnName | String | Specifies the name of the column for the kafka
record's key.| no (default = "kafka.key") |
-| headerFormat | Object | headerFormat specifies how to parse the kafka
headers. Current supported type is "string". Since header values are bytes, the
current parser by defaults reads it as UTF-8 encoded strings. There is
flexibility to change this behavior by implementing your very own parser based
on the encoding style. The 'encoding' type in KafkaStringHeaderFormat class
needs to change with the custom implementation. | no |
-| keyFormat | [InputFormat](#input-format) | keyFormat can be any existing
inputFormat to parse the kafka key. The current behavior is to only process the
first entry of the input format. See [the below
section](../development/extensions-core/kafka-ingestion.md#specifying-data-format)
for details about specifying the input format. | no |
-| valueFormat | [InputFormat](#input-format) | valueFormat can be any existing
inputFormat to parse the kafka value payload. See [the below
section](../development/extensions-core/kafka-ingestion.md#specifying-data-format)
for details about specifying the input format. | yes |
+Note the following behaviors:
+- If there are conflicts between column names, Druid uses the column names
from the payload and ignores the column name from the header or key. This
behavior makes it easier to migrate to the Kafka `inputFormat` from another
Kafka ingestion spec without losing data.
+- The Kafka input format fundamentally blends information from the header,
key, and value objects from a Kafka record to create a row in Druid. It
extracts individual records from the value. Then it augments each value with
the corresponding key or header columns.
+- The Kafka input format by default exposes Kafka timestamp
`timestampColumnName` to make it available for use as the primary timestamp
column. Alternatively you can choose timestamp column from either the key or
value payload.
+For example, the following `timestampSpec` uses the default Kafka timestamp
from the Kafka record:
```
-> For any conflicts in dimension/metric names, this inputFormat will prefer
kafka value's column names.
-> This will enable seemless porting of existing kafka ingestion inputFormat to
this new format, with additional columns from kafka header and key.
-
-> Kafka input format fundamentally blends information from header, key and
value portions of a kafka record to create a druid row. It does this by
-> exploding individual records from the value and augmenting each of these
values with the selected key/header columns.
-
-> Kafka input format also by default exposes kafka timestamp
(timestampColumnName), which can be used as the primary timestamp column.
-> One can also choose timestamp column from either key or value payload, if
there is no timestamp available then the default kafka timestamp is our savior.
-> eg.,
-
- // Below timestampSpec chooses kafka's default timestamp that is available
in kafka record
"timestampSpec":
{
"column": "kafka.timestamp",
"format": "millis"
}
+```
- // Assuming there is a timestamp field in the header and we have
"kafka.header." as a desired prefix for header columns,
- // below example chooses header's timestamp as a primary timestamp column
+If you are using "kafka.header." as the prefix for Kafka header columns and
there is a timestamp field in the header, the following example uses the header
timestamp as the primary timestamp column:
Review comment:
```suggestion
If you are using "kafka.header." as the prefix for Kafka header columns and
there is a timestamp field in the header, the header timestamp serves as the
primary timestamp column. For example:
```
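Following the pattern of the `kafka.timestamp` example in the diff, the header-based `timestampSpec` described here might look like this; the header field name `timestamp` and the `millis` format are assumptions for illustration:

```json
"timestampSpec": {
  "column": "kafka.header.timestamp",
  "format": "millis"
}
```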
##########
File path: docs/ingestion/data-formats.md
##########
@@ -262,7 +256,7 @@ Configure the Parquet `inputFormat` to load Parquet data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
+|type| String| `parquet`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Parquet file. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object | Defines a [`flattenSpec`](#flattenspec) to
extract nested values from a Parquet file. Note that only 'path' expressions
are supported ('jq' is unavailable).| no (default will auto-discover 'root'
level properties) |
```
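Since this row keeps referring to 'path' expressions, a `flattenSpec` sketch may help readers; the field name and path are made up for illustration:

```json
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "nested_value", "expr": "$.path.to.value" }
  ]
}
```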
##########
File path: docs/ingestion/data-formats.md
##########
@@ -262,7 +256,7 @@ Configure the Parquet `inputFormat` to load Parquet data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
+|type| String| `parquet`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Parquet file. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object | Defines a [`flattenSpec`](#flattenspec) to
extract nested values from a Parquet file. Only 'path' expressions are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -262,7 +256,7 @@ Configure the Parquet `inputFormat` to load Parquet data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
+|type| String| `parquet`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Parquet file. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object | Define a [`flattenSpec`](#flattenspec) to
extract nested values from a Parquet file. Only 'path' expressions are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -297,7 +291,7 @@ Configure the Avro `inputFormat` to load Avro data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_stream` to read Avro serialized
data| yes |
+|type| String| `avro_stream`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Avro record. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from an Avro record. Only 'path' expressions are supported ('jq'
is unavailable).| no (default will auto-discover 'root' level properties) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -412,7 +406,7 @@ This Avro bytes decoder first extracts `subject` and `id`
from the input message
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_repo`. | no |
+| type | String | `schema_repo` | no |
| subjectAndIdConverter | JSON Object | Specifies how to extract the subject
and id from message bytes. | yes |
Review comment:
```suggestion
| subjectAndIdConverter | JSON Object | Specifies how to extract the subject
and ID from message bytes. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -442,7 +436,7 @@ For details, see the Schema Registry
[documentation](http://docs.confluent.io/cu
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_registry`. | no |
+| type | String | `schema_registry` | no |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
| capacity | Integer | Specifies the max size of the cache (default =
Integer.MAX_VALUE). | no |
| urls | Array<String> | Specifies the url endpoints of the multiple Schema
Registry instances. | yes(if `url` is not provided) |
Review comment:
```suggestion
| urls | Array<String> | Specifies the url endpoints of the multiple Schema
Registry instances. | yes (if `url` is not provided) |
```
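A sketch of the multi-instance Schema Registry decoder this table describes, using only the documented fields (`urls`, `capacity`); hostnames are illustrative:

```json
"avroBytesDecoder": {
  "type": "schema_registry",
  "urls": ["http://registry-1:8081", "http://registry-2:8081"],
  "capacity": 100
}
```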
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,7 +498,7 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
+|type| String| `avro_ocf`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Avro records. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from Avro records. Note that only 'path' expressions are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,7 +498,7 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
+|type| String| `avro_ocf`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Avro records. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
Review comment:
```suggestion
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from Avro records. Only 'path' expressions are supported ('jq' is
unavailable).| no (default will auto-discover 'root' level properties) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,7 +498,7 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
+|type| String| `avro_ocf`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Avro records. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
|schema| JSON Object |Define a reader schema to be used when parsing Avro
records, this is useful when parsing multiple versions of Avro OCF file data |
no (default will decode using the writer schema contained in the OCF file) |
Review comment:
```suggestion
|schema| JSON Object |Define a reader schema to be used when parsing Avro
records. This is useful when parsing multiple versions of Avro OCF file data. |
no (default will decode using the writer schema contained in the OCF file) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -645,7 +639,7 @@ Each line can be further parsed using
[`parseSpec`](#parsespec).
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `string` in general, or `hadoopyString` when
used in a Hadoop indexing job. | yes |
+| type | String | `string` for most cases. `hadoopyString` for Hadoop indexing
| yes |
Review comment:
```suggestion
| type | String | `string` for most cases. `hadoopyString` for Hadoop
indexing. | yes |
```
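A sketch of the Hadoop variant this row distinguishes; the nested `parseSpec` values are illustrative, not from the PR:

```json
{
  "type": "hadoopyString",
  "parseSpec": {
    "format": "json",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```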
##########
File path: docs/ingestion/data-formats.md
##########
@@ -959,7 +953,7 @@ JSON path expressions for all supported types.
|Field | Type | Description
| Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-| type | String | This should say `parquet`.| yes |
+| type | String | `parquet`| yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims`
and `parquet` | yes |
Review comment:
```suggestion
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims`
and `parquet`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -718,7 +712,7 @@ The `inputFormat` of `inputSpec` in `ioConfig` must be set
to `"org.apache.orc.m
|Field | Type | Description
| Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-|type | String | This should say `orc`
| yes|
+|type | String|`orc`| yes|
|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data
(`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format) | yes|
Review comment:
```suggestion
|parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data (`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format). | yes|
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1222,7 +1216,7 @@ This parser is for [stream
ingestion](./index.md#streaming) and reads Protocol b
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `protobuf`. | yes |
+| type | String | `protobuf` | yes |
| `protoBytesDecoder` | JSON Object | Specifies how to decode bytes to
Protobuf record. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data. The format must be JSON. See [JSON ParseSpec](#json-parsespec) for more
configuration options. Note that timeAndDims parseSpec is no longer supported.
| yes |
Review comment:
```suggestion
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data. The format must be JSON. See [JSON ParseSpec](#json-parsespec) for more
configuration options. `timeAndDims` `parseSpec` is no longer supported. | yes
|
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1222,7 +1216,7 @@ This parser is for [stream
ingestion](./index.md#streaming) and reads Protocol b
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `protobuf`. | yes |
+| type | String | `protobuf` | yes |
| `protoBytesDecoder` | JSON Object | Specifies how to decode bytes to
Protobuf record. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data. The format must be JSON. See [JSON ParseSpec](#json-parsespec) for more
configuration options. Note that timeAndDims parseSpec is no longer supported.
| yes |
Review comment:
```suggestion
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the
data. The format must be JSON. See [JSON ParseSpec](#json-parsespec) for more
configuration options. Note that `timeAndDims` `parseSpec` is no longer
supported. | yes |
```
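Combining this table with the `file` decoder table further down, a Protobuf parser spec might look like the sketch below; the descriptor path and message type are made up for illustration:

```json
{
  "type": "protobuf",
  "protoBytesDecoder": {
    "type": "file",
    "descriptor": "file:///tmp/metrics.desc",
    "protoMessageType": "Metrics"
  },
  "parseSpec": {
    "format": "json",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```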
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1273,7 +1267,7 @@ This Protobuf bytes decoder first read a descriptor file,
and then parse it to g
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `file`. | yes |
+| type | String | `file` | yes |
| descriptor | String | Protobuf descriptor file name in the classpath or URL.
| yes |
| protoMessageType | String | Protobuf message type in the descriptor. Both
short name and fully qualified name are accepted. The parser uses the first
message type found in the descriptor if not specified. | no |
Review comment:
```suggestion
| protoMessageType | String | Protobuf message type in the descriptor. Both
short name and fully qualified name are accepted. The parser uses the first
message type found in the descriptor if not specified. | no |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1294,7 +1288,7 @@ For details, see the Schema Registry
[documentation](http://docs.confluent.io/cu
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_registry`. | yes |
+| type | String | `schema_registry`| yes |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
Review comment:
```suggestion
| url | String | Specifies the URL endpoint of the Schema Registry. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1294,7 +1288,7 @@ For details, see the Schema Registry
[documentation](http://docs.confluent.io/cu
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_registry`. | yes |
+| type | String | `schema_registry`| yes |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
| capacity | Integer | Specifies the max size of the cache (default =
Integer.MAX_VALUE). | no |
| urls | Array<String> | Specifies the url endpoints of the multiple Schema
Registry instances. | yes(if `url` is not provided) |
Review comment:
```suggestion
| urls | Array<String> | Specifies the URL endpoints of the multiple Schema
Registry instances. | yes (if `url` is not provided) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -432,7 +426,7 @@ This section describes the format of the `schemaRepository`
object for the `sche
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `avro_1124_rest_client`. | no |
+| type | String | `avro_1124_rest_client`| no |
| url | String | Specifies the endpoint url of your Avro-1124 schema
repository. | yes |
Review comment:
```suggestion
| url | String | Specifies the endpoint URL of your Avro-1124 schema
repository. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -442,7 +436,7 @@ For details, see the Schema Registry
[documentation](http://docs.confluent.io/cu
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_registry`. | no |
+| type | String | `schema_registry` | no |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
Review comment:
```suggestion
| url | String | Specifies the URL endpoint of the Schema Registry. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -442,7 +436,7 @@ For details, see the Schema Registry
[documentation](http://docs.confluent.io/cu
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `schema_registry`. | no |
+| type | String | `schema_registry` | no |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
| capacity | Integer | Specifies the max size of the cache (default =
Integer.MAX_VALUE). | no |
| urls | Array<String> | Specifies the url endpoints of the multiple Schema
Registry instances. | yes(if `url` is not provided) |
Review comment:
```suggestion
| urls | Array<String> | Specifies the URL endpoints of the multiple Schema
Registry instances. | yes (if `url` is not provided) |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,7 +498,7 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
+|type| String| `avro_ocf`| yes |
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract
nested values from a Avro records. Note that only 'path' expression are
supported ('jq' is unavailable).| no (default will auto-discover 'root' level
properties) |
|schema| JSON Object |Define a reader schema to be used when parsing Avro
records, this is useful when parsing multiple versions of Avro OCF file data |
no (default will decode using the writer schema contained in the OCF file) |
Review comment:
```suggestion
|schema| JSON Object |Define a reader schema to be used when parsing Avro
records. This is useful when parsing multiple versions of Avro OCF file data. |
no (default will decode using the writer schema contained in the OCF file) |
```
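To show what "reader schema" means here, a sketch of an `avro_ocf` `inputFormat` with an inline reader schema; the record name and fields are illustrative:

```json
{
  "inputFormat": {
    "type": "avro_ocf",
    "schema": {
      "type": "record",
      "name": "Event",
      "fields": [
        { "name": "timestamp", "type": "long" },
        { "name": "page", "type": "string" }
      ]
    }
  }
}
```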
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and
value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including
header, key, and value.
-```json
+> The Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns.
| no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's
timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no
(default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka
headers. Supports "string" types. Kafka header values are bytes, therefore the
parser decodes it as a UTF-8 encoded string. To change this
behavior,implementing your own parser based on the encoding style. You must
change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom
implementation. | no |
Review comment:
```suggestion
| headerFormat | Object | `headerFormat` specifies how to parse the Kafka
headers. Supports String types. Because Kafka header values are bytes, the
parser decodes them as UTF-8 encoded strings. To change this behavior,
implement your own parser based on the encoding style. Change the 'encoding'
type in `KafkaStringHeaderFormat` to match your custom implementation. | no |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including header, key, and value.
-```json
+> That Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns. | no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no (default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka headers. Supports "string" types. Kafka header values are bytes, therefore the parser decodes it as a UTF-8 encoded string. To change this behavior,implementing your own parser based on the encoding style. You must change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom implementation. | no |
+| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used to parse the kafka key. It only process the first entry of the input format. See [Specifying data format](../development/extensions-core/kafka-ingestion.md#specifying-data-format) for details. | no |
+| valueFormat | [InputFormat](#input-format) | valueFormat can be any existing inputFormat to parse the kafka value payload. See [Specifying data format](../development/extensions-core/kafka-ingestion.md#specifying-data-format) for details about specifying the input format. | yes |
Review comment:
```suggestion
| valueFormat | [InputFormat](#input-format) | `valueFormat` can be any existing `inputFormat` to parse the Kafka value payload. For details about specifying the input format, see [Specifying data format](../development/extensions-core/kafka-ingestion.md#specifying-data-format). | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -151,11 +151,24 @@ Be sure to change the `delimiter` to the appropriate
delimiter for your data. Li
}
```
-### KAFKA
+### Kafka
-The `inputFormat` to load complete kafka record including header, key and value. An example is:
+Configure the Kafka `inputFormat` to load complete kafka records including header, key, and value.
-```json
+> That Kafka inputFormat is currently designated as experimental.
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+| type | String | `kafka`| yes |
+| headerLabelPrefix | String | Custom label prefix for all the header columns. | no (default = "kafka.header.") |
+| timestampColumnName | String | Name of the column for the kafka record's timestamp.| no (default = "kafka.timestamp") |
+| keyColumnName | String | Name of the column for the kafka record's key.| no (default = "kafka.key") |
+| headerFormat | Object | `headerFormat` specifies how to parse the kafka headers. Supports "string" types. Kafka header values are bytes, therefore the parser decodes it as a UTF-8 encoded string. To change this behavior,implementing your own parser based on the encoding style. You must change the 'encoding' type in `KafkaStringHeaderFormat` to match your custom implementation. | no |
+| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used to parse the kafka key. It only process the first entry of the input format. See [Specifying data format](../development/extensions-core/kafka-ingestion.md#specifying-data-format) for details. | no |
Review comment:
```suggestion
| keyFormat | [InputFormat](#input-format) | Any existing `inputFormat` used to parse the Kafka key. It only processes the first entry of the input format. For details, see [Specifying data format](../development/extensions-core/kafka-ingestion.md#specifying-data-format). | no |
```
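For orientation, the Kafka `inputFormat` fields reviewed in the comments above could be assembled into a spec along these lines. This is a sketch using the documented defaults; the nested `valueFormat` of `json` and the `headerFormat` value are illustrative choices for the example, not content from this PR:

```json
{
  "type": "kafka",
  "headerLabelPrefix": "kafka.header.",
  "timestampColumnName": "kafka.timestamp",
  "keyColumnName": "kafka.key",
  "headerFormat": {
    "type": "string"
  },
  "valueFormat": {
    "type": "json"
  }
}
```

Here `headerFormat` follows the "string" type named in the table, and `valueFormat` nests an ordinary `inputFormat` for the record payload, as the `keyFormat`/`valueFormat` rows describe.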
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,9 +498,9 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro records. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-|schema| JSON Object |Define a reader schema to be used when parsing Avro records, this is useful when parsing multiple versions of Avro OCF file data | no (default will decode using the writer schema contained in the OCF file) |
+|type| String| `avro_ocf`| yes |
Review comment:
```suggestion
|type| String| Set value to `avro_ocf`| yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -553,7 +547,7 @@ Configure the Protobuf `inputFormat` to load Protobuf data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes |
+|type| String| `protobuf` | yes |
Review comment:
```suggestion
|type| String| Set value to `protobuf` | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -718,8 +712,8 @@ The `inputFormat` of `inputSpec` in `ioConfig` must be set
to `"org.apache.orc.m
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-|type | String | This should say `orc` | yes|
-|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data (`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format) | yes|
+|type | String|`orc`| yes|
Review comment:
```suggestion
| type | String | Set value to `orc` | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -959,8 +953,8 @@ JSON path expressions for all supported types.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-| type | String | This should say `parquet`.| yes |
-| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims` and `parquet` | yes |
+| type | String | `parquet`| yes |
Review comment:
```suggestion
| type | String | Set value to `parquet`| yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1109,7 +1103,7 @@ Note that the `int96` Parquet value type is not supported
with this parser.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-| type | String | This should say `parquet-avro`. | yes |
+| type | String | `parquet-avro` | yes |
Review comment:
```suggestion
| type | String | Set value to `parquet-avro` | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -297,8 +291,8 @@ Configure the Avro `inputFormat` to load Avro data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_stream` to read Avro serialized data| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
+|type| String| `avro_stream`| yes |
Review comment:
```suggestion
|type| String| Set value to `avro_stream`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -504,9 +498,9 @@ Configure the Avro OCF `inputFormat` to load Avro OCF data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro records. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-|schema| JSON Object |Define a reader schema to be used when parsing Avro records, this is useful when parsing multiple versions of Avro OCF file data | no (default will decode using the writer schema contained in the OCF file) |
+|type| String| `avro_ocf`| yes |
Review comment:
```suggestion
|type| String| Set value to `avro_ocf`. | yes |
```
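As context for the `avro_ocf` rows quoted above, a minimal spec combining the `flattenSpec` and reader `schema` fields might look like the following sketch. The field name and `path` expression are hypothetical placeholders, and the `schema` body is elided rather than invented:

```json
{
  "type": "avro_ocf",
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      {
        "type": "path",
        "name": "someLeaf",
        "expr": "$.someRecord.someLeaf"
      }
    ]
  }
}
```

Omitting `schema`, as here, falls back to the writer schema contained in the OCF file, per the table's default.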
##########
File path: docs/ingestion/data-formats.md
##########
@@ -553,7 +547,7 @@ Configure the Protobuf `inputFormat` to load Protobuf data
as follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes |
+|type| String| `protobuf` | yes |
Review comment:
```suggestion
|type| String| Set value to `protobuf`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -718,8 +712,8 @@ The `inputFormat` of `inputSpec` in `ioConfig` must be set
to `"org.apache.orc.m
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-|type | String | This should say `orc` | yes|
-|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data (`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format) | yes|
+|type | String|`orc`| yes|
Review comment:
```suggestion
| type | String | Set value to `orc`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -959,8 +953,8 @@ JSON path expressions for all supported types.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-| type | String | This should say `parquet`.| yes |
-| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims` and `parquet` | yes |
+| type | String | `parquet`| yes |
Review comment:
```suggestion
| type | String | Set value to `parquet`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1109,7 +1103,7 @@ Note that the `int96` Parquet value type is not supported
with this parser.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
-| type | String | This should say `parquet-avro`. | yes |
+| type | String | `parquet-avro` | yes |
Review comment:
```suggestion
| type | String | Set value to `parquet-avro`. | yes |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -1182,7 +1176,7 @@ This parser is for [stream
ingestion](./index.md#streaming) and reads Avro data
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-| type | String | This should say `avro_stream`. | no |
+| type | String | Set value to`avro_stream`. | no |
Review comment:
```suggestion
| type | String | Set value to `avro_stream`. | no |
```
##########
File path: docs/ingestion/data-formats.md
##########
@@ -262,8 +256,8 @@ Configure the Parquet `inputFormat` to load Parquet data as
follows:
| Field | Type | Description | Required |
|-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
+|type| String| `parquet`| yes |
Review comment:
```suggestion
|type| String| Set value to `parquet`.| yes |
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]