techdocsmith commented on code in PR #14529: URL: https://github.com/apache/druid/pull/14529#discussion_r1273897060
########## docs/development/extensions-core/kinesis-ingestion.md: ########## @@ -23,154 +23,259 @@ sidebar_label: "Amazon Kinesis" ~ under the License. --> -When you enable the Kinesis indexing service, you can configure *supervisors* on the Overlord to manage the creation and lifetime of Kinesis indexing tasks. These indexing tasks read events using Kinesis' own shard and sequence number mechanism to guarantee exactly-once ingestion. The supervisor oversees the state of the indexing tasks to: +When you enable the Kinesis indexing service, you can configure supervisors on the Overlord to manage the creation and lifetime of Kinesis indexing tasks. These indexing tasks read events using Kinesis' own shard and sequence number mechanism to guarantee exactly-once ingestion. The supervisor oversees the state of the indexing tasks to coordinate handoffs, manage failures, and ensure that scalability and replication requirements are maintained. -- coordinate handoffs -- manage failures -- ensure that scalability and replication requirements are maintained. +This topic contains configuration reference information for the Kinesis indexing service supervisor for Apache Druid. -To use the Kinesis indexing service, load the `druid-kinesis-indexing-service` core Apache Druid extension (see -[Including Extensions](../../configuration/extensions.md#loading-extensions)). +## Setup -> Before you deploy the Kinesis extension to production, read the [Kinesis known issues](#kinesis-known-issues). +To use the Kinesis indexing service, you must first load the `druid-kinesis-indexing-service` core extension on both the Overlord and the Middle Manager. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information. +We recommend that you review the [Kinesis known issues](#kinesis-known-issues) before deploying the `druid-kinesis-indexing-service` extension to production. -## Submitting a Supervisor Spec +## Supervisor spec -To use the Kinesis indexing service, load the `druid-kinesis-indexing-service` extension on both the Overlord and the MiddleManagers. Druid starts a supervisor for a dataSource when you submit a supervisor spec. Submit your supervisor spec to the following endpoint: +The following table outlines the high-level configuration options for the Kinesis supervisor object. +See [Supervisor API](../../api-reference/supervisor-api.md) for more information. -`http://<OVERLORD_IP>:<OVERLORD_PORT>/druid/indexer/v1/supervisor` +|Property|Type|Description|Required| +|--------|----|-----------|--------| +|`type`|String|The supervisor type; this should always be `kinesis`.|Yes| +|`spec`|Object|The container object for the supervisor configuration.|Yes| +|`ioConfig`|Object|The [I/O configuration](#supervisor-io-configuration) object for configuring Kafka connection and I/O-related settings for the supervisor and indexing task.|Yes| +|`dataSchema`|Object|The schema used by the Kinesis indexing task during ingestion. See [`dataSchema`](../../ingestion/ingestion-spec.md#dataschema) for more information.|Yes| +|`tuningConfig`|Object|The [tuning configuration](#supervisor-tuning-configuration) object for configuring performance-related settings for the supervisor and indexing tasks.|No| -For example: +Druid starts a new supervisor when you define a supervisor spec. +To create a supervisor, send a `POST` request to the `/druid/indexer/v1/supervisor` endpoint. +Once created, the supervisor persists in the configured metadata database. There can only be a single supervisor per datasource, and submitting a second spec for the same datasource overwrites the previous one. -```sh -curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http://localhost:8090/druid/indexer/v1/supervisor -``` +When an Overlord gains leadership, either by being started or as a result of another Overlord failing, it spawns +a supervisor for each supervisor spec in the metadata database. The supervisor then discovers running Kinesis indexing +tasks and attempts to adopt them if they are compatible with the supervisor's configuration. If they are not +compatible because they have a different ingestion spec or shard allocation, the tasks are killed and the +supervisor creates a new set of tasks. In this way, the supervisors persist across Overlord restarts and failovers. -Where the file `supervisor-spec.json` contains a Kinesis supervisor spec: +The following example shows how to submit a supervisor spec for a stream with the name `KinesisStream`. +In this example, `http://SERVICE_IP:SERVICE_PORT` is a placeholder for the server address of deployment and the service port. -```json -{ +<!--DOCUSAURUS_CODE_TABS--> + +<!--cURL--> +```shell +curl -X POST "http://SERVICE_IP:SERVICE_PORT/druid/indexer/v1/supervisor" \ Review Comment: normally we're running all HTTP API calls through the router. Retest it with this : `curl -X POST "http://SERVICE_IP:SERVICE_PORT/druid/router/v1/supervisor"` ` ########## docs/development/extensions-core/kinesis-ingestion.md: ########## @@ -23,154 +23,259 @@ sidebar_label: "Amazon Kinesis" ~ under the License. --> -When you enable the Kinesis indexing service, you can configure *supervisors* on the Overlord to manage the creation and lifetime of Kinesis indexing tasks. These indexing tasks read events using Kinesis' own shard and sequence number mechanism to guarantee exactly-once ingestion. The supervisor oversees the state of the indexing tasks to: +When you enable the Kinesis indexing service, you can configure supervisors on the Overlord to manage the creation and lifetime of Kinesis indexing tasks. These indexing tasks read events using Kinesis' own shard and sequence number mechanism to guarantee exactly-once ingestion. The supervisor oversees the state of the indexing tasks to coordinate handoffs, manage failures, and ensure that scalability and replication requirements are maintained. -- coordinate handoffs -- manage failures -- ensure that scalability and replication requirements are maintained. +This topic contains configuration reference information for the Kinesis indexing service supervisor for Apache Druid. -To use the Kinesis indexing service, load the `druid-kinesis-indexing-service` core Apache Druid extension (see -[Including Extensions](../../configuration/extensions.md#loading-extensions)). +## Setup -> Before you deploy the Kinesis extension to production, read the [Kinesis known issues](#kinesis-known-issues). +To use the Kinesis indexing service, you must first load the `druid-kinesis-indexing-service` core extension on both the Overlord and the Middle Manager. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information. +We recommend that you review the [Kinesis known issues](#kinesis-known-issues) before deploying the `druid-kinesis-indexing-service` extension to production. Review Comment: ```suggestion Review the [Kinesis known issues](#kinesis-known-issues) before deploying the `druid-kinesis-indexing-service` extension to production. ``` ########## docs/development/extensions-core/kinesis-ingestion.md: ########## @@ -23,154 +23,259 @@ sidebar_label: "Amazon Kinesis" ~ under the License. --> -When you enable the Kinesis indexing service, you can configure *supervisors* on the Overlord to manage the creation and lifetime of Kinesis indexing tasks. These indexing tasks read events using Kinesis' own shard and sequence number mechanism to guarantee exactly-once ingestion. The supervisor oversees the state of the indexing tasks to: +When you enable the Kinesis indexing service, you can configure supervisors on the Overlord to manage the creation and lifetime of Kinesis indexing tasks. These indexing tasks read events using Kinesis' own shard and sequence number mechanism to guarantee exactly-once ingestion. The supervisor oversees the state of the indexing tasks to coordinate handoffs, manage failures, and ensure that scalability and replication requirements are maintained. -- coordinate handoffs -- manage failures -- ensure that scalability and replication requirements are maintained. +This topic contains configuration reference information for the Kinesis indexing service supervisor for Apache Druid. -To use the Kinesis indexing service, load the `druid-kinesis-indexing-service` core Apache Druid extension (see -[Including Extensions](../../configuration/extensions.md#loading-extensions)). +## Setup -> Before you deploy the Kinesis extension to production, read the [Kinesis known issues](#kinesis-known-issues). +To use the Kinesis indexing service, you must first load the `druid-kinesis-indexing-service` core extension on both the Overlord and the Middle Manager. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information. +We recommend that you review the [Kinesis known issues](#kinesis-known-issues) before deploying the `druid-kinesis-indexing-service` extension to production. -## Submitting a Supervisor Spec +## Supervisor spec -To use the Kinesis indexing service, load the `druid-kinesis-indexing-service` extension on both the Overlord and the MiddleManagers. Druid starts a supervisor for a dataSource when you submit a supervisor spec. Submit your supervisor spec to the following endpoint: +The following table outlines the high-level configuration options for the Kinesis supervisor object. +See [Supervisor API](../../api-reference/supervisor-api.md) for more information. -`http://<OVERLORD_IP>:<OVERLORD_PORT>/druid/indexer/v1/supervisor` +|Property|Type|Description|Required| +|--------|----|-----------|--------| +|`type`|String|The supervisor type; this should always be `kinesis`.|Yes| +|`spec`|Object|The container object for the supervisor configuration.|Yes| +|`ioConfig`|Object|The [I/O configuration](#supervisor-io-configuration) object for configuring Kafka connection and I/O-related settings for the supervisor and indexing task.|Yes| +|`dataSchema`|Object|The schema used by the Kinesis indexing task during ingestion. See [`dataSchema`](../../ingestion/ingestion-spec.md#dataschema) for more information.|Yes| +|`tuningConfig`|Object|The [tuning configuration](#supervisor-tuning-configuration) object for configuring performance-related settings for the supervisor and indexing tasks.|No| -For example: +Druid starts a new supervisor when you define a supervisor spec. +To create a supervisor, send a `POST` request to the `/druid/indexer/v1/supervisor` endpoint. +Once created, the supervisor persists in the configured metadata database. There can only be a single supervisor per datasource, and submitting a second spec for the same datasource overwrites the previous one. -```sh -curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http://localhost:8090/druid/indexer/v1/supervisor -``` +When an Overlord gains leadership, either by being started or as a result of another Overlord failing, it spawns +a supervisor for each supervisor spec in the metadata database. The supervisor then discovers running Kinesis indexing +tasks and attempts to adopt them if they are compatible with the supervisor's configuration. If they are not +compatible because they have a different ingestion spec or shard allocation, the tasks are killed and the +supervisor creates a new set of tasks. In this way, the supervisors persist across Overlord restarts and failovers. -Where the file `supervisor-spec.json` contains a Kinesis supervisor spec: +The following example shows how to submit a supervisor spec for a stream with the name `KinesisStream`. +In this example, `http://SERVICE_IP:SERVICE_PORT` is a placeholder for the server address of deployment and the service port. -```json -{ +<!--DOCUSAURUS_CODE_TABS--> + +<!--cURL--> +```shell +curl -X POST "http://SERVICE_IP:SERVICE_PORT/druid/indexer/v1/supervisor" \ +-H "Content-Type: application/json" \ +-d '{ "type": "kinesis", "spec": { + "ioConfig": { + "type": "kinesis", + "stream": "KinesisStream", + "inputFormat": { + "type": "json" + }, + "useEarliestSequenceNumber": true + }, + "tuningConfig": { + "type": "kinesis" + }, "dataSchema": { - "dataSource": "metrics-kinesis", + "dataSource": "KinesisStream", "timestampSpec": { "column": "timestamp", - "format": "auto" + "format": "iso" }, - "dimensionsSpec": { - "dimensions": [], - "dimensionExclusions": [ - "timestamp", - "value" - ] + "dimensionsSpec": { + "dimensions": [ + "isRobot", + "channel", + "flags", + "isUnpatrolled", + "page", + "diffUrl", + { + "type": "long", + "name": "added" + }, + "comment", + { + "type": "long", + "name": "commentLength" + }, + "isNew", + "isMinor", + { + "type": "long", + "name": "delta" + }, + "isAnonymous", + "user", + { + "type": "long", + "name": "deltaBucket" + }, + { + "type": "long", + "name": "deleted" + }, + "namespace", + "cityName", + "countryName", + "regionIsoCode", + "metroCode", + "countryIsoCode", + "regionName" + ] }, - "metricsSpec": [ - { - "name": "count", - "type": "count" - }, - { - "name": "value_sum", - "fieldName": "value", - "type": "doubleSum" - }, - { - "name": "value_min", - "fieldName": "value", - "type": "doubleMin" - }, - { - "name": "value_max", - "fieldName": "value", - "type": "doubleMax" - } - ], - "granularitySpec": { - "type": "uniform", - "segmentGranularity": "HOUR", - "queryGranularity": "NONE" - } - }, + "granularitySpec": { + "queryGranularity": "none", + "rollup": false, + "segmentGranularity": "hour" + } + } + } +}' +``` +<!--HTTP--> +```HTTP +POST /druid/indexer/v1/supervisor Review Comment: same comment as line 65 wrt/ router vs indexer ########## docs/development/extensions-core/kinesis-ingestion.md: ########## @@ -23,154 +23,259 @@ sidebar_label: "Amazon Kinesis" ~ under the License. --> -When you enable the Kinesis indexing service, you can configure *supervisors* on the Overlord to manage the creation and lifetime of Kinesis indexing tasks. These indexing tasks read events using Kinesis' own shard and sequence number mechanism to guarantee exactly-once ingestion. The supervisor oversees the state of the indexing tasks to: +When you enable the Kinesis indexing service, you can configure supervisors on the Overlord to manage the creation and lifetime of Kinesis indexing tasks. These indexing tasks read events using Kinesis' own shard and sequence number mechanism to guarantee exactly-once ingestion. The supervisor oversees the state of the indexing tasks to coordinate handoffs, manage failures, and ensure that scalability and replication requirements are maintained. -- coordinate handoffs -- manage failures -- ensure that scalability and replication requirements are maintained. +This topic contains configuration reference information for the Kinesis indexing service supervisor for Apache Druid. -To use the Kinesis indexing service, load the `druid-kinesis-indexing-service` core Apache Druid extension (see -[Including Extensions](../../configuration/extensions.md#loading-extensions)). +## Setup -> Before you deploy the Kinesis extension to production, read the [Kinesis known issues](#kinesis-known-issues). +To use the Kinesis indexing service, you must first load the `druid-kinesis-indexing-service` core extension on both the Overlord and the Middle Manager. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information. +We recommend that you review the [Kinesis known issues](#kinesis-known-issues) before deploying the `druid-kinesis-indexing-service` extension to production. -## Submitting a Supervisor Spec +## Supervisor spec -To use the Kinesis indexing service, load the `druid-kinesis-indexing-service` extension on both the Overlord and the MiddleManagers. Druid starts a supervisor for a dataSource when you submit a supervisor spec. Submit your supervisor spec to the following endpoint: +The following table outlines the high-level configuration options for the Kinesis supervisor object. +See [Supervisor API](../../api-reference/supervisor-api.md) for more information. -`http://<OVERLORD_IP>:<OVERLORD_PORT>/druid/indexer/v1/supervisor` +|Property|Type|Description|Required| +|--------|----|-----------|--------| +|`type`|String|The supervisor type; this should always be `kinesis`.|Yes| +|`spec`|Object|The container object for the supervisor configuration.|Yes| +|`ioConfig`|Object|The [I/O configuration](#supervisor-io-configuration) object for configuring Kafka connection and I/O-related settings for the supervisor and indexing task.|Yes| +|`dataSchema`|Object|The schema used by the Kinesis indexing task during ingestion. See [`dataSchema`](../../ingestion/ingestion-spec.md#dataschema) for more information.|Yes| +|`tuningConfig`|Object|The [tuning configuration](#supervisor-tuning-configuration) object for configuring performance-related settings for the supervisor and indexing tasks.|No| -For example: +Druid starts a new supervisor when you define a supervisor spec. +To create a supervisor, send a `POST` request to the `/druid/indexer/v1/supervisor` endpoint. +Once created, the supervisor persists in the configured metadata database. There can only be a single supervisor per datasource, and submitting a second spec for the same datasource overwrites the previous one. -```sh -curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http://localhost:8090/druid/indexer/v1/supervisor -``` +When an Overlord gains leadership, either by being started or as a result of another Overlord failing, it spawns +a supervisor for each supervisor spec in the metadata database. The supervisor then discovers running Kinesis indexing +tasks and attempts to adopt them if they are compatible with the supervisor's configuration. If they are not +compatible because they have a different ingestion spec or shard allocation, the tasks are killed and the +supervisor creates a new set of tasks. In this way, the supervisors persist across Overlord restarts and failovers. -Where the file `supervisor-spec.json` contains a Kinesis supervisor spec: +The following example shows how to submit a supervisor spec for a stream with the name `KinesisStream`. +In this example, `http://SERVICE_IP:SERVICE_PORT` is a placeholder for the server address of deployment and the service port. -```json -{ +<!--DOCUSAURUS_CODE_TABS--> + +<!--cURL--> +```shell +curl -X POST "http://SERVICE_IP:SERVICE_PORT/druid/indexer/v1/supervisor" \ +-H "Content-Type: application/json" \ +-d '{ "type": "kinesis", "spec": { + "ioConfig": { + "type": "kinesis", + "stream": "KinesisStream", + "inputFormat": { + "type": "json" + }, + "useEarliestSequenceNumber": true + }, + "tuningConfig": { + "type": "kinesis" + }, "dataSchema": { - "dataSource": "metrics-kinesis", + "dataSource": "KinesisStream", "timestampSpec": { "column": "timestamp", - "format": "auto" + "format": "iso" }, - "dimensionsSpec": { - "dimensions": [], - "dimensionExclusions": [ - "timestamp", - "value" - ] + "dimensionsSpec": { + "dimensions": [ + "isRobot", + "channel", + "flags", + "isUnpatrolled", + "page", + "diffUrl", + { + "type": "long", + "name": "added" + }, + "comment", + { + "type": "long", + "name": "commentLength" + }, + "isNew", + "isMinor", + { + "type": "long", + "name": "delta" + }, + "isAnonymous", + "user", + { + "type": "long", + "name": "deltaBucket" + }, + { + "type": "long", + "name": "deleted" + }, + "namespace", + "cityName", + "countryName", + "regionIsoCode", + "metroCode", + "countryIsoCode", + "regionName" + ] }, - "metricsSpec": [ - { - "name": "count", - "type": "count" - }, - { - "name": "value_sum", - "fieldName": "value", - "type": "doubleSum" - }, - { - "name": "value_min", - "fieldName": "value", - "type": "doubleMin" - }, - { - "name": "value_max", - "fieldName": "value", - "type": "doubleMax" - } - ], - "granularitySpec": { - "type": "uniform", - "segmentGranularity": "HOUR", - "queryGranularity": "NONE" - } - }, + "granularitySpec": { + "queryGranularity": "none", + "rollup": false, + "segmentGranularity": "hour" + } + } + } +}' +``` +<!--HTTP--> +```HTTP +POST /druid/indexer/v1/supervisor +HTTP/1.1 +Host: http://SERVICE_IP:SERVICE_PORT +Content-Type: application/json + +{ + "type": "kinesis", + "spec": { "ioConfig": { - "stream": "metrics", - "inputFormat": { - "type": "json" + "type": "kinesis", + "stream": "KinesisStream", + "inputFormat": { + "type": "json" + }, + "useEarliestSequenceNumber": true + }, + "tuningConfig": { + "type": "kinesis" + }, + "dataSchema": { + "dataSource": "KinesisStream", + "timestampSpec": { + "column": "timestamp", + "format": "iso" + }, + "dimensionsSpec": { + "dimensions": [ + "isRobot", + "channel", + "flags", + "isUnpatrolled", + "page", + "diffUrl", + { + "type": "long", + "name": "added" + }, + "comment", + { + "type": "long", + "name": "commentLength" + }, + "isNew", + "isMinor", + { + "type": "long", + "name": "delta" + }, + "isAnonymous", + "user", + { + "type": "long", + "name": "deltaBucket" + }, + { + "type": "long", + "name": "deleted" + }, + "namespace", + "cityName", + "countryName", + "regionIsoCode", + "metroCode", + "countryIsoCode", + "regionName" + ] }, - "endpoint": "kinesis.us-east-1.amazonaws.com", - "taskCount": 1, - "replicas": 1, - "taskDuration": "PT1H" - }, - "tuningConfig": { - "type": "kinesis", - "maxRowsPerSegment": 5000000 - } + "granularitySpec": { + "queryGranularity": "none", + "rollup": false, + "segmentGranularity": "hour" + } + } } } ``` - -## Supervisor Spec - -|Field|Description|Required| -|--------|-----------|---------| -|`type`|The supervisor type; this should always be `kinesis`.|yes| -|`spec`|Container object for the supervisor configuration.|yes| -|`dataSchema`|The schema that will be used by the Kinesis indexing task during ingestion. See [`dataSchema`](../../ingestion/ingestion-spec.md#dataschema).|yes| -|`ioConfig`|An [`ioConfig`](#ioconfig) object for configuring Kafka connection and I/O-related settings for the supervisor and indexing task.|yes| -|`tuningConfig`|A [`tuningConfig`](#tuningconfig) object for configuring performance-related settings for the supervisor and indexing tasks.|no| - -### `ioConfig` - -|Field|Type|Description|Required| -|-----|----|-----------|--------| -|`stream`|String|The Kinesis stream to read.|yes| -|`inputFormat`|Object|[`inputFormat`](../../ingestion/data-formats.md#input-format) to specify how to parse input data. See [Specifying data format](#specifying-data-format) for details about specifying the input format.|yes| -|`endpoint`|String|The AWS Kinesis stream endpoint for a region. You can find a list of endpoints [here](http://docs.aws.amazon.com/general/latest/gr/rande.html#ak_region).|no (default == kinesis.us-east-1.amazonaws.com)| -|`replicas`|Integer|The number of replica sets, where 1 means a single set of tasks (no replication). Replica tasks will always be assigned to different workers to provide resiliency against process failure.|no (default == 1)| -|`taskCount`|Integer|The maximum number of *reading* tasks in a *replica set*. This means that the maximum number of reading tasks will be `taskCount * replicas` and the total number of tasks (*reading* + *publishing*) will be higher than this. See [Capacity Planning](#capacity-planning) below for more details. The number of reading tasks will be less than `taskCount` if `taskCount > {numKinesisShards}`.|no (default == 1)| -|`taskDuration`|ISO8601 Period|The length of time before tasks stop reading and begin publishing their segment.|no (default == PT1H)| -|`startDelay`|ISO8601 Period|The period to wait before the supervisor starts managing tasks.|no (default == PT5S)| -|`period`|ISO8601 Period|How often the supervisor will execute its management logic. Note that the supervisor will also run in response to certain events (such as tasks succeeding, failing, and reaching their taskDuration) so this value specifies the maximum time between iterations.|no (default == PT30S)| -|`useEarliestSequenceNumber`|Boolean|If a supervisor is managing a dataSource for the first time, it will obtain a set of starting sequence numbers from Kinesis. This flag determines whether it retrieves the earliest or latest sequence numbers in Kinesis. Under normal circumstances, subsequent tasks will start from where the previous segments ended so this flag will only be used on first run.|no (default == false)| -|`completionTimeout`|ISO8601 Period|The length of time to wait before declaring a publishing task as failed and terminating it. If this is set too low, your tasks may never publish. The publishing clock for a task begins roughly after `taskDuration` elapses.|no (default == PT6H)| -|`lateMessageRejectionPeriod`|ISO8601 Period|Configure tasks to reject messages with timestamps earlier than this period before the task was created; for example if this is set to `PT1H` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps earlier than *2016-01-01T11:00Z* will be dropped. This may help prevent concurrency issues if your data stream has late messages and you have multiple pipelines that need to operate on the same segments (e.g. a streaming and a nightly batch ingestion pipeline).|no (default == none)| -|`earlyMessageRejectionPeriod`|ISO8601 Period|Configure tasks to reject messages with timestamps later than this period after the task reached its taskDuration; for example if this is set to `PT1H`, the taskDuration is set to `PT1H` and the supervisor creates a task at *2016-01-01T12:00Z*. Messages with timestamps later than *2016-01-01T14:00Z* will be dropped. **Note:** Tasks sometimes run past their task duration, for example, in cases of supervisor failover. Setting `earlyMessageRejectionPeriod` too low may cause messages to be dropped unexpectedly whenever a task runs past its originally configured task duration.|no (default == none)| -|`recordsPerFetch`|Integer|The number of records to request per call to fetch records from Kinesis. See [Determining fetch settings](#determining-fetch-settings).|no (see [Determining fetch settings](#determining-fetch-settings) for defaults)| -|`fetchDelayMillis`|Integer|Time in milliseconds to wait between subsequent calls to fetch records from Kinesis. See [Determining fetch settings](#determining-fetch-settings).|no (default == 0)| -|`awsAssumedRoleArn`|String|The AWS assumed role to use for additional permissions.|no| -|`awsExternalId`|String|The AWS external id to use for additional permissions.|no| -|`deaggregate`|Boolean|Whether to use the de-aggregate function of the KCL. See below for details.|no| -|`autoScalerConfig`|Object|Defines auto scaling behavior for Kinesis ingest tasks. See [Tasks Autoscaler Properties](#task-autoscaler-properties).|no (default == null)| - -#### Task Autoscaler Properties - -| Property | Description | Required | -| ------------- | ------------- | ------------- | -| `enableTaskAutoScaler` | Enable or disable the auto scaler. When false or absent, Druid disables the `autoScaler` even when `autoScalerConfig` is not null.| no (default == false) | -| `taskCountMax` | Maximum number of Kinesis ingestion tasks. Must be greater than or equal to `taskCountMin`. If greater than `{numKinesisShards}`, the maximum number of reading tasks is `{numKinesisShards}` and `taskCountMax` is ignored. | yes | -| `taskCountMin` | Minimum number of Kinesis ingestion tasks. When you enable the auto scaler, Druid ignores the value of taskCount in `IOConfig` and uses`taskCountMin` for the initial number of tasks to launch.| yes | -| `minTriggerScaleActionFrequencyMillis` | Minimum time interval between two scale actions | no (default == 600000) | -| `autoScalerStrategy` | The algorithm of `autoScaler`. ONLY `lagBased` is supported for now. See [Lag Based AutoScaler Strategy Related Properties](#lag-based-autoscaler-strategy-related-properties) for details.| no (default == `lagBased`) | - -##### Lag Based AutoScaler Strategy Related Properties - -The Kinesis indexing service reports lag metrics measured in time milliseconds rather than message count which is used by Kafka. - -| Property | Description | Required | -| ------------- | ------------- | ------------- | -| `lagCollectionIntervalMillis` | Period of lag points collection. | no (default == 30000) | -| `lagCollectionRangeMillis` | The total time window of lag collection, Use with `lagCollectionIntervalMillis`,it means that in the recent `lagCollectionRangeMillis`, collect lag metric points every `lagCollectionIntervalMillis`. | no (default == 600000) | -| `scaleOutThreshold` | The Threshold of scale out action | no (default == 6000000) | -| `triggerScaleOutFractionThreshold` | If `triggerScaleOutFractionThreshold` percent of lag points are higher than `scaleOutThreshold`, then do scale out action. | no (default == 0.3) | -| `scaleInThreshold` | The Threshold of scale in action | no (default == 1000000) | -| `triggerScaleInFractionThreshold` | If `triggerScaleInFractionThreshold` percent of lag points are lower than `scaleOutThreshold`, then do scale in action. | no (default == 0.9) | -| `scaleActionStartDelayMillis` | Number of milliseconds to delay after the supervisor starts before the first scale logic check. | no (default == 300000) | -| `scaleActionPeriodMillis` | Frequency in milliseconds to check if a scale action is triggered | no (default == 60000) | -| `scaleInStep` | Number of tasks to reduce at a time when scaling down | no (default == 1) | -| `scaleOutStep` | Number of tasks to add at a time when scaling out | no (default == 2) | - -The following example demonstrates a supervisor spec with `lagBased` autoScaler enabled: +<!--END_DOCUSAURUS_CODE_TABS--> + +## Supervisor I/O configuration + +The following table outlines the configuration options for `ioConfig`: + +|Property|Type|Description|Required|Default| +|--------|----|-----------|--------|-------| +|`stream`|String|The Kinesis stream to read.|Yes|| +|`inputFormat`|Object|The [input format](../../ingestion/data-formats.md#input-format) to specify how to parse input data. See [Specify data format](#specify-data-format) for more information.|Yes|| +|`endpoint`|String|The AWS Kinesis stream endpoint for a region. You can find a list of endpoints in the [AWS service endpoints](http://docs.aws.amazon.com/general/latest/gr/rande.html#ak_region) document.|No|`kinesis.us-east-1.amazonaws.com`| +|`replicas`|Integer|The number of replica sets, where 1 is a single set of tasks (no replication). Druid always assigns replicate tasks to different workers to provide resiliency against process failure.|No|1| +|`taskCount`|Integer|The maximum number of reading tasks in a replica set. Multiply `taskCount` and `replicas` to measure the maximum number of reading tasks. <br />The total number of tasks (reading and publishing) is higher than the maximum number of reading tasks. See [Capacity planning](#capacity-planning) for more details. When `taskCount > {numKinesisShards}`, the actual number of reading tasks is less than the `taskCount` value.|No|1| +|`taskDuration`|ISO 8601 period|The length of time before tasks stop reading and begin publishing their segments.|No|PT1H| +|`startDelay`|ISO 8601 period|The period to wait before the supervisor starts managing tasks.|No|PT5S| +|`period`|ISO 8601 period|Determines how often the supervisor executes its management logic. Note that the supervisor also runs in response to certain events, such as tasks succeeding, failing, and reaching their task duration, so this value specifies the maximum time between iterations.|No|PT30S| +|`useEarliestSequenceNumber`|Boolean|If a supervisor is managing a datasource for the first time, it obtains a set of starting sequence numbers from Kinesis. This flag determines whether a supervisor retrieves the earliest or latest sequence numbers in Kinesis. Under normal circumstances, subsequent tasks start from where the previous segments ended so this flag is only used on the first run.|No|`false`| +|`completionTimeout`|ISO 8601 period|The length of time to wait before Druid declares a publishing task has failed and terminates it. If this is set too low, your tasks may never publish. The publishing clock for a task begins roughly after `taskDuration` elapses.|No|PT6H| +|`lateMessageRejectionPeriod`|ISO 8601 period|Configure tasks to reject messages with timestamps earlier than this period before the task is created. For example, if `lateMessageRejectionPeriod` is set to `PT1H` and the supervisor creates a task at `2016-01-01T12:00Z`, messages with timestamps earlier than `2016-01-01T11:00Z` are dropped. This may help prevent concurrency issues if your data stream has late messages and you have multiple pipelines that need to operate on the same segments, such as a streaming and a nightly batch ingestion pipeline.|No|| +|`earlyMessageRejectionPeriod`|ISO 8601 period|Configure tasks to reject messages with timestamps later than this period after the task reached its `taskDuration`. For example, if `earlyMessageRejectionPeriod` is set to `PT1H`, the `taskDuration` is set to `PT1H` and the supervisor creates a task at `2016-01-01T12:00Z`. Messages with timestamps later than `2016-01-01T14:00Z` are dropped. **Note:** Tasks sometimes run past their task duration, for example, in cases of supervisor failover. Setting `earlyMessageRejectionPeriod` too low may cause messages to be dropped unexpectedly whenever a task runs past its originally configured task duration.|No|| +|`recordsPerFetch`|Integer|The number of records to request per call to fetch records from Kinesis.|No| See [Determine fetch settings](#determine-fetch-settings) for defaults.| +|`fetchDelayMillis`|Integer|Time in milliseconds to wait between subsequent calls to fetch records from Kinesis. See [Determine fetch settings](#determine-fetch-settings).|No|0| +|`awsAssumedRoleArn`|String|The AWS assumed role to use for additional permissions.|No|| +|`awsExternalId`|String|The AWS external ID to use for additional permissions.|No|| +|`deaggregate`|Boolean|Whether to use the deaggregate function of the Kinesis Client Library (KCL).|No|| +|`autoScalerConfig`|Object|Defines autoscaling behavior for Kinesis ingest tasks. See [Task autoscaler properties](#task-autoscaler-properties) for more information.|No|null| + +### Task autoscaler properties + +The following table outlines the configuration options for `autoScalerConfig`: + +|Property|Description|Required|Default| +|--------|-----------|--------|-------| +|`enableTaskAutoScaler`|Enables the auto scaler. If not specified, Druid disables the auto scaler even when `autoScalerConfig` is not null.|No|`false`| +|`taskCountMax`|Maximum number of Kinesis ingestion tasks. Must be greater than or equal to `taskCountMin`. If greater than `{numKinesisShards}`, Druid sets the maximum number of reading tasks to `{numKinesisShards}` and ignores `taskCountMax`.|Yes|| +|`taskCountMin`|Minimum number of Kinesis ingestion tasks. When you enable the auto scaler, Druid ignores the value of `taskCount` in `IOConfig` and uses `taskCountMin` for the initial number of tasks to launch.|Yes|| +|`minTriggerScaleActionFrequencyMillis`|Minimum time interval between two scale actions.| No|600000| +|`autoScalerStrategy`|The algorithm of `autoScaler`. Druid only supports the `lagBased` strategy. See [Lag based autoscaler strategy related properties](#lag-based-autoscaler-strategy-related-properties) for more information.|No|Defaults to `lagBased`.| + +### Lag based autoscaler strategy related properties + +Unlike the Kafka +indexing service, Kinesis reports lag metrics measured in time difference in milliseconds between the current sequence number and latest sequence number, rather than message count. Review Comment: ```suggestion Unlike the Kafka indexing service, Kinesis reports lag metrics measured in time difference in milliseconds between the current sequence number and latest sequence number, rather than message count. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
