techdocsmith commented on a change in pull request #11490:
URL: https://github.com/apache/druid/pull/11490#discussion_r734892456
##########
File path: docs/development/extensions-core/s3.md
##########
@@ -36,7 +36,7 @@ The [S3 input
source](../../ingestion/native-batch.md#s3-input-source) is supported
to read objects directly from S3. If you use the [Hadoop
task](../../ingestion/hadoop.md),
you can read data from S3 by specifying the S3 paths in your
[`inputSpec`](../../ingestion/hadoop.md#inputspec).
-To configure the extension to read objects from S3 you need to configure how
to [connect to S3](#configuration).
+To configure the extension to read objects from S3 you need to configure Druid
to [connect to S3](#configuration).
Review comment:
```suggestion
To configure the extension to read objects from S3, supply the S3
[connection information](#configuration).
```
##########
File path: docs/development/extensions-core/s3.md
##########
@@ -76,14 +77,15 @@ Druid uses the following credentials provider chain to
connect to your S3 bucket
|6|ECS container credentials|Based on environment variables available on AWS
ECS (AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or
AWS_CONTAINER_CREDENTIALS_FULL_URI) as described in the
[EC2ContainerCredentialsProviderWrapper
documentation](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.html)|
|7|Instance profile information|Based on the instance profile you may have
attached to your druid instance|
-You can find more information about authentication method
[here](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials)<br/>
-**Note :** *Order is important here as it indicates the precedence of
authentication methods.<br/>
-So if you are trying to use Instance profile information, you **must not** set
`druid.s3.accessKey` and `druid.s3.secretKey` in your Druid runtime.properties*
+> You can find more information about authentication methods in the [Amazon
Developer
Guide](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials).
+
+> Order is important here as it indicates the precedence of authentication
methods. If you are trying to use Instance profile information, you **must
not** set `druid.s3.accessKey` and `druid.s3.secretKey` in your Druid
runtime.properties.
+> You can use the property
[`druid.startup.logging.maskProperties`](../../configuration/index.html#startup-logging)
to mask credentials information in Druid logs. For example, `["password",
"secretKey", "awsSecretAccessKey"]`.
Review comment:
```suggestion
You can use the property
[`druid.startup.logging.maskProperties`](../../configuration/index.html#startup-logging)
to mask credentials information in Druid logs. For example, `["password",
"secretKey", "awsSecretAccessKey"]`.
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
+
+Objects to ingest can be specified as:
+
+- a list of S3 URI strings or
+- a list of S3 location prefixes
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`type`|This must be `s3`.|None|yes|
Review comment:
```suggestion
|`type`|Set value to `s3`.|None|yes|
```
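
For context on the suggested wording, a minimal `inputSource` sketch using `uris` might look like the following (the bucket name and object paths are hypothetical, invented purely for illustration):

```json
{
  "type": "s3",
  "uris": [
    "s3://my-example-bucket/data/events-2021-01.json",
    "s3://my-example-bucket/data/events-2021-02.json"
  ]
}
```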
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
+
+Objects to ingest can be specified as:
Review comment:
```suggestion
Specify objects to ingest as either:
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
+
+Objects to ingest can be specified as:
+
+- a list of S3 URI strings or
+- a list of S3 location prefixes
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`type`|This must be `s3`.|None|yes|
+|`uris`|JSON array of URIs where S3 objects to be ingested are
located.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|`prefixes`|JSON array of URI prefixes for the locations of S3 objects to be
ingested. Empty objects starting with one of the given prefixes will be
skipped.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`objects`](#s3-input-objects)|JSON array of S3 Objects to be
ingested.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`properties`](#s3-input-properties-object)|Properties Object for overriding
the default S3 configuration.|None|No (defaults will be used if not given)
+
+> When you supply a list of `prefixes`, Druid will list the contents and then
ingest
+*all* objects contained in the `prefixes` you specify.
+
+> You can view the payload of individual `index_parallel` tasks to see how
Druid has divided up the work of ingestion.
+
+> The S3 input source will skip all empty objects only when `prefixes` is
specified.
+
+#### S3 Input Objects
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`bucket`|Name of the S3 bucket|None|yes|
+|`path`|The path where data is located.|None|yes|
+
+#### S3 Input Properties Object
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`accessKeyId`|The [Password Provider](../operations/password-provider.md) or
plain text string of this S3 InputSource's access key|None|yes if
secretAccessKey is given|
+|`secretAccessKey`|The [Password Provider](../operations/password-provider.md)
or plain text string of this S3 InputSource's secret key|None|yes if
accessKeyId is given|
+|`assumeRoleArn`|AWS ARN of the role to assume. See the [AWS User
Guide](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html).
`assumeRoleArn` can be used either with the ingestion spec AWS credentials or
with the default S3 credentials|None|no|
+|`assumeRoleExternalId`|A unique identifier that might be required when you
assume a role in another account. See the [AWS User
Guide](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html).|None|no|
+
+> If `accessKeyId` and `secretAccessKey` are not given, then the default [S3
credentials provider
chain](../development/extensions-core/s3.md#s3-authentication-methods) is used.
Review comment:
```suggestion
If you do not supply an `accessKeyId` and `secretAccessKey`, Druid uses the
default [S3 credentials provider
chain](../development/extensions-core/s3.md#s3-authentication-methods).
```
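
As a sketch of the alternative (supplying credentials instead of relying on the provider chain), a `properties` object can reference a [Password Provider](../operations/password-provider.md) rather than plain text. The bucket prefix and environment-variable names below are assumptions for illustration only:

```json
{
  "type": "s3",
  "prefixes": ["s3://my-example-bucket/events/"],
  "properties": {
    "accessKeyId": {
      "type": "environment",
      "variable": "S3_ACCESS_KEY"
    },
    "secretAccessKey": {
      "type": "environment",
      "variable": "S3_SECRET_KEY"
    }
  }
}
```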
##########
File path: docs/development/extensions-core/s3.md
##########
@@ -64,7 +64,8 @@ In addition to this you need to set additional configuration,
specific for [deep
### S3 authentication methods
Druid uses the following credentials provider chain to connect to your S3
bucket (whether a deep storage bucket or source bucket).
-**Note :** *You can override the default credentials provider chain for
connecting to source bucket by specifying an access key and secret key using
[Properties Object](../../ingestion/native-batch.md#s3-input-source) parameters
in the ingestionSpec.*
+
+> You can override the default credentials provider chain for connecting to
the source bucket by specifying an access key and secret key using [Properties
Object](../../ingestion/native-batch.md#s3-input-source) parameters in the
ingestion specification.
Review comment:
```suggestion
> To override the default credentials provider chain for connecting to the
source bucket, specify an access key and secret key using [Properties
Object](../../ingestion/native-batch.md#s3-input-source) parameters in the
ingestion specification.
```
##########
File path: docs/development/extensions-core/s3.md
##########
@@ -76,14 +77,15 @@ Druid uses the following credentials provider chain to
connect to your S3 bucket
|6|ECS container credentials|Based on environment variables available on AWS
ECS (AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or
AWS_CONTAINER_CREDENTIALS_FULL_URI) as described in the
[EC2ContainerCredentialsProviderWrapper
documentation](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.html)|
|7|Instance profile information|Based on the instance profile you may have
attached to your druid instance|
-You can find more information about authentication method
[here](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials)<br/>
-**Note :** *Order is important here as it indicates the precedence of
authentication methods.<br/>
-So if you are trying to use Instance profile information, you **must not** set
`druid.s3.accessKey` and `druid.s3.secretKey` in your Druid runtime.properties*
+> You can find more information about authentication methods in the [Amazon
Developer
Guide](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials).
Review comment:
```suggestion
For more information, refer to the [Amazon Developer
Guide](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials).
```
##########
File path: docs/development/extensions-core/s3.md
##########
@@ -76,14 +77,15 @@ Druid uses the following credentials provider chain to
connect to your S3 bucket
|6|ECS container credentials|Based on environment variables available on AWS
ECS (AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or
AWS_CONTAINER_CREDENTIALS_FULL_URI) as described in the
[EC2ContainerCredentialsProviderWrapper
documentation](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.html)|
|7|Instance profile information|Based on the instance profile you may have
attached to your druid instance|
-You can find more information about authentication method
[here](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials)<br/>
-**Note :** *Order is important here as it indicates the precedence of
authentication methods.<br/>
-So if you are trying to use Instance profile information, you **must not** set
`druid.s3.accessKey` and `druid.s3.secretKey` in your Druid runtime.properties*
+> You can find more information about authentication methods in the [Amazon
Developer
Guide](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials).
+
+> Order is important here as it indicates the precedence of authentication
methods. If you are trying to use Instance profile information, you **must
not** set `druid.s3.accessKey` and `druid.s3.secretKey` in your Druid
runtime.properties.
Review comment:
```suggestion
The order of configuration parameters is important here because it indicates
the precedence of authentication methods. If you are trying to use Instance
profile information, do not set `druid.s3.accessKey` and `druid.s3.secretKey`
in your Druid runtime.properties.
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
+
+Objects to ingest can be specified as:
+
+- a list of S3 URI strings or
+- a list of S3 location prefixes
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`type`|This must be `s3`.|None|yes|
+|`uris`|JSON array of URIs where S3 objects to be ingested are
located.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
Review comment:
```suggestion
|`uris`| JSON array of URIs defining the location of S3 objects to ingest
|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be set|
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
Review comment:
```suggestion
The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). In this case each `index_parallel` task reads one or
more objects.
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
+
+Objects to ingest can be specified as:
+
+- a list of S3 URI strings or
+- a list of S3 location prefixes
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`type`|This must be `s3`.|None|yes|
+|`uris`|JSON array of URIs where S3 objects to be ingested are
located.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|`prefixes`|JSON array of URI prefixes for the locations of S3 objects to be
ingested. Empty objects starting with one of the given prefixes will be
skipped.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`objects`](#s3-input-objects)|JSON array of S3 Objects to be
ingested.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`properties`](#s3-input-properties-object)|Properties Object for overriding
the default S3 configuration.|None|No (defaults will be used if not given)
+
+> When you supply a list of `prefixes`, Druid will list the contents and then
ingest
+*all* objects contained in the `prefixes` you specify.
+
+> You can view the payload of individual `index_parallel` tasks to see how
Druid has divided up the work of ingestion.
+
+> The S3 input source will skip all empty objects only when `prefixes` is
specified.
Review comment:
```suggestion
The S3 input source skips all empty objects only when `prefixes` is
specified.
```
##########
File path: docs/development/extensions-core/s3.md
##########
@@ -76,14 +77,15 @@ Druid uses the following credentials provider chain to
connect to your S3 bucket
|6|ECS container credentials|Based on environment variables available on AWS
ECS (AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or
AWS_CONTAINER_CREDENTIALS_FULL_URI) as described in the
[EC2ContainerCredentialsProviderWrapper
documentation](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.html)|
|7|Instance profile information|Based on the instance profile you may have
attached to your druid instance|
-You can find more information about authentication method
[here](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials)<br/>
-**Note :** *Order is important here as it indicates the precedence of
authentication methods.<br/>
-So if you are trying to use Instance profile information, you **must not** set
`druid.s3.accessKey` and `druid.s3.secretKey` in your Druid runtime.properties*
+> You can find more information about authentication methods in the [Amazon
Developer
Guide](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials).
+
+> Order is important here as it indicates the precedence of authentication
methods. If you are trying to use Instance profile information, you **must
not** set `druid.s3.accessKey` and `druid.s3.secretKey` in your Druid
runtime.properties.
+> You can use the property
[`druid.startup.logging.maskProperties`](../../configuration/index.html#startup-logging)
to mask credentials information in Druid logs. For example, `["password",
"secretKey", "awsSecretAccessKey"]`.
### S3 permissions settings
-`s3:GetObject` and `s3:PutObject` are basically required for pushing/loading
segments to/from S3.
+`s3:GetObject` and `s3:PutObject` are required for pushing / pulling segments
to / from S3.
Review comment:
```suggestion
`s3:GetObject` and `s3:PutObject` are required for pushing or pulling
segments to or from S3.
```
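
To make the permissions requirement concrete, those two actions might appear in an IAM policy along the lines of this sketch (the bucket name is hypothetical, and a real deployment may need additional actions such as listing, depending on how Druid is configured):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-druid-bucket/*"
    }
  ]
}
```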
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
+
+Objects to ingest can be specified as:
+
+- a list of S3 URI strings or
+- a list of S3 location prefixes
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`type`|This must be `s3`.|None|yes|
+|`uris`|JSON array of URIs where S3 objects to be ingested are
located.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|`prefixes`|JSON array of URI prefixes for the locations of S3 objects to be
ingested. Empty objects starting with one of the given prefixes will be
skipped.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`objects`](#s3-input-objects)|JSON array of S3 Objects to be
ingested.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`properties`](#s3-input-properties-object)|Properties Object for overriding
the default S3 configuration.|None|No (defaults will be used if not given)
Review comment:
```suggestion
|[`properties`](#s3-input-properties-object)|Properties Object to override
the default S3 configuration.|None|No (defaults will be used if not given)
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
+
+Objects to ingest can be specified as:
+
+- a list of S3 URI strings or
Review comment:
```suggestion
- a list of S3 URI strings
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
+
+Objects to ingest can be specified as:
+
+- a list of S3 URI strings or
+- a list of S3 location prefixes
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`type`|This must be `s3`.|None|yes|
+|`uris`|JSON array of URIs where S3 objects to be ingested are
located.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|`prefixes`|JSON array of URI prefixes for the locations of S3 objects to be
ingested. Empty objects starting with one of the given prefixes will be
skipped.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`objects`](#s3-input-objects)|JSON array of S3 Objects to be
ingested.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`properties`](#s3-input-properties-object)|Properties Object for overriding
the default S3 configuration.|None|No (defaults will be used if not given)
+
+> When you supply a list of `prefixes`, Druid will list the contents and then
ingest
+*all* objects contained in the `prefixes` you specify.
+
+> You can view the payload of individual `index_parallel` tasks to see how
Druid has divided up the work of ingestion.
+
+> The S3 input source will skip all empty objects only when `prefixes` is
specified.
+
+#### S3 Input Objects
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`bucket`|Name of the S3 bucket|None|yes|
+|`path`|The path where data is located.|None|yes|
+
+#### S3 Input Properties Object
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`accessKeyId`|The [Password Provider](../operations/password-provider.md) or
plain text string of this S3 InputSource's access key|None|yes if
secretAccessKey is given|
+|`secretAccessKey`|The [Password Provider](../operations/password-provider.md)
or plain text string of this S3 InputSource's secret key|None|yes if
accessKeyId is given|
+|`assumeRoleArn`|AWS ARN of the role to assume. See the [AWS User
Guide](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html).
`assumeRoleArn` can be used either with the ingestion spec AWS credentials or
with the default S3 credentials|None|no|
+|`assumeRoleExternalId`|A unique identifier that might be required when you
assume a role in another account. See the [AWS User
Guide](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html).|None|no|
+
+> If `accessKeyId` and `secretAccessKey` are not given, then the default [S3
credentials provider
chain](../development/extensions-core/s3.md#s3-authentication-methods) is used.
+
+#### S3 Input Examples
Review comment:
```suggestion
#### S3 input examples
```
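
Under that heading, one of the examples could be an `ioConfig` fragment like this sketch using `prefixes` (datasource-specific values are invented for illustration):

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "s3",
    "prefixes": ["s3://my-example-bucket/prefix1/", "s3://my-example-bucket/prefix2/"]
  },
  "inputFormat": {
    "type": "json"
  }
}
```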
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
+
+Objects to ingest can be specified as:
+
+- a list of S3 URI strings or
+- a list of S3 location prefixes
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`type`|This must be `s3`.|None|yes|
+|`uris`|JSON array of URIs where S3 objects to be ingested are
located.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|`prefixes`|JSON array of URI prefixes for the locations of S3 objects to be
ingested. Empty objects starting with one of the given prefixes will be
skipped.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`objects`](#s3-input-objects)|JSON array of S3 Objects to be
ingested.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`properties`](#s3-input-properties-object)|Properties Object for overriding
the default S3 configuration.|None|No (defaults will be used if not given)
+
+> When you supply a list of `prefixes`, Druid will list the contents and then
ingest
+*all* objects contained in the `prefixes` you specify.
Review comment:
```suggestion
all objects contained in the specified prefixes.
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
### S3 Input Source
-> You need to include the
[`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension
to use the S3 input source.
+Use the *S3 input source* to read objects directly from S3-like storage.
-The S3 input source is to support reading objects directly from S3.
-Objects can be specified either via a list of S3 URI strings or a list of
-S3 location prefixes, which will attempt to list the contents and ingest
-all objects contained in the locations. The S3 input source is splittable
-and can be used by the [Parallel task](#parallel-task),
-where each worker task of `index_parallel` will read one or multiple objects.
+> To ingest from S3-type storage, you need to
[load](../development/extensions.html#loading-extensions) the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension.
-Sample specs:
+> The S3 input source is splittable, meaning it can be used by the [Parallel
task](#parallel-task). Each `index_parallel` task will then read one or
multiple objects.
+
+Objects to ingest can be specified as:
+
+- a list of S3 URI strings or
+- a list of S3 location prefixes
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`type`|This must be `s3`.|None|yes|
+|`uris`|JSON array of URIs where S3 objects to be ingested are
located.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|`prefixes`|JSON array of URI prefixes for the locations of S3 objects to be
ingested. Empty objects starting with one of the given prefixes will be
skipped.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`objects`](#s3-input-objects)|JSON array of S3 Objects to be
ingested.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
+|[`properties`](#s3-input-properties-object)|Properties Object for overriding
the default S3 configuration.|None|No (defaults will be used if not given)
+
+> When you supply a list of `prefixes`, Druid will list the contents and then
ingest
+*all* objects contained in the `prefixes` you specify.
+
+> You can view the payload of individual `index_parallel` tasks to see how
Druid has divided up the work of ingestion.
+
+> The S3 input source will skip all empty objects only when `prefixes` is
specified.
+
+#### S3 Input Objects
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`bucket`|Name of the S3 bucket|None|yes|
+|`path`|The path where data is located.|None|yes|
Review comment:
```suggestion
|`path`|The path to the data|None|yes|
```
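For readers of this thread, a short sketch of how `bucket` and `path` combine in an `objects` entry (the bucket and path names here are made up for illustration, not taken from the PR):

```json
"inputSource": {
  "type": "s3",
  "objects": [
    { "bucket": "wikipedia-batch", "path": "2021/07/01/events.json" },
    { "bucket": "wikipedia-batch", "path": "2021/07/02/events.json" }
  ]
}
```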
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
+
+#### S3 Input Properties Object
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`accessKeyId`|The [Password Provider](../operations/password-provider.md) or
plain text string of this S3 InputSource's access key|None|yes if
secretAccessKey is given|
+|`secretAccessKey`|The [Password Provider](../operations/password-provider.md)
or plain text string of this S3 InputSource's secret key|None|yes if
accessKeyId is given|
+|`assumeRoleArn`|AWS ARN of the role to assume. See the [AWS User
Guide](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html).
`assumeRoleArn` can be used either with the ingestion spec AWS credentials or
with the default S3 credentials|None|no|
+|`assumeRoleExternalId`|A unique identifier that might be required when you
assume a role in another account. See the [AWS User
Guide](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html).|None|no|
+
+> If `accessKeyId` and `secretAccessKey` are not given, then the default [S3
credentials provider
chain](../development/extensions-core/s3.md#s3-authentication-methods) is used.
+
+#### S3 Input Examples
+
+Using URIs, this ingestion specification will ingest two specific objects:
Review comment:
```suggestion
Using URIs, the following ingestion specification ingests two specific
objects:
```
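A minimal sketch of such a URI-based `inputSource` (bucket and object names are illustrative, not from the PR):

```json
"inputSource": {
  "type": "s3",
  "uris": [
    "s3://wikipedia-batch/2021/07/01/events.json",
    "s3://wikipedia-batch/2021/07/02/events.json"
  ]
}
```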
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
+|`prefixes`|JSON array of URI prefixes for the locations of S3 objects to be
ingested. Empty objects starting with one of the given prefixes will be
skipped.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
Review comment:
```suggestion
|`prefixes`| JSON array of URIs defining the URI prefixes for the locations
of S3 objects to ingest. Druid skips empty objects starting with one of the
given prefixes.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects)
must be set|
```
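For illustration, a `prefixes`-based `inputSource` might look like the following (the prefix URIs are hypothetical). Druid lists each prefix and ingests every non-empty object underneath it:

```json
"inputSource": {
  "type": "s3",
  "prefixes": [
    "s3://wikipedia-batch/2021/07/",
    "s3://wikipedia-batch/2021/08/"
  ]
}
```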
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
+#### S3 Input Properties Object
Review comment:
```suggestion
#### S3 input properties object
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -941,33 +985,7 @@ Sample specs:
...
```
-|property|description|default|required?|
-|--------|-----------|-------|---------|
-|type|This should be `s3`.|None|yes|
-|uris|JSON array of URIs where S3 objects to be ingested are
located.|None|`uris` or `prefixes` or `objects` must be set|
-|prefixes|JSON array of URI prefixes for the locations of S3 objects to be
ingested. Empty objects starting with one of the given prefixes will be
skipped.|None|`uris` or `prefixes` or `objects` must be set|
-|objects|JSON array of S3 Objects to be ingested.|None|`uris` or `prefixes` or
`objects` must be set|
-|properties|Properties Object for overriding the default S3 configuration. See
below for more information.|None|No (defaults will be used if not given)
-
-Note that the S3 input source will skip all empty objects only when `prefixes`
is specified.
-
-S3 Object:
-
-|property|description|default|required?|
-|--------|-----------|-------|---------|
-|bucket|Name of the S3 bucket|None|yes|
-|path|The path where data is located.|None|yes|
-
-Properties Object:
-
-|property|description|default|required?|
-|--------|-----------|-------|---------|
-|accessKeyId|The [Password Provider](../operations/password-provider.md) or
plain text string of this S3 InputSource's access key|None|yes if
secretAccessKey is given|
-|secretAccessKey|The [Password Provider](../operations/password-provider.md)
or plain text string of this S3 InputSource's secret key|None|yes if
accessKeyId is given|
-|assumeRoleArn|AWS ARN of the role to assume
[see](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html).
**assumeRoleArn** can be used either with the ingestion spec AWS credentials
or with the default S3 credentials|None|no|
-|assumeRoleExternalId|A unique identifier that might be required when you
assume a role in another account
[see](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html)|None|no|
-
-**Note :** *If accessKeyId and secretAccessKey are not given, the default [S3
credentials provider
chain](../development/extensions-core/s3.md#s3-authentication-methods) is used.*
+> Read more about S3 and Druid on the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension page,
including using S3-like for [Deep Storage](../dependencies/deep-storage.html),
more about authentication, and additional configuration options.
Review comment:
```suggestion
Learn more about S3 and Druid on the
[`druid-s3-extensions`](../development/extensions-core/s3.md) extension page,
including using S3-like for [Deep Storage](../dependencies/deep-storage.html),
more about authentication, and additional configuration options.
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -900,6 +940,8 @@ Sample specs:
...
```
+This ingestion specification provides task-specific credentials to ingest two
specific objects:
Review comment:
```suggestion
The following ingestion specification provides task-specific credentials to
ingest two specific objects:
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
+|[`objects`](#s3-input-objects)|JSON array of S3 Objects to be
ingested.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be
set|
Review comment:
```suggestion
|[`objects`](#s3-input-objects)|JSON array of S3 Objects to
ingest.|None|`uris` or `prefixes` or [`objects`](#s3-input-objects) must be set|
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -880,6 +919,7 @@ Sample specs:
...
```
+This time using `objects`, this specification will ingest two specific
objects, one from the `foo` bucket, one from the `bar` bucket:
Review comment:
```suggestion
The following example uses `objects` to ingest two specific objects, one
from the `foo` bucket, one from the `bar` bucket:
```
When possible, opt for "real world" examples over "foo" and "bar".
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
+#### S3 Input Properties Object
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`accessKeyId`|The [Password Provider](../operations/password-provider.md) or
plain text string of this S3 InputSource's access key|None|yes if
secretAccessKey is given|
+|`secretAccessKey`|The [Password Provider](../operations/password-provider.md)
or plain text string of this S3 InputSource's secret key|None|yes if
accessKeyId is given|
Review comment:
```suggestion
|`secretAccessKey`|The [Password
Provider](../operations/password-provider.md) or plain text string of the S3
InputSource's secret key|None|yes if accessKeyId is given|
```
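To make the credential rows concrete, here is a hedged sketch of a `properties` object. It assumes the environment-variable form of the Password Provider; the variable names and URI are illustrative, not from the PR:

```json
"inputSource": {
  "type": "s3",
  "uris": ["s3://wikipedia-batch/2021/07/01/events.json"],
  "properties": {
    "accessKeyId": {
      "type": "environment",
      "variable": "AWS_ACCESS_KEY_ID"
    },
    "secretAccessKey": {
      "type": "environment",
      "variable": "AWS_SECRET_ACCESS_KEY"
    }
  }
}
```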
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
+> You can view the payload of individual `index_parallel` tasks to see how
Druid has divided up the work of ingestion.
Review comment:
```suggestion
You can view the payload of individual `index_parallel` tasks to see how
Druid has divided up the work of ingestion.
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
+#### S3 Input Properties Object
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|`accessKeyId`|The [Password Provider](../operations/password-provider.md) or
plain text string of this S3 InputSource's access key|None|yes if
secretAccessKey is given|
Review comment:
```suggestion
|`accessKeyId`|The [Password Provider](../operations/password-provider.md)
or plain text string of the S3 InputSource's access key|None|yes if
secretAccessKey is given|
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -864,6 +901,8 @@ Sample specs:
...
```
+This specification will ingest all the objects in two locations given in
`prefixes`:
Review comment:
```suggestion
The following specification ingests all the objects in two locations given
in `prefixes`:
```
##########
File path: docs/ingestion/native-batch.md
##########
@@ -837,16 +837,53 @@ Only the native Parallel task and Simple task support the
input source.
+> When you supply a list of `prefixes`, Druid will list the contents and then
ingest
Review comment:
```suggestion
When you supply a list of `prefixes`, Druid lists the contents and then
ingests
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]