[GitHub] [incubator-druid] clintropolis commented on a change in pull request #8903: S3 input source

GitBox Fri, 22 Nov 2019 12:10:58 -0800

clintropolis commented on a change in pull request #8903: S3 input source
URL: https://github.com/apache/incubator-druid/pull/8903#discussion_r349774811


 ##########
 File path: docs/development/extensions-core/s3.md
 ##########
 @@ -98,6 +98,54 @@ You can enable [server-side encryption](https://docs.aws.amazon.com/AmazonS3/lat
 - kms: [Server-side encryption with AWS KMS–Managed Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html)
 - custom: [Server-side encryption with Customer-Provided Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html)
 
+
+<a name="input-source"></a>
+
+## S3 batch ingestion input source
+
+This extension also provides an input source for Druid native batch ingestion to support reading objects directly from S3. Objects can be specified either via a list of S3 URI strings or a list of S3 location prefixes, which will attempt to list the contents and ingest all objects contained in the locations. The S3 input source is splittable and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task), where each worker task of `index_parallel` will read a single object.
+
+Sample spec:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "s3",
+        "uris": ["s3://foo/bar/file.json", "s3://bar/foo/file2.json"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "s3",
+        "prefixes": ["s3://foo/bar", "s3://bar/foo"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|type|This should be `s3`.|N/A|yes|
+|uris|JSON array of URIs where s3 files to be ingested are located.|N/A|`uris` or `prefixes` must be set|
+|prefixes|JSON array of URI prefixes for the locations of s3 files to be ingested.|N/A|`uris` or `prefixes` must be set|
+
 
 Review comment:
   Ah, I actually wasn't going to document `objects`, because it's primarily 
used internally for the splits of parallel subtasks, to avoid converting 
bucket/path back into a URI. But if people prefer to pass an array of objects 
instead of an array of URIs, there is no harm in documenting it.
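
For illustration, a minimal sketch of what an `objects`-based input source might look like. The `bucket` and `path` field names are an assumption based on the bucket/path wording above; they do not appear in this diff, so treat the shape as hypothetical until it is documented.

```json
{
  "type": "s3",
  "objects": [
    { "bucket": "foo", "path": "bar/file.json" },
    { "bucket": "bar", "path": "foo/file2.json" }
  ]
}
```

Under that assumption, this would name the same two objects as the `uris` sample in the diff (`s3://foo/bar/file.json` and `s3://bar/foo/file2.json`), just without the URI encoding step.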

