[GitHub] [druid] suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format

2020-01-16 Thread GitBox
suneet-s commented on a change in pull request #9171: Doc update for the new 
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367568997
 
 

 ##
 File path: 
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/SinglePhaseParallelIndexTaskRunner.java
 ##
 @@ -70,7 +71,7 @@
   @Override
   public String getName()
   {
-    return SinglePhaseSubTask.TYPE;
+    return PHASE_NAME;
 
 Review comment:
   Did you mean to include changes to these classes in this PR?





[GitHub] [druid] suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format

2020-01-16 Thread GitBox
suneet-s commented on a change in pull request #9171: Doc update for the new 
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367570885
 
 

 ##
 File path: docs/ingestion/index.md
 ##
 @@ -287,44 +289,31 @@ definition is an _ingestion spec_.
 
 Ingestion specs consists of three main components:
 
-- [`dataSchema`](#dataschema), which configures the [datasource 
name](#datasource), [input row parser](#parser),
-   [primary timestamp](#timestampspec), [flattening of nested 
data](#flattenspec) (if needed),
-   [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and 
filters](#transformspec) (if needed).
-- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source 
system and . For more information, see the
+- [`dataSchema`](#dataschema), which configures the [datasource 
name](#datasource),
+   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), 
[metrics](#metricsspec), and [transforms and filters](#transformspec) (if 
needed).
+- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source 
system and how to parse data. For more information, see the
documentation for each [ingestion method](#ingestion-methods).
 - [`tuningConfig`](#tuningconfig), which controls various tuning parameters 
specific to each
   [ingestion method](#ingestion-methods).
 
-Example ingestion spec for task type "index" (native batch):
+Example ingestion spec for task type `parallel_index` (native batch):
 
 ```
 {
-  "type": "index",
+  "type": "parallel_index",
 
 Review comment:
  This should be `index_parallel` - the same comment applies to lines 299, 332, and 349. I have a doc change coming up, so I can fix these in the next patch as well.
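
For reference, a minimal sketch of how the corrected example header could look once the task type is fixed (nested fields elided with `...`, following the doc's own convention; this is illustrative only, not the full example spec):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": { ... },
    "ioConfig": {
      "type": "index_parallel",
      ...
    },
    "tuningConfig": {
      "type": "index_parallel",
      ...
    }
  }
}
```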






[GitHub] [druid] suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format

2020-01-16 Thread GitBox
suneet-s commented on a change in pull request #9171: Doc update for the new 
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367567725
 
 

 ##
 File path: website/package-lock.json
 ##
 @@ -3913,8 +3913,7 @@
 "ansi-regex": {
   "version": "2.1.1",
   "bundled": true,
-  "dev": true,
 
 Review comment:
  Just curious why all of these were marked as optional before but aren't needed any more.





[GitHub] [druid] suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format

2020-01-16 Thread GitBox
suneet-s commented on a change in pull request #9171: Doc update for the new 
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367575702
 
 

 ##
 File path: docs/ingestion/data-formats.md
 ##
 @@ -63,155 +65,968 @@ _TSV (Delimited)_
 
 Note that the CSV and TSV data do not contain column heads. This becomes 
important when you specify the data for ingesting.
 
+Besides text formats, Druid also supports binary formats such as [Orc](#orc) 
and [Parquet](#parquet) formats.
+
 ## Custom Formats
 
 Druid supports custom data formats and can use the `Regex` parser or the 
`JavaScript` parsers to parse these formats. Please note that using any of 
these parsers for
 parsing data will not be as efficient as writing a native Java parser or using 
an external stream processor. We welcome contributions of new Parsers.
 
-## Configuration
+## Input Format
+
+> The Input Format is a new way to specify the data format of your input data 
which was introduced in 0.17.0.
+Unfortunately, the Input Format doesn't support all data formats or ingestion 
methods supported by Druid yet.
+Especially if you want to use the Hadoop ingestion, you still need to use the 
[Parser](#parser-deprecated).
+If your data is formatted in some format not listed in this section, please 
consider using the Parser instead.
 
-All forms of Druid ingestion require some form of schema object. The format of 
the data to be ingested is specified using the`parseSpec` entry in your 
`dataSchema`.
+All forms of Druid ingestion require some form of schema object. The format of 
the data to be ingested is specified using the `inputFormat` entry in your 
[`ioConfig`](index.md#ioconfig).
 
 ### JSON
 
+The `inputFormat` to load data of JSON format. An example is:
+
+```json
+"ioConfig": {
+  "inputFormat": {
+"type": "json"
+  },
+  ...
+}
+```
+
+The JSON `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|---|--|-|--|
+| type | String | This should say `json`. | yes |
+| flattenSpec | JSON Object | Specifies flattening configuration for nested 
JSON data. See [`flattenSpec`](#flattenspec) for more info. | no |
+| featureSpec | JSON Object | [JSON parser 
features](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features) 
supported by Jackson library. Those features will be applied when parsing the 
input JSON data. | no |
+
+### CSV
+
+The `inputFormat` to load data of the CSV format. An example is:
+
+```json
+"ioConfig": {
+  "inputFormat": {
+"type": "csv",
+"columns" : 
["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"]
+  },
+  ...
+}
+```
+
+The CSV `inputFormat` has the following components:
+
+| Field | Type | Description | Required |
+|---|--|-|--|
+| type | String | This should say `csv`. | yes |
+| listDelimiter | String | A custom delimiter for multi-value dimensions. | no 
(default == ctrl+A) |
+| columns | JSON array | Specifies the columns of the data. The columns should 
be in the same order with the columns of your data. | yes if 
`findColumnsFromHeader` is false or missing |
+| findColumnsFromHeader | Boolean | If this is set, the task will find the 
column names from the header row. Note that `skipHeaderRows` will be applied 
before finding column names from the header. For example, if you set 
`skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip 
the first two lines and then extract column information from the third line. 
`columns` will be ignored if this is set to true. | no (default = false if 
`columns` is set; otherwise null) |
+| skipHeaderRows | Integer | If this is set, the task will skip the first 
`skipHeaderRows` rows. | no (default = 0) |
+
+### TSV (Delimited)
+
+```json
+"ioConfig": {
+  "inputFormat": {
+"type": "tsv",
+"columns" : 
["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
+"delimiter":"|"
+  },
+  ...
+}
+```
+
+The `inputFormat` to load data of a delimited format. An example is:
+
+| Field | Type | Description | Required |
+|---|--|-|--|
+| type | String | This should say `tsv`. | yes |
+| delimiter | String | A custom delimiter for data values. | no (default == 
`\t`) |
+| listDelimiter | String | A custom delimiter for multi-value dimensions. | no 
(default == ctrl+A) |
+| columns | JSON array | Specifies the columns of the data. The columns should 
be in the same order with the columns of your data. | yes if 
`findColumnsFromHeader` is false or missing |
+| findColumnsFromHeader | Boolean | If this is set, the task will find the 
column names from the header row. Note that `skipHeaderRows` will be applied 
before finding column names from the 

[GitHub] [druid] suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format

2020-01-16 Thread GitBox
suneet-s commented on a change in pull request #9171: Doc update for the new 
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367561847
 
 

 ##
 File path: docs/development/extensions-core/kafka-ingestion.md
 ##
 @@ -60,22 +60,16 @@ A sample supervisor spec is shown below:
   "type": "kafka",
   "dataSchema": {
 "dataSource": "metrics-kafka",
-"parser": {
 
 Review comment:
  I didn't update the Kafka tutorial to use this spec. I can follow up in a separate patch.
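
For the follow-up, a rough sketch of the `ioConfig` shape the Kafka tutorial would move to once `parser` is replaced by `inputFormat` (the topic, bootstrap servers, and task settings here are placeholders, not the tutorial's actual values):

```json
"ioConfig": {
  "topic": "metrics",
  "inputFormat": {
    "type": "json"
  },
  "consumerProperties": {
    "bootstrap.servers": "localhost:9092"
  },
  "taskCount": 1,
  "replicas": 1,
  "taskDuration": "PT1H"
}
```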





[GitHub] [druid] suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format

2020-01-15 Thread GitBox
suneet-s commented on a change in pull request #9171: Doc update for the new 
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367139554
 
 

 ##
 File path: docs/development/extensions-core/hdfs.md
 ##
 @@ -36,49 +36,105 @@ To use this Apache Druid extension, make sure to 
[include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal 
user name |empty|
 
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path
 to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a 
location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration 
files (such as `core-site.xml`, `hdfs-site.xml`)
+in the Druid classpath. One way to do this is copying all those files under 
`${DRUID_HOME}/conf/_common`.
+
+If you are using the Hadoop ingestion, set your output directory to be a 
location on Hadoop and it will work.
 If you want to eagerly authenticate against a secured hadoop/hdfs cluster you 
must set `druid.hadoop.security.kerberos.principal` and 
`druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job 
method that runs `kinit` command periodically.
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use the AWS S3 or the Google Cloud Storage as the deep storage 
via HDFS.
+
+ Configuration for AWS S3
 
-The HDFS extension can also be used for GCS as deep storage.
+To use the AWS S3 as the deep storage, you need to configure 
`druid.storage.storageDirectory` properly.
 
 |Property|Possible Values|Description|Default|
 ||---|---|---|
-|`druid.storage.type`|hdfs||Must be set.|
-|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|
+|`druid.storage.type`|hdfs| |Must be set.|
+|`druid.storage.storageDirectory`|s3a://bucket/example/directory or 
s3n://bucket/example/directory|Path to the deep storage|Must be set.|
 
-All services that need to access GCS need to have the [GCS connector 
jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation)
 in their class path. One option is to place this jar in /lib/ and 
/extensions/druid-hdfs-storage/
+You also need to include the [Hadoop AWS 
module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html),
 especially the `hadoop-aws.jar` in the Druid classpath.
+Run the below command to install the `hadoop-aws.jar` file under 
`${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes.
 
-Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
-
-
+```bash
+java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps 
-h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
+cp 
${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar
 ${DRUID_HOME}/extensions/druid-hdfs-storage/
+```
 
-## Native batch ingestion
+Finally, you need to add the below properties in the `core-site.xml`.
+For more configurations, see the [Hadoop AWS 
module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).
+
+```xml
+<property>
+  <name>fs.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+  <description>The implementation class of the S3A Filesystem</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3A</value>
+  <description>The implementation class of the S3A AbstractFileSystem.</description>
+</property>
+
+<property>
+  <name>fs.s3a.access.key</name>
+  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your access key</value>
+</property>
+
+<property>
+  <name>fs.s3a.secret.key</name>
+  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your secret key</value>
+</property>
+```
 
-This firehose ingests events from a predefined list of files from a Hadoop 
filesystem.
-This firehose is _splittable_ and can be used by [native parallel index 
tasks](../../ingestion/native-batch.md#parallel-task).
-Since each split represents an HDFS file, each worker task of `index_parallel` 
will read an object.
+ Configuration for Google Cloud Storage
 
-Sample spec:
+To use the Google cloud Storage as the deep storage, you need to configure 
`druid.storage.storageDirectory` properly.
 
-```json
-"firehose" : {
-"type" : "hdfs",
-"paths": "/foo/bar,/foo/baz"
-}
+|Property|Possible Values|Description|Default|
+||---|---|---|
+|`druid.storage.type`|hdfs||Must be set.|
+|`druid.storage.storageDirectory`|gs://bucket/example/directory|Path to the 
deep storage|Must be set.|
+
+All services that need to access GCS need to have the [GCS connector 
jar](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
 in their class path.
+One option is to place this jar in `${DRUID_HOME}/lib/` and 
`${DRUID_HOME}/extensions/druid-hdfs-storage/`.
+
+Finally, you need to add the below properties in the `core-site.xml`.
+For 

[GitHub] [druid] suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format

2020-01-15 Thread GitBox
suneet-s commented on a change in pull request #9171: Doc update for the new 
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367140667
 
 

 ##
 File path: docs/development/extensions-core/hdfs.md
 ##
 @@ -36,49 +36,105 @@ To use this Apache Druid extension, make sure to 
[include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal 
user name |empty|
 
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path
 to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a 
location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration 
files (such as `core-site.xml`, `hdfs-site.xml`)
+in the Druid classpath. One way to do this is copying all those files under 
`${DRUID_HOME}/conf/_common`.
+
+If you are using the Hadoop ingestion, set your output directory to be a 
location on Hadoop and it will work.
 If you want to eagerly authenticate against a secured hadoop/hdfs cluster you 
must set `druid.hadoop.security.kerberos.principal` and 
`druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job 
method that runs `kinit` command periodically.
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use the AWS S3 or the Google Cloud Storage as the deep storage 
via HDFS.
+
+ Configuration for AWS S3
 
-The HDFS extension can also be used for GCS as deep storage.
+To use the AWS S3 as the deep storage, you need to configure 
`druid.storage.storageDirectory` properly.
 
 |Property|Possible Values|Description|Default|
 ||---|---|---|
-|`druid.storage.type`|hdfs||Must be set.|
-|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|
+|`druid.storage.type`|hdfs| |Must be set.|
+|`druid.storage.storageDirectory`|s3a://bucket/example/directory or 
s3n://bucket/example/directory|Path to the deep storage|Must be set.|
 
-All services that need to access GCS need to have the [GCS connector 
jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation)
 in their class path. One option is to place this jar in /lib/ and 
/extensions/druid-hdfs-storage/
+You also need to include the [Hadoop AWS 
module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html),
 especially the `hadoop-aws.jar` in the Druid classpath.
+Run the below command to install the `hadoop-aws.jar` file under 
`${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes.
 
-Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
-
-
+```bash
+java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps 
-h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
+cp 
${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar
 ${DRUID_HOME}/extensions/druid-hdfs-storage/
+```
 
-## Native batch ingestion
+Finally, you need to add the below properties in the `core-site.xml`.
+For more configurations, see the [Hadoop AWS 
module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).
+
+```xml
+<property>
+  <name>fs.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+  <description>The implementation class of the S3A Filesystem</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3A</value>
+  <description>The implementation class of the S3A AbstractFileSystem.</description>
+</property>
+
+<property>
+  <name>fs.s3a.access.key</name>
+  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your access key</value>
+</property>
+
+<property>
+  <name>fs.s3a.secret.key</name>
+  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your secret key</value>
+</property>
+```
 
-This firehose ingests events from a predefined list of files from a Hadoop 
filesystem.
-This firehose is _splittable_ and can be used by [native parallel index 
tasks](../../ingestion/native-batch.md#parallel-task).
-Since each split represents an HDFS file, each worker task of `index_parallel` 
will read an object.
+ Configuration for Google Cloud Storage
 
-Sample spec:
+To use the Google cloud Storage as the deep storage, you need to configure 
`druid.storage.storageDirectory` properly.
 
-```json
-"firehose" : {
-"type" : "hdfs",
-"paths": "/foo/bar,/foo/baz"
-}
+|Property|Possible Values|Description|Default|
+||---|---|---|
+|`druid.storage.type`|hdfs||Must be set.|
+|`druid.storage.storageDirectory`|gs://bucket/example/directory|Path to the 
deep storage|Must be set.|
+
+All services that need to access GCS need to have the [GCS connector 
jar](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
 in their class path.
+One option is to place this jar in `${DRUID_HOME}/lib/` and 
`${DRUID_HOME}/extensions/druid-hdfs-storage/`.
+
+Finally, you need to add the below properties in the `core-site.xml`.
+For 

[GitHub] [druid] suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format

2020-01-15 Thread GitBox
suneet-s commented on a change in pull request #9171: Doc update for the new 
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367141503
 
 

 ##
 File path: docs/development/extensions-core/hdfs.md
 ##
 @@ -36,49 +36,105 @@ To use this Apache Druid extension, make sure to 
[include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal 
user name |empty|
 
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path
 to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a 
location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration 
files (such as `core-site.xml`, `hdfs-site.xml`)
+in the Druid classpath. One way to do this is copying all those files under 
`${DRUID_HOME}/conf/_common`.
+
+If you are using the Hadoop ingestion, set your output directory to be a 
location on Hadoop and it will work.
 If you want to eagerly authenticate against a secured hadoop/hdfs cluster you 
must set `druid.hadoop.security.kerberos.principal` and 
`druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job 
method that runs `kinit` command periodically.
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use the AWS S3 or the Google Cloud Storage as the deep storage 
via HDFS.
+
+ Configuration for AWS S3
 
-The HDFS extension can also be used for GCS as deep storage.
+To use the AWS S3 as the deep storage, you need to configure 
`druid.storage.storageDirectory` properly.
 
 |Property|Possible Values|Description|Default|
 ||---|---|---|
-|`druid.storage.type`|hdfs||Must be set.|
-|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|
+|`druid.storage.type`|hdfs| |Must be set.|
+|`druid.storage.storageDirectory`|s3a://bucket/example/directory or 
s3n://bucket/example/directory|Path to the deep storage|Must be set.|
 
-All services that need to access GCS need to have the [GCS connector 
jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation)
 in their class path. One option is to place this jar in /lib/ and 
/extensions/druid-hdfs-storage/
+You also need to include the [Hadoop AWS 
module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html),
 especially the `hadoop-aws.jar` in the Druid classpath.
+Run the below command to install the `hadoop-aws.jar` file under 
`${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes.
 
-Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
-
-
+```bash
+java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps 
-h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
+cp 
${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar
 ${DRUID_HOME}/extensions/druid-hdfs-storage/
+```
 
-## Native batch ingestion
+Finally, you need to add the below properties in the `core-site.xml`.
+For more configurations, see the [Hadoop AWS 
module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).
+
+```xml
+<property>
+  <name>fs.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+  <description>The implementation class of the S3A Filesystem</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3A</value>
+  <description>The implementation class of the S3A AbstractFileSystem.</description>
+</property>
+
+<property>
+  <name>fs.s3a.access.key</name>
+  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your access key</value>
+</property>
+
+<property>
+  <name>fs.s3a.secret.key</name>
+  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your secret key</value>
+</property>
+```
 
-This firehose ingests events from a predefined list of files from a Hadoop 
filesystem.
-This firehose is _splittable_ and can be used by [native parallel index 
tasks](../../ingestion/native-batch.md#parallel-task).
-Since each split represents an HDFS file, each worker task of `index_parallel` 
will read an object.
+ Configuration for Google Cloud Storage
 
-Sample spec:
+To use the Google cloud Storage as the deep storage, you need to configure 
`druid.storage.storageDirectory` properly.
 
-```json
-"firehose" : {
-"type" : "hdfs",
-"paths": "/foo/bar,/foo/baz"
-}
+|Property|Possible Values|Description|Default|
+||---|---|---|
+|`druid.storage.type`|hdfs||Must be set.|
+|`druid.storage.storageDirectory`|gs://bucket/example/directory|Path to the 
deep storage|Must be set.|
+
+All services that need to access GCS need to have the [GCS connector 
jar](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
 in their class path.
+One option is to place this jar in `${DRUID_HOME}/lib/` and 
`${DRUID_HOME}/extensions/druid-hdfs-storage/`.
+
+Finally, you need to add the below properties in the `core-site.xml`.
+For 

[GitHub] [druid] suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format

2020-01-15 Thread GitBox
suneet-s commented on a change in pull request #9171: Doc update for the new 
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367138675
 
 

 ##
 File path: docs/development/extensions-core/hdfs.md
 ##
 @@ -36,49 +36,105 @@ To use this Apache Druid extension, make sure to 
[include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal 
user name |empty|
 
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path
 to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a 
location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration 
files (such as `core-site.xml`, `hdfs-site.xml`)
+in the Druid classpath. One way to do this is copying all those files under 
`${DRUID_HOME}/conf/_common`.
+
+If you are using the Hadoop ingestion, set your output directory to be a 
location on Hadoop and it will work.
 If you want to eagerly authenticate against a secured hadoop/hdfs cluster you 
must set `druid.hadoop.security.kerberos.principal` and 
`druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job 
method that runs `kinit` command periodically.
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use the AWS S3 or the Google Cloud Storage as the deep storage 
via HDFS.
+
+ Configuration for AWS S3
 
 Review comment:
  Sorry, please ignore - I see that the hadoop-aws module needs to be added, as mentioned below.





[GitHub] [druid] suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format

2020-01-15 Thread GitBox
suneet-s commented on a change in pull request #9171: Doc update for the new 
input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367137986
 
 

 ##
 File path: docs/development/extensions-core/hdfs.md
 ##
 @@ -36,49 +36,105 @@ To use this Apache Druid extension, make sure to 
[include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal 
user name |empty|
 
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path
 to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a 
location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration 
files (such as `core-site.xml`, `hdfs-site.xml`)
+in the Druid classpath. One way to do this is copying all those files under 
`${DRUID_HOME}/conf/_common`.
+
+If you are using the Hadoop ingestion, set your output directory to be a 
location on Hadoop and it will work.
 If you want to eagerly authenticate against a secured hadoop/hdfs cluster you 
must set `druid.hadoop.security.kerberos.principal` and 
`druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job 
method that runs `kinit` command periodically.
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use the AWS S3 or the Google Cloud Storage as the deep storage 
via HDFS.
+
+ Configuration for AWS S3
 
 Review comment:
   Do I need to add the s3 extension for this support or is it bundled with the 
hdfs extension somehow?
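
For context, a hedged sketch of the common runtime properties that the doc hunk above implies for S3-as-deep-storage through the HDFS extension (bucket and path are placeholders, and it assumes the `hadoop-aws` jar plus the `fs.s3a.*` settings in `core-site.xml` are in place as described above):

```properties
# Sketch only, based on the doc hunk above; not a verified end-to-end configuration.
druid.extensions.loadList=["druid-hdfs-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=s3a://my-bucket/druid/segments
```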

