suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367140667
 
 

 ##########
 File path: docs/development/extensions-core/hdfs.md
 ##########
 @@ -36,49 +36,105 @@ To use this Apache Druid extension, make sure to [include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`|Principal user name|empty|
 |`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work.
+In addition to the settings above, you also need to include all Hadoop configuration files (such as `core-site.xml` and `hdfs-site.xml`)
+in the Druid classpath. One way to do this is to copy those files into `${DRUID_HOME}/conf/_common`.
+
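For example, a minimal sketch of that copy step, assuming the Hadoop client configuration lives in `/etc/hadoop/conf` (an assumed location; adjust the source path to your cluster):

```bash
# Copy the Hadoop client configuration into the Druid classpath.
# /etc/hadoop/conf is an assumed location; use wherever your cluster keeps these files.
cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml "${DRUID_HOME}/conf/_common/"
```
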
+If you are using Hadoop ingestion, set your output directory to be a location on Hadoop and it will work.
 If you want to eagerly authenticate against a secured Hadoop/HDFS cluster, you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`. This is an alternative to the cron job method that runs the `kinit` command periodically.
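
For the cron-based alternative, a minimal sketch of the command to schedule periodically; the keytab path matches the example above, while the principal is a placeholder for your own:

```bash
# Run periodically (for example from cron) to keep the Kerberos ticket fresh.
# The keytab path matches the example above; druid@EXAMPLE.COM is a placeholder principal.
kinit -kt /etc/security/keytabs/druid.headlessUser.keytab druid@EXAMPLE.COM
```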
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use AWS S3 or Google Cloud Storage as deep storage via HDFS.
+
+#### Configuration for AWS S3
 
-The HDFS extension can also be used for GCS as deep storage.
+To use AWS S3 as deep storage, you need to configure `druid.storage.storageDirectory` properly.
 
 |Property|Possible Values|Description|Default|
 |--------|---------------|-----------|-------|
-|`druid.storage.type`|hdfs||Must be set.|
-|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|
+|`druid.storage.type`|hdfs| |Must be set.|
+|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.|
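
As a concrete illustration of those two properties, the snippet below appends them to a common runtime properties file; the file path, bucket, and prefix are assumptions, so adjust them to your deployment:

```bash
# Illustration only: the properties-file path, bucket, and prefix are placeholders.
cat <<'EOF' >> "${DRUID_HOME}/conf/_common/common.runtime.properties"
druid.storage.type=hdfs
druid.storage.storageDirectory=s3a://your-bucket/druid/segments
EOF
```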
 
-All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in <druid>/lib/ and <druid>/extensions/druid-hdfs-storage/
+You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), in particular `hadoop-aws.jar`, in the Druid classpath.
+Run the commands below on every node to install `hadoop-aws.jar` under `${DRUID_HOME}/extensions/druid-hdfs-storage/`, where `HADOOP_VERSION` matches the Hadoop version used by your cluster.
 
-Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
-
-<a name="firehose"></a>
+```bash
+java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"
+cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
+```
 
-## Native batch ingestion
+Finally, you need to add the properties below to `core-site.xml`.
+For more configuration options, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).
+
+```xml
+<property>
+  <name>fs.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+  <description>The implementation class of the S3A Filesystem</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3A</value>
+  <description>The implementation class of the S3A AbstractFileSystem.</description>
+</property>
+
+<property>
+  <name>fs.s3a.access.key</name>
+  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your access key</value>
+</property>
+
+<property>
+  <name>fs.s3a.secret.key</name>
+  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your secret key</value>
+</property>
+```
 
-This firehose ingests events from a predefined list of files from a Hadoop filesystem.
-This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
-Since each split represents an HDFS file, each worker task of `index_parallel` will read an object.
+#### Configuration for Google Cloud Storage
 
-Sample spec:
+To use Google Cloud Storage as deep storage, you need to configure `druid.storage.storageDirectory` properly.
 
-```json
-"firehose" : {
-    "type" : "hdfs",
-    "paths": "/foo/bar,/foo/baz"
-}
+|Property|Possible Values|Description|Default|
+|--------|---------------|-----------|-------|
+|`druid.storage.type`|hdfs||Must be set.|
+|`druid.storage.storageDirectory`|gs://bucket/example/directory|Path to the deep storage|Must be set.|
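
For example (with the same assumption as the S3 snippet above about where the common runtime properties file lives; bucket and prefix are placeholders):

```bash
# Illustration only: the properties-file path, bucket, and prefix are placeholders.
cat <<'EOF' >> "${DRUID_HOME}/conf/_common/common.runtime.properties"
druid.storage.type=hdfs
druid.storage.storageDirectory=gs://your-bucket/druid/segments
EOF
```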
+
+All services that need to access GCS need to have the [GCS connector jar](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md) in their classpath.
+One option is to place this jar in `${DRUID_HOME}/lib/` and `${DRUID_HOME}/extensions/druid-hdfs-storage/`.
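
A sketch of that placement step; the jar file name below is a placeholder, so use the connector build that matches your Hadoop version (see the install guide linked above):

```bash
# Copy the GCS connector jar (placeholder file name) into both locations mentioned above.
cp gcs-connector-hadoop2-latest.jar "${DRUID_HOME}/lib/"
cp gcs-connector-hadoop2-latest.jar "${DRUID_HOME}/extensions/druid-hdfs-storage/"
```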
+
+Finally, you need to add the properties below to `core-site.xml`.
+For more configuration options, see the [instructions to configure Hadoop](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md#configure-hadoop),
+[GCS core default](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/conf/gcs-core-default.xml)
+and [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml).
+
+```xml
+<property>
+  <name>fs.gs.impl</name>
+  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
+  <description>The FileSystem for gs: (GCS) uris.</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.gs.impl</name>
+  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
+  <description>The AbstractFileSystem for gs: uris.</description>
+</property>
 ```
 
-This firehose provides caching and prefetching features. During native batch indexing, a firehose can be read twice if
-`intervals` are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scanning
-of files is slow.
-
-|Property|Description|Default|
-|--------|-----------|-------|
-|type|This should be `hdfs`.|none (required)|
-|paths|HDFS paths. Can be either a JSON array or comma-separated string of paths. Wildcards like `*` are supported in these paths.|none (required)|
-|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|
-|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|
-|prefetchTriggerBytes|Threshold to trigger prefetching files.|maxFetchCapacityBytes / 2|
-|fetchTimeout|Timeout for fetching each file.|60000|
-|maxFetchRetry|Maximum number of retries for fetching each file.|3|
+Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
+
+## Reading data from HDFS or Cloud Storage
+
+### Native batch ingestion
+
+The [HDFS input source](../../ingestion/native-batch.md#hdfs-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
+to read files directly from HDFS storage. However, we highly recommend using a proper
 
 Review comment:
   What type of input source should I use instead of the hdfs input source? Why is this beneficial?
