suneet-s commented on a change in pull request #9171: Doc update for the new input source and the new input format
URL: https://github.com/apache/druid/pull/9171#discussion_r367139554
########## File path: docs/development/extensions-core/hdfs.md ##########

@@ -36,49 +36,105 @@ To use this Apache Druid extension, make sure to [include](../../development/ext
 |`druid.hadoop.security.kerberos.principal`|`dr...@example.com`| Principal user name |empty|
 |`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|
 
-If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work.
+Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml` and `hdfs-site.xml`)
+in the Druid classpath. One way to do this is to copy all of those files into `${DRUID_HOME}/conf/_common`.
+
+If you are using Hadoop ingestion, set your output directory to be a location on Hadoop and it will work.
 If you want to eagerly authenticate against a secured hadoop/hdfs cluster you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job method that runs `kinit` command periodically.
 
-### Configuration for Google Cloud Storage
+### Configuration for Cloud Storage
+
+You can also use AWS S3 or Google Cloud Storage as deep storage via HDFS.
+
+#### Configuration for AWS S3
 
-The HDFS extension can also be used for GCS as deep storage.
+To use AWS S3 as deep storage, you need to configure `druid.storage.storageDirectory` properly.
 
 |Property|Possible Values|Description|Default|
 |--------|---------------|-----------|-------|
-|`druid.storage.type`|hdfs||Must be set.|
-|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|
+|`druid.storage.type`|hdfs| |Must be set.|
+|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.|
 
-All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in <druid>/lib/ and <druid>/extensions/druid-hdfs-storage/
+You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially `hadoop-aws.jar`, in the Druid classpath.
+Run the command below to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` on all nodes.
 
-Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
-
-<a name="firehose"></a>
+```bash
+java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"
+cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
+```
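With `hadoop-aws.jar` in place, the storage settings from the table above are ordinary runtime properties. A minimal sketch, assuming the common properties file lives at `${DRUID_HOME}/conf/_common/common.runtime.properties` (adjust to your layout) and using a placeholder bucket and key prefix:

```bash
# Hypothetical sketch: point HDFS deep storage at an S3 bucket via the s3a scheme.
# "my-bucket" and "druid/segments" are placeholders; the druid-hdfs-storage
# extension must also appear in druid.extensions.loadList.
cat >> "${DRUID_HOME}/conf/_common/common.runtime.properties" <<'EOF'
druid.storage.type=hdfs
druid.storage.storageDirectory=s3a://my-bucket/druid/segments
EOF
```

The same two properties drive the GCS setup below; only the URI scheme changes.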
 
-## Native batch ingestion
+Finally, you need to add the properties below to `core-site.xml`.
+For more configuration options, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).
+
+```xml
+<property>
+  <name>fs.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
+  <description>The implementation class of the S3A Filesystem</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.s3a.impl</name>
+  <value>org.apache.hadoop.fs.s3a.S3A</value>
+  <description>The implementation class of the S3A AbstractFileSystem.</description>
+</property>
+
+<property>
+  <name>fs.s3a.access.key</name>
+  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your access key</value>
+</property>
+
+<property>
+  <name>fs.s3a.secret.key</name>
+  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
+  <value>your secret key</value>
+</property>
+```
 
-This firehose ingests events from a predefined list of files from a Hadoop filesystem.
-This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
-Since each split represents an HDFS file, each worker task of `index_parallel` will read an object.
+#### Configuration for Google Cloud Storage
 
-Sample spec:
+To use Google Cloud Storage as deep storage, you need to configure `druid.storage.storageDirectory` properly.
 
-```json
-"firehose" : {
-  "type" : "hdfs",
-  "paths": "/foo/bar,/foo/baz"
-}
+|Property|Possible Values|Description|Default|
+|--------|---------------|-----------|-------|
+|`druid.storage.type`|hdfs||Must be set.|
+|`druid.storage.storageDirectory`|gs://bucket/example/directory|Path to the deep storage|Must be set.|
+
+All services that need to access GCS need to have the [GCS connector jar](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md) in their class path.
+One option is to place this jar in `${DRUID_HOME}/lib/` and `${DRUID_HOME}/extensions/druid-hdfs-storage/`, as sketched after the configuration below.
+
+Finally, you need to add the properties below to `core-site.xml`.
+For more configuration options, see the [instructions to configure Hadoop](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md#configure-hadoop),
+[GCS core default](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/conf/gcs-core-default.xml)
+and [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml).
+
+```xml
+<property>
+  <name>fs.gs.impl</name>
+  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
+  <description>The FileSystem for gs: (GCS) uris.</description>
+</property>
+
+<property>
+  <name>fs.AbstractFileSystem.gs.impl</name>
+  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
+  <description>The AbstractFileSystem for gs: uris.</description>
+</property>
 ```
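A short, hedged sketch of staging the connector jar in the two locations mentioned above. The download URL and the `hadoop2-latest` build are assumptions based on the GCS connector install guide, so match the artifact to your Hadoop version:

```bash
# Hypothetical sketch: fetch the GCS connector and copy it where Druid can load it.
# The "hadoop2-latest" artifact is an assumption; pick the connector release
# that matches your Hadoop version.
wget -O gcs-connector.jar \
  "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar"
cp gcs-connector.jar "${DRUID_HOME}/lib/"
cp gcs-connector.jar "${DRUID_HOME}/extensions/druid-hdfs-storage/"
```

The deep-storage properties then mirror the S3 sketch above, with `druid.storage.storageDirectory` set to a `gs://` URI.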
 
-This firehose provides caching and prefetching features. During native batch indexing, a firehose can be read twice if
-`intervals` are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scanning
-of files is slow.
-
-|Property|Description|Default|
-|--------|-----------|-------|
-|type|This should be `hdfs`.|none (required)|
-|paths|HDFS paths. Can be either a JSON array or comma-separated string of paths. Wildcards like `*` are supported in these paths.|none (required)|
-|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|
-|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|
-|prefetchTriggerBytes|Threshold to trigger prefetching files.|maxFetchCapacityBytes / 2|
-|fetchTimeout|Timeout for fetching each file.|60000|
-|maxFetchRetry|Maximum number of retries for fetching each file.|3|
+Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.

Review comment:
   Is this still accurate? Have we done more recent tests?