This is an automated email from the ASF dual-hosted git repository.
jonwei pushed a commit to branch 0.12.3
in repository https://gitbox.apache.org/repos/asf/incubator-druid.git
The following commit(s) were added to refs/heads/0.12.3 by this push:
new 1680c6e [Backport] Separate hadoop and native batch docs more (#6144)
1680c6e is described below
commit 1680c6eeaccda7abaaf02fbd49131e9db1b85551
Author: Jonathan Wei <[email protected]>
AuthorDate: Thu Aug 9 21:01:25 2018 -0700
[Backport] Separate hadoop and native batch docs more (#6144)
---
docs/content/ingestion/batch-ingestion.md | 333 +--------------------
.../ingestion/{batch-ingestion.md => hadoop.md} | 64 ++--
docs/content/ingestion/native-batch.md | 175 +++++++++++
docs/content/ingestion/tasks.md | 176 +----------
docs/content/toc.md | 2 +
5 files changed, 213 insertions(+), 537 deletions(-)
diff --git a/docs/content/ingestion/batch-ingestion.md
b/docs/content/ingestion/batch-ingestion.md
index 743ead5..fb5b4fc 100644
--- a/docs/content/ingestion/batch-ingestion.md
+++ b/docs/content/ingestion/batch-ingestion.md
@@ -6,338 +6,13 @@ layout: doc_page
Druid can load data from static files through a variety of methods described
here.
-## Hadoop-based Batch Ingestion
+## Native Batch Ingestion
-Hadoop-based batch ingestion in Druid is supported via a Hadoop-ingestion
task. These tasks can be posted to a running
-instance of a Druid [overlord](../design/indexing-service.html). A sample task
is shown below:
+Druid has built-in batch ingestion functionality. See
[here](../ingestion/native-batch.html) for more info.
-```json
-{
- "type" : "index_hadoop",
- "spec" : {
- "dataSchema" : {
- "dataSource" : "wikipedia",
- "parser" : {
- "type" : "hadoopyString",
- "parseSpec" : {
- "format" : "json",
- "timestampSpec" : {
- "column" : "timestamp",
- "format" : "auto"
- },
- "dimensionsSpec" : {
- "dimensions":
["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
- "dimensionExclusions" : [],
- "spatialDimensions" : []
- }
- }
- },
- "metricsSpec" : [
- {
- "type" : "count",
- "name" : "count"
- },
- {
- "type" : "doubleSum",
- "name" : "added",
- "fieldName" : "added"
- },
- {
- "type" : "doubleSum",
- "name" : "deleted",
- "fieldName" : "deleted"
- },
- {
- "type" : "doubleSum",
- "name" : "delta",
- "fieldName" : "delta"
- }
- ],
- "granularitySpec" : {
- "type" : "uniform",
- "segmentGranularity" : "DAY",
- "queryGranularity" : "NONE",
- "intervals" : [ "2013-08-31/2013-09-01" ]
- }
- },
- "ioConfig" : {
- "type" : "hadoop",
- "inputSpec" : {
- "type" : "static",
- "paths" : "/MyDirectory/example/wikipedia_data.json"
- }
- },
- "tuningConfig" : {
- "type": "hadoop"
- }
- },
- "hadoopDependencyCoordinates": <my_hadoop_version>
-}
-```
+## Hadoop Batch Ingestion
-|property|description|required?|
-|--------|-----------|---------|
-|type|The task type, this should always be "index_hadoop".|yes|
-|spec|A Hadoop Index Spec. See [Batch
Ingestion](../ingestion/batch-ingestion.html)|yes|
-|hadoopDependencyCoordinates|A JSON array of Hadoop dependency coordinates
that Druid will use, this property will override the default Hadoop
coordinates. Once specified, Druid will look for those Hadoop dependencies from
the location specified by `druid.extensions.hadoopDependenciesDir`|no|
-|classpathPrefix|Classpath that will be pre-appended for the peon process.|no|
-
-also note that, druid automatically computes the classpath for hadoop job
containers that run in hadoop cluster. But, in case of conflicts between hadoop
and druid's dependencies, you can manually specify the classpath by setting
`druid.extensions.hadoopContainerDruidClasspath` property. See the extensions
config in [base druid configuration](../configuration/index.html).
-
-### DataSchema
-
-This field is required. See [Ingestion](../ingestion/index.html).
-
-### IOConfig
-
-This field is required.
-
-|Field|Type|Description|Required|
-|-----|----|-----------|--------|
-|type|String|This should always be 'hadoop'.|yes|
-|inputSpec|Object|A specification of where to pull the data in from. See
below.|yes|
-|segmentOutputPath|String|The path to dump segments into.|yes|
-|metadataUpdateSpec|Object|A specification of how to update the metadata for
the druid cluster these segments belong to.|yes|
-
-#### InputSpec specification
-
-There are multiple types of inputSpecs:
-
-##### `static`
-
-A type of inputSpec where a static path to the data files is provided.
-
-|Field|Type|Description|Required|
-|-----|----|-----------|--------|
-|paths|Array of String|A String of input paths indicating where the raw data
is located.|yes|
-
-For example, using the static input paths:
-
-```
-"paths" :
"s3n://billy-bucket/the/data/is/here/data.gz,s3n://billy-bucket/the/data/is/here/moredata.gz,s3n://billy-bucket/the/data/is/here/evenmoredata.gz"
-```
-
-##### `granularity`
-
-A type of inputSpec that expects data to be organized in directories according
to datetime using the path format: `y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX` (where
date is represented by lowercase and time is represented by uppercase).
-
-|Field|Type|Description|Required|
-|-----|----|-----------|--------|
-|dataGranularity|String|Specifies the granularity to expect the data at, e.g.
hour means to expect directories `y=XXXX/m=XX/d=XX/H=XX`.|yes|
-|inputPath|String|Base path to append the datetime path to.|yes|
-|filePattern|String|Pattern that files should match to be included.|yes|
-|pathFormat|String|Joda datetime format for each directory. Default value is
`"'y'=yyyy/'m'=MM/'d'=dd/'H'=HH"`, or see [Joda
documentation](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html)|no|
-
-For example, if the sample config were run with the interval
2012-06-01/2012-06-02, it would expect data at the paths:
-
-```
-s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=00
-s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
-...
-s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
-```
-
-##### `dataSource`
-
-Read Druid segments. See [here](../ingestion/update-existing-data.html) for
more information.
-
-##### `multi`
-
-Read multiple sources of data. See
[here](../ingestion/update-existing-data.html) for more information.
-
-### TuningConfig
-
-The tuningConfig is optional and default parameters will be used if no
tuningConfig is specified.
-
-|Field|Type|Description|Required|
-|-----|----|-----------|--------|
-|workingPath|String|The working path to use for intermediate results (results
between Hadoop jobs).|no (default == '/tmp/druid-indexing')|
-|version|String|The version of created segments. Ignored for HadoopIndexTask
unless useExplicitVersion is set to true|no (default == datetime that indexing
starts at)|
-|partitionsSpec|Object|A specification of how to partition each time bucket
into segments. Absence of this property means no partitioning will occur. See
'Partitioning specification' below.|no (default == 'hashed')|
-|maxRowsInMemory|Integer|The number of rows to aggregate before persisting.
Note that this is the number of post-aggregation rows which may not be equal to
the number of input events due to roll-up. This is used to manage the required
JVM heap size.|no (default == 75000)|
-|leaveIntermediate|Boolean|Leave behind intermediate files (for debugging) in
the workingPath when a job completes, whether it passes or fails.|no (default
== false)|
-|cleanupOnFailure|Boolean|Clean up intermediate files when a job fails (unless
leaveIntermediate is on).|no (default == true)|
-|overwriteFiles|Boolean|Override existing files found during indexing.|no
(default == false)|
-|ignoreInvalidRows|Boolean|Ignore rows found to have problems.|no (default ==
false)|
-|combineText|Boolean|Use CombineTextInputFormat to combine multiple files into
a file split. This can speed up Hadoop jobs when processing a large number of
small files.|no (default == false)|
-|useCombiner|Boolean|Use Hadoop combiner to merge rows at mapper if
possible.|no (default == false)|
-|jobProperties|Object|A map of properties to add to the Hadoop job
configuration, see below for details.|no (default == null)|
-|indexSpec|Object|Tune how data is indexed. See below for more information.|no|
-|numBackgroundPersistThreads|Integer|The number of new background threads to
use for incremental persists. Using this feature causes a notable increase in
memory pressure and cpu usage but will make the job finish more quickly. If
changing from the default of 0 (use current thread for persists), we recommend
setting it to 1.|no (default == 0)|
-|forceExtendableShardSpecs|Boolean|Forces use of extendable shardSpecs.
Experimental feature intended for use with the [Kafka indexing service
extension](../development/extensions-core/kafka-ingestion.html).|no (default =
false)|
-|useExplicitVersion|Boolean|Forces HadoopIndexTask to use version.|no (default
= false)|
-
-#### jobProperties field of TuningConfig
-
-```json
- "tuningConfig" : {
- "type": "hadoop",
- "jobProperties": {
- "<hadoop-property-a>": "<value-a>",
- "<hadoop-property-b>": "<value-b>"
- }
- }
-```
-
-Hadoop's [MapReduce
documentation](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml)
lists the possible configuration parameters.
-
-With some Hadoop distributions, it may be necessary to set
`mapreduce.job.classpath` or `mapreduce.job.user.classpath.first`
-to avoid class loading issues. See the [working with different Hadoop versions
documentation](../operations/other-hadoop.html)
-for more details.
-
-#### IndexSpec
-
-|Field|Type|Description|Required|
-|-----|----|-----------|--------|
-|bitmap|Object|Compression format for bitmap indexes. Should be a JSON object;
see below for options.|no (defaults to Concise)|
-|dimensionCompression|String|Compression format for dimension columns. Choose
from `LZ4`, `LZF`, or `uncompressed`.|no (default == `LZ4`)|
-|metricCompression|String|Compression format for metric columns. Choose from
`LZ4`, `LZF`, `uncompressed`, or `none`.|no (default == `LZ4`)|
-|longEncoding|String|Encoding format for metric and dimension columns with
type long. Choose from `auto` or `longs`. `auto` encodes the values using
offset or lookup table depending on column cardinality, and store them with
variable size. `longs` stores the value as is with 8 bytes each.|no (default ==
`longs`)|
-
-##### Bitmap types
-
-For Concise bitmaps:
-
-|Field|Type|Description|Required|
-|-----|----|-----------|--------|
-|type|String|Must be `concise`.|yes|
-
-For Roaring bitmaps:
-
-|Field|Type|Description|Required|
-|-----|----|-----------|--------|
-|type|String|Must be `roaring`.|yes|
-|compressRunOnSerialization|Boolean|Use a run-length encoding where it is
estimated as more space efficient.|no (default == `true`)|
-
-### Partitioning specification
-
-Segments are always partitioned based on timestamp (according to the
granularitySpec) and may be further partitioned in
-some other way depending on partition type. Druid supports two types of
partitioning strategies: "hashed" (based on the
-hash of all dimensions in each row), and "dimension" (based on ranges of a
single dimension).
-
-Hashed partitioning is recommended in most cases, as it will improve indexing
performance and create more uniformly
-sized data segments relative to single-dimension partitioning.
-
-#### Hash-based partitioning
-
-```json
- "partitionsSpec": {
- "type": "hashed",
- "targetPartitionSize": 5000000
- }
-```
-
-Hashed partitioning works by first selecting a number of segments, and then
partitioning rows across those segments
-according to the hash of all dimensions in each row. The number of segments is
determined automatically based on the
-cardinality of the input set and a target partition size.
-
-The configuration options are:
-
-|Field|Description|Required|
-|--------|-----------|---------|
-|type|Type of partitionSpec to be used.|"hashed"|
-|targetPartitionSize|Target number of rows to include in a partition, should
be a number that targets segments of 500MB\~1GB.|either this or numShards|
-|numShards|Specify the number of partitions directly, instead of a target
partition size. Ingestion will run faster, since it can skip the step necessary
to select a number of partitions automatically.|either this or
targetPartitionSize|
-|partitionDimensions|The dimensions to partition on. Leave blank to select all
dimensions. Only used with numShards, will be ignored when targetPartitionSize
is set|no|
-
-#### Single-dimension partitioning
-
-```json
- "partitionsSpec": {
- "type": "dimension",
- "targetPartitionSize": 5000000
- }
-```
-
-Single-dimension partitioning works by first selecting a dimension to
partition on, and then separating that dimension
-into contiguous ranges. Each segment will contain all rows with values of that
dimension in that range. For example,
-your segments may be partitioned on the dimension "host" using the ranges
"a.example.com" to "f.example.com" and
-"f.example.com" to "z.example.com". By default, the dimension to use is
determined automatically, although you can
-override it with a specific dimension.
-
-The configuration options are:
-
-|Field|Description|Required|
-|--------|-----------|---------|
-|type|Type of partitionSpec to be used.|"dimension"|
-|targetPartitionSize|Target number of rows to include in a partition, should
be a number that targets segments of 500MB\~1GB.|yes|
-|maxPartitionSize|Maximum number of rows to include in a partition. Defaults
to 50% larger than the targetPartitionSize.|no|
-|partitionDimension|The dimension to partition on. Leave blank to select a
dimension automatically.|no|
-|assumeGrouped|Assume that input data has already been grouped on time and
dimensions. Ingestion will run faster, but may choose sub-optimal partitions if
this assumption is violated.|no|
-
-### Remote Hadoop Cluster
-
-If you have a remote Hadoop cluster, make sure to include the folder holding
your configuration `*.xml` files in your Druid `_common` configuration folder.
-
-If you are having dependency problems with your version of Hadoop and the
version compiled with Druid, please see [these
docs](../operations/other-hadoop.html).
-
-### Using Elastic MapReduce
-
-If your cluster is running on Amazon Web Services, you can use Elastic
MapReduce (EMR) to index data
-from S3. To do this:
-
-- Create a persistent, [long-running
cluster](http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-longrunning-transient.html).
-- When creating your cluster, enter the following configuration. If you're
using the wizard, this
-should be in advanced mode under "Edit software settings":
-
-```
-classification=yarn-site,properties=[mapreduce.reduce.memory.mb=6144,mapreduce.reduce.java.opts=-server
-Xms2g -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps,mapreduce.map.java.opts=758,mapreduce.map.java.opts=-server
-Xms512m -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.task.timeout=1800000]
-```
-
-- Follow the instructions under "[Configure Hadoop for data
-loads](../tutorials/cluster.html#configure-cluster-for-hadoop-data-loads)"
using the XML files from
-`/etc/hadoop/conf` on your EMR master.
-
-### Secured Hadoop Cluster
-
-By default druid can use the exisiting TGT kerberos ticket available in local
kerberos key cache.
-Although TGT ticket has a limited life cycle,
-therefore you need to call `kinit` command periodically to ensure validity of
TGT ticket.
-To avoid this extra external cron job script calling `kinit` periodically,
- you can provide the principal name and keytab location and druid will do the
authentication transparently at startup and job launching time.
-
-|Property|Possible Values|Description|Default|
-|--------|---------------|-----------|-------|
-|`druid.hadoop.security.kerberos.principal`|`[email protected]`| Principal
user name |empty|
-|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path
to keytab file|empty|
-
-#### Loading from S3 with EMR
-
-- In the `jobProperties` field in the `tuningConfig` section of your Hadoop
indexing task, add:
-
-```
-"jobProperties" : {
- "fs.s3.awsAccessKeyId" : "YOUR_ACCESS_KEY",
- "fs.s3.awsSecretAccessKey" : "YOUR_SECRET_KEY",
- "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
- "fs.s3n.awsAccessKeyId" : "YOUR_ACCESS_KEY",
- "fs.s3n.awsSecretAccessKey" : "YOUR_SECRET_KEY",
- "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
- "io.compression.codecs" :
"org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
-}
-```
-
-Note that this method uses Hadoop's built-in S3 filesystem rather than
Amazon's EMRFS, and is not compatible
-with Amazon-specific features such as S3 encryption and consistent views. If
you need to use these
-features, you will need to make the Amazon EMR Hadoop JARs available to Druid
through one of the
-mechanisms described in the [Using other Hadoop
distributions](#using-other-hadoop-distributions) section.
-
-## Using other Hadoop distributions
-
-Druid works out of the box with many Hadoop distributions.
-
-If you are having dependency conflicts between Druid and your version of
Hadoop, you can try
-searching for a solution in the [Druid user
groups](https://groups.google.com/forum/#!forum/druid-
-user), or reading the Druid [Different Hadoop
Versions](../operations/other-hadoop.html) documentation.
-
-## Command Line Hadoop Indexer
-
-If you don't want to use a full indexing service to use Hadoop to get data
into Druid, you can also use the standalone command line Hadoop indexer.
-See [here](../ingestion/command-line-hadoop-indexer.html) for more info.
-
-## IndexTask-based Batch Ingestion
-
-If you do not want to have a dependency on Hadoop for batch ingestion, you can
also use the index task. This task will be much slower and less scalable than
the Hadoop-based method. See [here](../ingestion/tasks.html) for more info.
+Hadoop can be used for batch ingestion. Hadoop-based batch ingestion is faster
and more scalable than native batch ingestion. See
[here](../ingestion/hadoop.html) for more details.
Having Problems?
----------------
diff --git a/docs/content/ingestion/batch-ingestion.md
b/docs/content/ingestion/hadoop.md
similarity index 93%
copy from docs/content/ingestion/batch-ingestion.md
copy to docs/content/ingestion/hadoop.md
index 743ead5..ac2cf7d 100644
--- a/docs/content/ingestion/batch-ingestion.md
+++ b/docs/content/ingestion/hadoop.md
@@ -2,14 +2,19 @@
layout: doc_page
---
-# Batch Data Ingestion
+# Hadoop-based Batch Ingestion
-Druid can load data from static files through a variety of methods described
here.
+Hadoop-based batch ingestion in Druid is supported via a Hadoop-ingestion
task. These tasks can be posted to a running
+instance of a Druid [overlord](../design/indexing-service.html).
-## Hadoop-based Batch Ingestion
+## Command Line Hadoop Indexer
-Hadoop-based batch ingestion in Druid is supported via a Hadoop-ingestion
task. These tasks can be posted to a running
-instance of a Druid [overlord](../design/indexing-service.html). A sample task
is shown below:
+If you don't want to use a full indexing service to use Hadoop to get data
into Druid, you can also use the standalone command line Hadoop indexer.
+See [here](../ingestion/command-line-hadoop-indexer.html) for more info.
+
+## Task syntax
+
+A sample task is shown below:
```json
{
@@ -84,11 +89,11 @@ instance of a Druid
[overlord](../design/indexing-service.html). A sample task i
also note that, druid automatically computes the classpath for hadoop job
containers that run in hadoop cluster. But, in case of conflicts between hadoop
and druid's dependencies, you can manually specify the classpath by setting
`druid.extensions.hadoopContainerDruidClasspath` property. See the extensions
config in [base druid configuration](../configuration/index.html).
-### DataSchema
+## DataSchema
This field is required. See [Ingestion](../ingestion/index.html).
-### IOConfig
+## IOConfig
This field is required.
@@ -99,11 +104,11 @@ This field is required.
|segmentOutputPath|String|The path to dump segments into.|yes|
|metadataUpdateSpec|Object|A specification of how to update the metadata for
the druid cluster these segments belong to.|yes|
-#### InputSpec specification
+### InputSpec specification
There are multiple types of inputSpecs:
-##### `static`
+#### `static`
A type of inputSpec where a static path to the data files is provided.
@@ -117,7 +122,7 @@ For example, using the static input paths:
"paths" :
"s3n://billy-bucket/the/data/is/here/data.gz,s3n://billy-bucket/the/data/is/here/moredata.gz,s3n://billy-bucket/the/data/is/here/evenmoredata.gz"
```
-##### `granularity`
+#### `granularity`
A type of inputSpec that expects data to be organized in directories according
to datetime using the path format: `y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX` (where
date is represented by lowercase and time is represented by uppercase).
@@ -137,15 +142,15 @@ s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
```
-##### `dataSource`
+#### `dataSource`
Read Druid segments. See [here](../ingestion/update-existing-data.html) for
more information.
-##### `multi`
+#### `multi`
Read multiple sources of data. See
[here](../ingestion/update-existing-data.html) for more information.
-### TuningConfig
+## TuningConfig
The tuningConfig is optional and default parameters will be used if no
tuningConfig is specified.
@@ -167,7 +172,7 @@ The tuningConfig is optional and default parameters will be
used if no tuningCon
|forceExtendableShardSpecs|Boolean|Forces use of extendable shardSpecs.
Experimental feature intended for use with the [Kafka indexing service
extension](../development/extensions-core/kafka-ingestion.html).|no (default =
false)|
|useExplicitVersion|Boolean|Forces HadoopIndexTask to use version.|no (default
= false)|
-#### jobProperties field of TuningConfig
+### jobProperties field of TuningConfig
```json
"tuningConfig" : {
@@ -185,7 +190,7 @@ With some Hadoop distributions, it may be necessary to set
`mapreduce.job.classp
to avoid class loading issues. See the [working with different Hadoop versions
documentation](../operations/other-hadoop.html)
for more details.
-#### IndexSpec
+### IndexSpec
|Field|Type|Description|Required|
|-----|----|-----------|--------|
@@ -194,7 +199,7 @@ for more details.
|metricCompression|String|Compression format for metric columns. Choose from
`LZ4`, `LZF`, `uncompressed`, or `none`.|no (default == `LZ4`)|
|longEncoding|String|Encoding format for metric and dimension columns with
type long. Choose from `auto` or `longs`. `auto` encodes the values using
offset or lookup table depending on column cardinality, and store them with
variable size. `longs` stores the value as is with 8 bytes each.|no (default ==
`longs`)|
-##### Bitmap types
+#### Bitmap types
For Concise bitmaps:
@@ -209,7 +214,7 @@ For Roaring bitmaps:
|type|String|Must be `roaring`.|yes|
|compressRunOnSerialization|Boolean|Use a run-length encoding where it is
estimated as more space efficient.|no (default == `true`)|
-### Partitioning specification
+## Partitioning specification
Segments are always partitioned based on timestamp (according to the
granularitySpec) and may be further partitioned in
some other way depending on partition type. Druid supports two types of
partitioning strategies: "hashed" (based on the
@@ -218,7 +223,7 @@ hash of all dimensions in each row), and "dimension" (based
on ranges of a singl
Hashed partitioning is recommended in most cases, as it will improve indexing
performance and create more uniformly
sized data segments relative to single-dimension partitioning.
-#### Hash-based partitioning
+### Hash-based partitioning
```json
"partitionsSpec": {
@@ -240,7 +245,7 @@ The configuration options are:
|numShards|Specify the number of partitions directly, instead of a target
partition size. Ingestion will run faster, since it can skip the step necessary
to select a number of partitions automatically.|either this or
targetPartitionSize|
|partitionDimensions|The dimensions to partition on. Leave blank to select all
dimensions. Only used with numShards, will be ignored when targetPartitionSize
is set|no|
-#### Single-dimension partitioning
+### Single-dimension partitioning
```json
"partitionsSpec": {
@@ -265,13 +270,13 @@ The configuration options are:
|partitionDimension|The dimension to partition on. Leave blank to select a
dimension automatically.|no|
|assumeGrouped|Assume that input data has already been grouped on time and
dimensions. Ingestion will run faster, but may choose sub-optimal partitions if
this assumption is violated.|no|
-### Remote Hadoop Cluster
+## Remote Hadoop Cluster
If you have a remote Hadoop cluster, make sure to include the folder holding
your configuration `*.xml` files in your Druid `_common` configuration folder.
If you are having dependency problems with your version of Hadoop and the
version compiled with Druid, please see [these
docs](../operations/other-hadoop.html).
-### Using Elastic MapReduce
+## Using Elastic MapReduce
If your cluster is running on Amazon Web Services, you can use Elastic
MapReduce (EMR) to index data
from S3. To do this:
@@ -288,7 +293,7 @@
classification=yarn-site,properties=[mapreduce.reduce.memory.mb=6144,mapreduce.r
loads](../tutorials/cluster.html#configure-cluster-for-hadoop-data-loads)"
using the XML files from
`/etc/hadoop/conf` on your EMR master.
-### Secured Hadoop Cluster
+## Secured Hadoop Cluster
By default druid can use the existing TGT kerberos ticket available in local
kerberos key cache.
Although TGT ticket has a limited life cycle,
@@ -301,7 +306,7 @@ To avoid this extra external cron job script calling
`kinit` periodically,
|`druid.hadoop.security.kerberos.principal`|`[email protected]`| Principal
user name |empty|
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path
to keytab file|empty|
-#### Loading from S3 with EMR
+## Loading from S3 with EMR
- In the `jobProperties` field in the `tuningConfig` section of your Hadoop
indexing task, add:
@@ -329,16 +334,3 @@ Druid works out of the box with many Hadoop distributions.
If you are having dependency conflicts between Druid and your version of
Hadoop, you can try
searching for a solution in the [Druid user
groups](https://groups.google.com/forum/#!forum/druid-
user), or reading the Druid [Different Hadoop
Versions](../operations/other-hadoop.html) documentation.
-
-## Command Line Hadoop Indexer
-
-If you don't want to use a full indexing service to use Hadoop to get data
into Druid, you can also use the standalone command line Hadoop indexer.
-See [here](../ingestion/command-line-hadoop-indexer.html) for more info.
-
-## IndexTask-based Batch Ingestion
-
-If you do not want to have a dependency on Hadoop for batch ingestion, you can
also use the index task. This task will be much slower and less scalable than
the Hadoop-based method. See [here](../ingestion/tasks.html) for more info.
-
-Having Problems?
-----------------
-Getting data into Druid can definitely be difficult for first time users.
Please don't hesitate to ask questions in our IRC channel or on our [google
groups page](https://groups.google.com/forum/#!forum/druid-user).
diff --git a/docs/content/ingestion/native-batch.md
b/docs/content/ingestion/native-batch.md
new file mode 100644
index 0000000..ae8de38
--- /dev/null
+++ b/docs/content/ingestion/native-batch.md
@@ -0,0 +1,175 @@
+---
+layout: doc_page
+---
+
+# Native batch ingestion
+
+The "Index Task" is Druid's native batch ingestion mechanism. The task
executes within the indexing service and does not require an external Hadoop
setup to use. The grammar of the index task is as follows:
+
+```json
+{
+ "type" : "index",
+ "spec" : {
+ "dataSchema" : {
+ "dataSource" : "wikipedia",
+ "parser" : {
+ "type" : "string",
+ "parseSpec" : {
+ "format" : "json",
+ "timestampSpec" : {
+ "column" : "timestamp",
+ "format" : "auto"
+ },
+ "dimensionsSpec" : {
+ "dimensions":
["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
+ "dimensionExclusions" : [],
+ "spatialDimensions" : []
+ }
+ }
+ },
+ "metricsSpec" : [
+ {
+ "type" : "count",
+ "name" : "count"
+ },
+ {
+ "type" : "doubleSum",
+ "name" : "added",
+ "fieldName" : "added"
+ },
+ {
+ "type" : "doubleSum",
+ "name" : "deleted",
+ "fieldName" : "deleted"
+ },
+ {
+ "type" : "doubleSum",
+ "name" : "delta",
+ "fieldName" : "delta"
+ }
+ ],
+ "granularitySpec" : {
+ "type" : "uniform",
+ "segmentGranularity" : "DAY",
+ "queryGranularity" : "NONE",
+ "intervals" : [ "2013-08-31/2013-09-01" ]
+ }
+ },
+ "ioConfig" : {
+ "type" : "index",
+ "firehose" : {
+ "type" : "local",
+ "baseDir" : "examples/indexing/",
+ "filter" : "wikipedia_data.json"
+ }
+ },
+ "tuningConfig" : {
+ "type" : "index",
+ "targetPartitionSize" : 5000000,
+ "maxRowsInMemory" : 75000
+ }
+ }
+}
+```
+
+## Task Properties
+
+|property|description|required?|
+|--------|-----------|---------|
+|type|The task type, this should always be "index".|yes|
+|id|The task ID. If this is not explicitly specified, Druid generates the task
ID using task type, data source name, interval, and date-time stamp. |no|
+|spec|The ingestion spec including the data schema, IOConfig, and
TuningConfig. See below for more details. |yes|
+|context|Context containing various task configuration parameters. See below
for more details.|no|
+
+## Task Priority
+
+Druid's indexing tasks use locks for atomic data ingestion. Each lock is
acquired for the combination of a dataSource and an interval. Once a task
acquires a lock, it can write data for the dataSource and the interval of the
acquired lock unless the lock is released or preempted. Please see the
[Locking section in Tasks](../ingestion/tasks.html#locking) for more details.
+
+Each task has a priority which is used for lock acquisition. The locks of
higher-priority tasks can preempt the locks of lower-priority tasks if they try
to acquire a lock for the same dataSource and interval. If some locks of a task
are preempted, the behavior of the preempted task depends on the task
implementation; usually, preempted tasks fail.
+
+Tasks can have different default priorities depending on their types. Here is a
list of the default priorities. The higher the number, the higher the priority.
+
+|task type|default priority|
+|---------|----------------|
+|Realtime index task|75|
+|Batch index task|50|
+|Merge/Append/Compaction task|25|
+|Other tasks|0|
+
+You can override the task priority by setting the priority in the task
context, as shown below.
+
+```json
+"context" : {
+ "priority" : 100
+}
+```
+
+## DataSchema
+
+This field is required.
+
+See [Ingestion](../ingestion/index.html)
+
+## IOConfig
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|type|The task type, this should always be "index".|none|yes|
+|firehose|Specify a [Firehose](../ingestion/firehose.html) here.|none|yes|
+|appendToExisting|Creates segments as additional shards of the latest version,
effectively appending to the segment set instead of replacing it. This will
only work if the existing segment set has extendable-type shardSpecs (which can
be forced by setting 'forceExtendableShardSpecs' in the tuning
config).|false|no|
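+
+For reference, a minimal ioConfig that appends to an existing datasource might
+look like the following sketch; the firehose values are simply reused from the
+sample spec above, and appending only works if the existing segment set has
+extendable-type shardSpecs:
+
+```json
+"ioConfig" : {
+  "type" : "index",
+  "firehose" : {
+    "type" : "local",
+    "baseDir" : "examples/indexing/",
+    "filter" : "wikipedia_data.json"
+  },
+  "appendToExisting" : true
+}
+```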
+
+## TuningConfig
+
+The tuningConfig is optional and default parameters will be used if no
tuningConfig is specified. See below for more details.
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|type|The task type, this should always be "index".|none|yes|
+|targetPartitionSize|Used in sharding. Determines how many rows are in each
segment.|5000000|no|
+|maxRowsInMemory|Used in determining when intermediate persists to disk should
occur.|75000|no|
+|maxTotalRows|Total number of rows in segments waiting to be published. Used in
determining when an intermediate publish should occur.|150000|no|
+|numShards|Directly specify the number of shards to create. If this is
specified and 'intervals' is specified in the granularitySpec, the index task
can skip the determine intervals/partitions pass through the data. numShards
cannot be specified if targetPartitionSize is set.|null|no|
+|indexSpec|defines segment storage format options to be used at indexing time,
see [IndexSpec](#indexspec)|null|no|
+|maxPendingPersists|Maximum number of persists that can be pending but not
started. If this limit would be exceeded by a new intermediate persist,
ingestion will block until the currently-running persist finishes. Maximum heap
memory usage for indexing scales with maxRowsInMemory * (2 +
maxPendingPersists).|0 (meaning one persist can be running concurrently with
ingestion, and none can be queued up)|no|
+|forceExtendableShardSpecs|Forces use of extendable shardSpecs. Experimental
feature intended for use with the [Kafka indexing service
extension](../development/extensions-core/kafka-ingestion.html).|false|no|
+|forceGuaranteedRollup|Forces guaranteeing the [perfect
rollup](../design/index.html). Perfect rollup optimizes the total size of
generated segments and query time, at the cost of increased indexing time. This
flag cannot be used with either `appendToExisting` of IOConfig or
`forceExtendableShardSpecs`. For more details, see the __Segment publishing
modes__ section below.|false|no|
+|reportParseExceptions|If true, exceptions encountered during parsing will be
thrown and will halt ingestion; if false, unparseable rows and fields will be
skipped.|false|no|
+|publishTimeout|Milliseconds to wait for publishing segments. It must be >= 0,
where 0 means to wait forever.|0|no|
+|segmentWriteOutMediumFactory|Segment write-out medium to use when creating
segments. See [Indexing Service
Configuration](../configuration/indexing-service.html) page,
"SegmentWriteOutMediumFactory" section for explanation and available
options.|Not specified, the value from
`druid.peon.defaultSegmentWriteOutMediumFactory` is used|no|
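+
+As a sketch, a tuningConfig combining several of the options above might look
+like the following; the values shown are just the defaults from the table,
+except for reportParseExceptions, which is flipped on here for illustration:
+
+```json
+"tuningConfig" : {
+  "type" : "index",
+  "targetPartitionSize" : 5000000,
+  "maxRowsInMemory" : 75000,
+  "maxTotalRows" : 150000,
+  "reportParseExceptions" : true
+}
+```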
+
+### IndexSpec
+
+The indexSpec defines segment storage format options to be used at indexing
time, such as bitmap type and column
+compression formats. The indexSpec is optional and default parameters will be
used if not specified.
+
+|Field|Type|Description|Required|
+|-----|----|-----------|--------|
+|bitmap|Object|Compression format for bitmap indexes. Should be a JSON object;
see below for options.|no (defaults to Concise)|
+|dimensionCompression|String|Compression format for dimension columns. Choose
from `LZ4`, `LZF`, or `uncompressed`.|no (default == `LZ4`)|
+|metricCompression|String|Compression format for metric columns. Choose from
`LZ4`, `LZF`, `uncompressed`, or `none`.|no (default == `LZ4`)|
+|longEncoding|String|Encoding format for metric and dimension columns with
type long. Choose from `auto` or `longs`. `auto` encodes the values using an
offset or lookup table depending on column cardinality, and stores them with
variable size. `longs` stores the values as-is with 8 bytes each.|no (default ==
`longs`)|
+
+#### Bitmap types
+
+For Concise bitmaps:
+
+|Field|Type|Description|Required|
+|-----|----|-----------|--------|
+|type|String|Must be `concise`.|yes|
+
+For Roaring bitmaps:
+
+|Field|Type|Description|Required|
+|-----|----|-----------|--------|
+|type|String|Must be `roaring`.|yes|
+|compressRunOnSerialization|Boolean|Use a run-length encoding where it is
estimated to be more space efficient.|no (default == `true`)|
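+
+For illustration, an indexSpec that switches to Roaring bitmaps and `auto` long
+encoding while keeping the default column compression might look like the
+following sketch:
+
+```json
+"indexSpec" : {
+  "bitmap" : {
+    "type" : "roaring",
+    "compressRunOnSerialization" : true
+  },
+  "dimensionCompression" : "LZ4",
+  "metricCompression" : "LZ4",
+  "longEncoding" : "auto"
+}
+```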
+
+## Segment publishing modes
+
+The Index task creates segments from the input data and publishes them. It
supports two segment publishing modes, _bulk publishing mode_ and _incremental
publishing mode_, for [perfect rollup and best-effort
rollup](../design/index.html) respectively.
+
+In bulk publishing mode, every segment is published at the very end of the
index task. Until then, created segments are stored in the memory and local
storage of the node running the index task. As a result, this mode can run into
limited storage capacity, and is not recommended for use in production.
+
+In contrast, in the incremental publishing mode, segments are published
incrementally, that is, they can be published in the middle of the index task.
More precisely, the index task collects data and stores created segments in the
memory and disks of the node running that task until the total number of
collected rows exceeds `maxTotalRows`. Once that limit is exceeded, the index
task immediately publishes all segments created up to that moment, cleans up
all published segments, and continues to i [...]
+
+To enable bulk publishing mode, `forceGuaranteedRollup` should be set to true
in the TuningConfig. Note that this option cannot be used with either
`forceExtendableShardSpecs` of TuningConfig or `appendToExisting` of IOConfig.
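+
+As a sketch, a tuningConfig enabling bulk publishing mode might look like the
+following; numShards is pinned explicitly here (an illustrative value, not one
+taken from this page) so the number of output shards is fixed up front, and
+targetPartitionSize is omitted because it cannot be combined with numShards:
+
+```json
+"tuningConfig" : {
+  "type" : "index",
+  "forceGuaranteedRollup" : true,
+  "numShards" : 3
+}
+```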
diff --git a/docs/content/ingestion/tasks.md b/docs/content/ingestion/tasks.md
index 4d424ef..5b8a1fe 100644
--- a/docs/content/ingestion/tasks.md
+++ b/docs/content/ingestion/tasks.md
@@ -9,181 +9,13 @@ There are several different types of tasks.
Segment Creation Tasks
----------------------
-### Hadoop Index Task
-
-See [batch ingestion](../ingestion/batch-ingestion.html).
-
-### Index Task
-
-The Index Task is a simpler variation of the Index Hadoop task that is
designed to be used for smaller data sets. The task executes within the
indexing service and does not require an external Hadoop setup to use. The
grammar of the index task is as follows:
-
-```json
-{
- "type" : "index",
- "spec" : {
- "dataSchema" : {
- "dataSource" : "wikipedia",
- "parser" : {
- "type" : "string",
- "parseSpec" : {
- "format" : "json",
- "timestampSpec" : {
- "column" : "timestamp",
- "format" : "auto"
- },
- "dimensionsSpec" : {
- "dimensions":
["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
- "dimensionExclusions" : [],
- "spatialDimensions" : []
- }
- }
- },
- "metricsSpec" : [
- {
- "type" : "count",
- "name" : "count"
- },
- {
- "type" : "doubleSum",
- "name" : "added",
- "fieldName" : "added"
- },
- {
- "type" : "doubleSum",
- "name" : "deleted",
- "fieldName" : "deleted"
- },
- {
- "type" : "doubleSum",
- "name" : "delta",
- "fieldName" : "delta"
- }
- ],
- "granularitySpec" : {
- "type" : "uniform",
- "segmentGranularity" : "DAY",
- "queryGranularity" : "NONE",
- "intervals" : [ "2013-08-31/2013-09-01" ]
- }
- },
- "ioConfig" : {
- "type" : "index",
- "firehose" : {
- "type" : "local",
- "baseDir" : "examples/indexing/",
- "filter" : "wikipedia_data.json"
- }
- },
- "tuningConfig" : {
- "type" : "index",
- "targetPartitionSize" : 5000000,
- "maxRowsInMemory" : 75000
- }
- }
-}
-```
-
-#### Task Properties
-
-|property|description|required?|
-|--------|-----------|---------|
-|type|The task type, this should always be "index".|yes|
-|id|The task ID. If this is not explicitly specified, Druid generates the task
ID using task type, data source name, interval, and date-time stamp. |no|
-|spec|The ingestion spec including the data schema, IOConfig, and
TuningConfig. See below for more details. |yes|
-|context|Context containing various task configuration parameters. See below
for more details.|no|
-
-#### Task Priority
-
-Druid's indexing tasks use locks for atomic data ingestion. Each lock is
acquired for the combination of a dataSource and an interval. Once a task
acquires a lock, it can write data for the dataSource and the interval of the
acquired lock unless the lock is released or preempted. Please see [the below
Locking section](#locking)
-
-Each task has a priority which is used for lock acquisition. The locks of
higher-priority tasks can preempt the locks of lower-priority tasks if they try
to acquire for the same dataSource and interval. If some locks of a task are
preempted, the behavior of the preempted task depends on the task
implementation. Usually, most tasks finish as failed if they are preempted.
-
-Tasks can have different default priorities depening on their types. Here are
a list of default priorities. Higher the number, higher the priority.
-
-|task type|default priority|
-|---------|----------------|
-|Realtime index task|75|
-|Batch index task|50|
-|Merge/Append/Compaction task|25|
-|Other tasks|0|
-
-You can override the task priority by setting your priority in the task
context like below.
-
-```json
-"context" : {
- "priority" : 100
-}
-```
-
-#### DataSchema
-
-This field is required.
+### Native Batch Indexing Task
-See [Ingestion](../ingestion/index.html)
+See [Native batch ingestion](../ingestion/native-batch.html).
-#### IOConfig
-
-|property|description|default|required?|
-|--------|-----------|-------|---------|
-|type|The task type, this should always be "index".|none|yes|
-|firehose|Specify a [Firehose](../ingestion/firehose.html) here.|none|yes|
-|appendToExisting|Creates segments as additional shards of the latest version,
effectively appending to the segment set instead of replacing it. This will
only work if the existing segment set has extendable-type shardSpecs (which can
be forced by setting 'forceExtendableShardSpecs' in the tuning
config).|false|no|
-
-#### TuningConfig
-
-The tuningConfig is optional and default parameters will be used if no
tuningConfig is specified. See below for more details.
-
-|property|description|default|required?|
-|--------|-----------|-------|---------|
-|type|The task type, this should always be "index".|none|yes|
-|targetPartitionSize|Used in sharding. Determines how many rows are in each
segment.|5000000|no|
-|maxRowsInMemory|Used in determining when intermediate persists to disk should
occur.|75000|no|
-|maxTotalRows|Total number of rows in segments waiting for being published.
Used in determining when intermediate publish should occur.|150000|no|
-|numShards|Directly specify the number of shards to create. If this is
specified and 'intervals' is specified in the granularitySpec, the index task
can skip the determine intervals/partitions pass through the data. numShards
cannot be specified if targetPartitionSize is set.|null|no|
-|indexSpec|defines segment storage format options to be used at indexing time,
see [IndexSpec](#indexspec)|null|no|
-|maxPendingPersists|Maximum number of persists that can be pending but not
started. If this limit would be exceeded by a new intermediate persist,
ingestion will block until the currently-running persist finishes. Maximum heap
memory usage for indexing scales with maxRowsInMemory * (2 +
maxPendingPersists).|0 (meaning one persist can be running concurrently with
ingestion, and none can be queued up)|no|
-|forceExtendableShardSpecs|Forces use of extendable shardSpecs. Experimental
feature intended for use with the [Kafka indexing service
extension](../development/extensions-core/kafka-ingestion.html).|false|no|
-|forceGuaranteedRollup|Forces guaranteeing the [perfect
rollup](../design/index.html). The perfect rollup optimizes the total size of
generated segments and querying time while indexing time will be increased.
This flag cannot be used with either `appendToExisting` of IOConfig or
`forceExtendableShardSpecs`. For more details, see the below __Segment
publishing modes__ section.|false|no|
-|reportParseExceptions|If true, exceptions encountered during parsing will be
thrown and will halt ingestion; if false, unparseable rows and fields will be
skipped.|false|no|
-|publishTimeout|Milliseconds to wait for publishing segments. It must be >= 0,
where 0 means to wait forever.|0|no|
-|segmentWriteOutMediumFactory|Segment write-out medium to use when creating
segments. See [Indexing Service
Configuration](../configuration/indexing-service.html) page,
"SegmentWriteOutMediumFactory" section for explanation and available
options.|Not specified, the value from
`druid.peon.defaultSegmentWriteOutMediumFactory` is used|no|
-
-#### IndexSpec
-
-The indexSpec defines segment storage format options to be used at indexing
time, such as bitmap type and column
-compression formats. The indexSpec is optional and default parameters will be
used if not specified.
-
-|Field|Type|Description|Required|
-|-----|----|-----------|--------|
-|bitmap|Object|Compression format for bitmap indexes. Should be a JSON object;
see below for options.|no (defaults to Concise)|
-|dimensionCompression|String|Compression format for dimension columns. Choose
from `LZ4`, `LZF`, or `uncompressed`.|no (default == `LZ4`)|
-|metricCompression|String|Compression format for metric columns. Choose from
`LZ4`, `LZF`, `uncompressed`, or `none`.|no (default == `LZ4`)|
-|longEncoding|String|Encoding format for metric and dimension columns with
type long. Choose from `auto` or `longs`. `auto` encodes the values using
offset or lookup table depending on column cardinality, and store them with
variable size. `longs` stores the value as is with 8 bytes each.|no (default ==
`longs`)|
-
-##### Bitmap types
-
-For Concise bitmaps:
-
-|Field|Type|Description|Required|
-|-----|----|-----------|--------|
-|type|String|Must be `concise`.|yes|
-
-For Roaring bitmaps:
-
-|Field|Type|Description|Required|
-|-----|----|-----------|--------|
-|type|String|Must be `roaring`.|yes|
-|compressRunOnSerialization|Boolean|Use a run-length encoding where it is
estimated as more space efficient.|no (default == `true`)|
-
-#### Segment publishing modes
-
-While ingesting data using the Index task, it creates segments from the input
data and publishes them. For segment publishing, the Index task supports two
segment publishing modes, i.e., _bulk publishing mode_ and _incremental
publishing mode_ for [perfect rollup and best-effort
rollup](./design/index.html), respectively.
-
-In the bulk publishing mode, every segment is published at the very end of the
index task. Until then, created segments are stored in the memory and local
storage of the node running the index task. As a result, this mode might cause
a problem due to limited storage capacity, and is not recommended to use in
production.
-
-On the contrary, in the incremental publishing mode, segments are
incrementally published, that is they can be published in the middle of the
index task. More precisely, the index task collects data and stores created
segments in the memory and disks of the node running that task until the total
number of collected rows exceeds `maxTotalRows`. Once it exceeds, the index
task immediately publishes all segments created until that moment, cleans all
published segments up, and continues to i [...]
+### Hadoop Index Task
-To enable bulk publishing mode, `forceGuaranteedRollup` should be set in the
TuningConfig. Note that this option cannot be used with either
`forceExtendableShardSpecs` of TuningConfig or `appendToExisting` of IOConfig.
+See [Hadoop batch ingestion](../ingestion/hadoop.html).
Segment Merging Tasks
---------------------
diff --git a/docs/content/toc.md b/docs/content/toc.md
index 585e13b..e914b6e 100644
--- a/docs/content/toc.md
+++ b/docs/content/toc.md
@@ -17,6 +17,8 @@ layout: toc
* [Schema Design](/docs/VERSION/ingestion/schema-design.html)
* [Schema Changes](/docs/VERSION/ingestion/schema-changes.html)
* [Batch File Ingestion](/docs/VERSION/ingestion/batch-ingestion.html)
+ * [Native Batch Ingestion](/docs/VERSION/ingestion/native-batch.html)
+ * [Hadoop Batch Ingestion](/docs/VERSION/ingestion/hadoop.html)
* [Stream Ingestion](/docs/VERSION/ingestion/stream-ingestion.html)
* [Stream Push](/docs/VERSION/ingestion/stream-push.html)
* [Stream Pull](/docs/VERSION/ingestion/stream-pull.html)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]