jihoonson commented on a change in pull request #8257: Add support for parallel native indexing with shuffle for perfect rollup URL: https://github.com/apache/incubator-druid/pull/8257#discussion_r313664990
########## File path: docs/content/ingestion/native_tasks.md ##########

@@ -34,35 +34,50 @@ To run either kind of native batch indexing task, write an ingestion spec as spe
 [`/druid/indexer/v1/task` endpoint on the Overlord](../operations/api-reference.html#tasks), or use the `post-index-task` script included with Druid.
 
 Parallel Index Task
---------------------------------
+-------------------
 
 The Parallel Index Task is a task for parallel batch indexing. This task only uses Druid's resource and
-doesn't depend on other external systems like Hadoop. This task currently works in a single phase without shuffling intermediate
-data. `index_parallel` task is a supervisor task which basically generates multiple worker tasks and submits
-them to Overlords. Each worker task reads input data and makes segments. Once they successfully generate segments for all
-input, they report the generated segment list to the supervisor task. The supervisor task periodically checks the worker
-task statuses. If one of them fails, it retries the failed task until the retrying number reaches the configured limit.
-If all worker tasks succeed, then it collects the reported list of generated segments and publishes those segments at once.
-
-To use this task, the `firehose` in `ioConfig` should be _splittable_. If it's not, this task runs sequentially. The
-current splittable fireshoses are [`LocalFirehose`](./firehose.html#localfirehose), [`IngestSegmentFirehose`](./firehose.html#ingestsegmentfirehose), [`HttpFirehose`](./firehose.html#httpfirehose)
-, [`StaticS3Firehose`](../development/extensions-core/s3.html#statics3firehose), [`StaticAzureBlobStoreFirehose`](../development/extensions-contrib/azure.html#staticazureblobstorefirehose)
-, [`StaticGoogleBlobStoreFirehose`](../development/extensions-contrib/google.html#staticgoogleblobstorefirehose), and [`StaticCloudFilesFirehose`](../development/extensions-contrib/cloudfiles.html#staticcloudfilesfirehose).
-
-The splittable firehose is responsible for generating _splits_. The supervisor task generates _worker task specs_ each of
-which specifies a split and submits worker tasks using those specs. As a result, the number of worker tasks depends on
+doesn't depend on other external systems like Hadoop. `index_parallel` task is a supervisor task which basically generates
+multiple worker tasks and submits them to the Overlord. Each worker task reads input data and creates segments. Once they
+successfully generate segments for all input data, they report the generated segment list to the supervisor task.
+The supervisor task periodically checks the status of worker tasks. If one of them fails, it retries the failed task
+until the number of retries reaches the configured limit. If all worker tasks succeed, then it publishes the reported segments at once.
+
+The parallel Index Task can run in two different modes depending on `forceGuaranteedRollup` in `tuningConfig`.
+If `forceGuaranteedRollup` = false, it's executed in a single phase. In this mode,
+each sub task creates segments individually and reports them to the supervisor task.
+
+If `forceGuaranteedRollup` = true, it's executed in two phases with data shuffle which is similar to [MapReduce](https://en.wikipedia.org/wiki/MapReduce).
+In the first phase, each sub task partitions input data based on `segmentGranularity` (primary partition key) in `granaulritySpec`

Review comment:
   Thanks, fixed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
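For context, the two modes discussed in the diff are selected via `tuningConfig`. The following is only an illustrative sketch of a two-phase `index_parallel` spec: the datasource name, base directory, shard count, and sub-task limit are hypothetical values, and exact field names may differ between Druid versions.

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "example_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": {
        "type": "local",
        "baseDir": "/hypothetical/data/dir",
        "filter": "*.json"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "forceGuaranteedRollup": true,
      "numShards": 3,
      "maxNumSubTasks": 4
    }
  }
}
```

With `forceGuaranteedRollup` set to `true`, the first-phase sub tasks partition rows by `segmentGranularity` as described in the diff, and a second phase merges the shuffled partitions into segments, guaranteeing perfect rollup; setting it to `false` keeps the original single-phase behavior in which each sub task creates and reports segments on its own.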