jihoonson commented on a change in pull request #8257: Add support for parallel 
native indexing with shuffle for perfect rollup
URL: https://github.com/apache/incubator-druid/pull/8257#discussion_r313664990
 
 

 ##########
 File path: docs/content/ingestion/native_tasks.md
 ##########
 @@ -34,35 +34,50 @@ To run either kind of native batch indexing task, write an 
ingestion spec as spe
 [`/druid/indexer/v1/task` endpoint on the 
Overlord](../operations/api-reference.html#tasks), or use the `post-index-task` 
script included with Druid.
 
 Parallel Index Task
---------------------------------
+-------------------
 
 The Parallel Index Task is a task for parallel batch indexing. This task only 
uses Druid's resources and
-doesn't depend on other external systems like Hadoop. This task currently 
works in a single phase without shuffling intermediate
-data. `index_parallel` task is a supervisor task which basically generates 
multiple worker tasks and submits
-them to Overlords. Each worker task reads input data and makes segments. Once 
they successfully generate segments for all
-input, they report the generated segment list to the supervisor task. The 
supervisor task periodically checks the worker
-task statuses. If one of them fails, it retries the failed task until the 
retrying number reaches the configured limit.
-If all worker tasks succeed, then it collects the reported list of generated 
segments and publishes those segments at once.
-
-To use this task, the `firehose` in `ioConfig` should be _splittable_. If it's 
not, this task runs sequentially. The
-current splittable fireshoses are 
[`LocalFirehose`](./firehose.html#localfirehose), 
[`IngestSegmentFirehose`](./firehose.html#ingestsegmentfirehose), 
[`HttpFirehose`](./firehose.html#httpfirehose)
-, 
[`StaticS3Firehose`](../development/extensions-core/s3.html#statics3firehose), 
[`StaticAzureBlobStoreFirehose`](../development/extensions-contrib/azure.html#staticazureblobstorefirehose)
-, 
[`StaticGoogleBlobStoreFirehose`](../development/extensions-contrib/google.html#staticgoogleblobstorefirehose),
 and 
[`StaticCloudFilesFirehose`](../development/extensions-contrib/cloudfiles.html#staticcloudfilesfirehose).
-
-The splittable firehose is responsible for generating _splits_. The supervisor 
task generates _worker task specs_ each of
-which specifies a split and submits worker tasks using those specs. As a 
result, the number of worker tasks depends on
+doesn't depend on other external systems like Hadoop. The `index_parallel` task 
is a supervisor task that generates
+multiple worker tasks and submits them to the Overlord. Each worker task reads 
input data and creates segments. Once they
+successfully generate segments for all input data, they report the generated 
segment list to the supervisor task.
+The supervisor task periodically checks the status of its worker tasks. If one 
of them fails, it retries the failed task
+until the number of retries reaches the configured limit. If all worker tasks 
succeed, it publishes the reported segments at once.
+
+The Parallel Index Task can run in two different modes depending on 
`forceGuaranteedRollup` in `tuningConfig`.
+If `forceGuaranteedRollup` = false, it runs in a single phase. In this 
mode,
+each sub task creates segments individually and reports them to the supervisor 
task.
+
+If `forceGuaranteedRollup` = true, it runs in two phases with a data 
shuffle, similar to 
[MapReduce](https://en.wikipedia.org/wiki/MapReduce).
+In the first phase, each sub task partitions input data based on 
`segmentGranularity` (primary partition key) in `granularitySpec`
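(Editor's note: to make the two modes above concrete, here is a minimal sketch of the relevant `tuningConfig` fragment for an `index_parallel` task. Field names follow the Druid native batch docs; the `numShards` value and the exact `partitionsSpec` shape are illustrative assumptions, not taken from this patch:)

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "forceGuaranteedRollup": true,
    "partitionsSpec": {
      "type": "hashed",
      "numShards": 4
    }
  }
}
```

With `forceGuaranteedRollup` set to false (the default), the `partitionsSpec` above can be omitted and the task runs in the single-phase mode described earlier.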
 
 Review comment:
   Thanks, fixed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org
