jihoonson commented on a change in pull request #8257: Add support for parallel native indexing with shuffle for perfect rollup URL: https://github.com/apache/incubator-druid/pull/8257#discussion_r313664990
########## File path: docs/content/ingestion/native_tasks.md ##########

@@ -34,35 +34,50 @@ To run either kind of native batch indexing task, write an ingestion spec as spe
 [`/druid/indexer/v1/task` endpoint on the Overlord](../operations/api-reference.html#tasks), or use the `post-index-task` script included with Druid.
 
 Parallel Index Task
---------------------------------
+-------------------
 
 The Parallel Index Task is a task for parallel batch indexing. This task only uses Druid's resource and
-doesn't depend on other external systems like Hadoop. This task currently works in a single phase without shuffling intermediate
-data. `index_parallel` task is a supervisor task which basically generates multiple worker tasks and submits
-them to Overlords. Each worker task reads input data and makes segments. Once they successfully generate segments for all
-input, they report the generated segment list to the supervisor task. The supervisor task periodically checks the worker
-task statuses. If one of them fails, it retries the failed task until the retrying number reaches the configured limit.
-If all worker tasks succeed, then it collects the reported list of generated segments and publishes those segments at once.
-
-To use this task, the `firehose` in `ioConfig` should be _splittable_. If it's not, this task runs sequentially. The
-current splittable fireshoses are [`LocalFirehose`](./firehose.html#localfirehose), [`IngestSegmentFirehose`](./firehose.html#ingestsegmentfirehose), [`HttpFirehose`](./firehose.html#httpfirehose)
-, [`StaticS3Firehose`](../development/extensions-core/s3.html#statics3firehose), [`StaticAzureBlobStoreFirehose`](../development/extensions-contrib/azure.html#staticazureblobstorefirehose)
-, [`StaticGoogleBlobStoreFirehose`](../development/extensions-contrib/google.html#staticgoogleblobstorefirehose), and [`StaticCloudFilesFirehose`](../development/extensions-contrib/cloudfiles.html#staticcloudfilesfirehose).
-
-The splittable firehose is responsible for generating _splits_. The supervisor task generates _worker task specs_ each of
-which specifies a split and submits worker tasks using those specs. As a result, the number of worker tasks depends on
+doesn't depend on other external systems like Hadoop. `index_parallel` task is a supervisor task which basically generates
+multiple worker tasks and submits them to the Overlord. Each worker task reads input data and creates segments. Once they
+successfully generate segments for all input data, they report the generated segment list to the supervisor task.
+The supervisor task periodically checks the status of worker tasks. If one of them fails, it retries the failed task
+until the number of retries reaches the configured limit. If all worker tasks succeed, then it publishes the reported segments at once.
+
+The parallel Index Task can run in two different modes depending on `forceGuaranteedRollup` in `tuningConfig`.
+If `forceGuaranteedRollup` = false, it's executed in a single phase. In this mode,
+each sub task creates segments individually and reports them to the supervisor task.
+
+If `forceGuaranteedRollup` = true, it's executed in two phases with data shuffle which is similar to [MapReduce](https://en.wikipedia.org/wiki/MapReduce).
+In the first phase, each sub task partitions input data based on `segmentGranularity` (primary partition key) in `granaulritySpec`

Review comment:
   Thanks, fixed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
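For context, the two modes discussed in the diff are selected via `tuningConfig`. The following is only an illustrative sketch of a two-phase `index_parallel` spec: the datasource name, base directory, shard count, and sub-task limit are hypothetical values, and exact field names may differ between Druid versions.

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "example_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": {
        "type": "local",
        "baseDir": "/hypothetical/data/dir",
        "filter": "*.json"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "forceGuaranteedRollup": true,
      "numShards": 3,
      "maxNumSubTasks": 4
    }
  }
}
```

With `forceGuaranteedRollup` set to `true`, the first-phase sub tasks partition rows by `segmentGranularity` as described in the diff, and a second phase merges the shuffled partitions into segments, guaranteeing perfect rollup; setting it to `false` keeps the original single-phase behavior in which each sub task creates and reports segments on its own.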