ccaominh opened a new pull request #8925: Parallel indexing single dim partitions
URL: https://github.com/apache/incubator-druid/pull/8925
 
 
   ### Description
   
   Implements single dimension range partitioning for native parallel batch 
indexing as described in #8769. This initial version requires the 
druid-datasketches extension to be loaded.
   
   The algorithm has 5 phases that are orchestrated by the supervisor in 
`ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`. These 
phases and the main classes involved are described below:
   
   1) In parallel, determine the distribution of dimension values for each 
input source split.
   
      `PartialDimensionDistributionTask` uses `StringSketch` to generate the 
approximate distribution of dimension values for each input source split. If 
the rows are ungrouped, 
`PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter` uses a 
Bloom filter to skip rows that would be grouped. The final distribution is sent 
back to the supervisor via `DimensionDistributionReport`.
   
   2) The range partitions are determined.
   
      In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the 
supervisor uses `StringSketchMerger` to merge the individual `StringSketch`es 
created in the preceding phase. The merged sketch is then used to create the 
range partitions.
   
   3) In parallel, generate partial range-partitioned segments.
   
      `PartialRangeSegmentGenerateTask` uses the range partitions determined in 
the preceding phase and `RangePartitionCachingLocalSegmentAllocator` to 
generate `SingleDimensionShardSpec`s. The partition information is sent back to 
the supervisor via `GeneratedGenericPartitionsReport`.
   
   4) The partial range segments are grouped.
   
      In 
`ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`, the 
supervisor creates the `PartialGenericSegmentMergeIOConfig`s necessary for the 
next phase.
   
   5) In parallel, merge partial range-partitioned segments.
   
      `PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to 
retrieve the partial range-partitioned segments generated earlier and then 
merges and publishes them.
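
Phases 1 through 3 can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: exact per-value counts stand in for the approximate `StringSketch`, and the class name, method names, and `targetRowsPerPartition` parameter are all hypothetical. Each range is assumed to include its start value and exclude its end value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: exact counts replace the approximate sketch used by the real tasks.
final class RangePartitionSketch
{
  // Phases 1 and 2: merge the per-split value counts into one sorted distribution.
  static TreeMap<String, Long> merge(List<Map<String, Long>> perSplitCounts)
  {
    TreeMap<String, Long> merged = new TreeMap<>();
    for (Map<String, Long> counts : perSplitCounts) {
      counts.forEach((value, count) -> merged.merge(value, count, Long::sum));
    }
    return merged;
  }

  // Phase 2: walk the sorted values and start a new partition once the current
  // one holds at least targetRowsPerPartition rows. The returned values are the
  // inner boundaries; the first partition is unbounded below and the last is
  // unbounded above.
  static List<String> boundaries(TreeMap<String, Long> distribution, long targetRowsPerPartition)
  {
    List<String> cuts = new ArrayList<>();
    long rowsInCurrent = 0;
    for (Map.Entry<String, Long> entry : distribution.entrySet()) {
      if (rowsInCurrent >= targetRowsPerPartition) {
        cuts.add(entry.getKey());
        rowsInCurrent = 0;
      }
      rowsInCurrent += entry.getValue();
    }
    return cuts;
  }

  // Phase 3: find the partition index for a dimension value. A value equal to
  // a boundary belongs to the partition that starts at that boundary.
  static int partitionFor(List<String> cuts, String value)
  {
    int pos = Collections.binarySearch(cuts, value);
    return pos >= 0 ? pos + 1 : -(pos + 1);
  }
}
```

Because only the boundaries are handed to the segment-generating subtasks in this sketch, each subtask can route a row to its partition with a binary search and no further coordination.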
   
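
The grouping in phase 4 can be sketched in the same spirit. The `Location` class below is a hypothetical, simplified stand-in for `GenericPartitionLocation`, and keying each group by interval plus partition ID is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionGroupingSketch
{
  // Hypothetical stand-in for a partial segment location reported by a phase 3 subtask.
  static final class Location
  {
    final String interval;
    final int partitionId;
    final String subTaskId;

    Location(String interval, int partitionId, String subTaskId)
    {
      this.interval = interval;
      this.partitionId = partitionId;
      this.subTaskId = subTaskId;
    }
  }

  // Collect every partial segment that belongs to the same (interval, partition)
  // so that one merge subtask can fetch and merge all of them.
  static Map<String, List<Location>> groupPerPartition(List<Location> reported)
  {
    Map<String, List<Location>> grouped = new LinkedHashMap<>();
    for (Location location : reported) {
      String key = location.interval + "#" + location.partitionId;
      grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(location);
    }
    return grouped;
  }
}
```

Each entry of the resulting map corresponds to the input of one merge subtask in the final phase.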
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] added comments explaining the "why" and the intent of the code 
wherever it would not be obvious to an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths.
   - [x] added integration tests.
   
