jihoonson commented on issue #8061: Native parallel batch indexing with shuffle
URL: https://github.com/apache/incubator-druid/issues/8061#issuecomment-544644224

Ah, the partitionsSpec you used is [hash-based partitioning](https://druid.apache.org/docs/latest/ingestion/hadoop.html#hash-based-partitioning). To use [range partitioning](https://druid.apache.org/docs/latest/ingestion/hadoop.html#single-dimension-range-partitioning), the `type` of the partitionsSpec should be `single_dim` instead of `hashed`. Single-dimension range partitioning is supported only by the Hadoop task for now, and I believe the native parallel indexing task will support it in the next release.

> Hi @jihoonson, I split data into different segments by single dimension tenant_id for the multitenancy scene. In this way, I can get higher query performance that filters on the tenant_id dimension.

I'm pretty surprised by this and am wondering how big the performance gain was in your case. Sadly, Druid doesn't support segment pruning in brokers for hash-based partitioning for now (it is supported only for single-dimension range partitioning). That means that even though your segments are partitioned by the hash value of `tenant_id`, the broker will send queries to all historicals that have any segment overlapping the query interval, no matter what the hash value is. My guess is that you may see some improvement when filtering on `tenant_id` because of fewer branch mispredictions. Could you share your performance benchmark results if you can?

> But tenant data was skewed, so the segment size was not perfect. For example, the max segment size was nearly 18GB but the min segment size was 5MB. Queries on the 18GB segment were then slower than on data partitioned by all dimensions.

One popular way to mitigate data skew is to add other columns to the partition key, so that segment partitioning is better balanced.
This reduces data locality, though, so I guess you may need to find a good combination of columns for the partition key.
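
To make the trade-off concrete, here is a minimal sketch on synthetic data. It is not Druid code: `zlib.crc32` only stands in for whatever hash function Druid actually uses, and the `user_id` column, the row counts, and the partition count are all hypothetical. It compares partition sizes when hashing on `tenant_id` alone versus on `tenant_id` plus a second column.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count, for illustration only


def partition_of(key_values, num_partitions=NUM_PARTITIONS):
    """Map a tuple of key values to a partition with a deterministic hash.

    crc32 is a stand-in here; it is NOT the hash function Druid uses.
    """
    key = "\x00".join(str(v) for v in key_values).encode("utf-8")
    return zlib.crc32(key) % num_partitions


def partition_sizes(rows, key_columns, num_partitions=NUM_PARTITIONS):
    """Count how many rows land in each partition for the given key columns."""
    counts = Counter(
        partition_of([row[c] for c in key_columns], num_partitions) for row in rows
    )
    return [counts.get(p, 0) for p in range(num_partitions)]


def partitions_for_tenant(rows, key_columns, tenant):
    """Return the set of partitions that hold rows of one tenant."""
    return {
        partition_of([row[c] for c in key_columns])
        for row in rows
        if row["tenant_id"] == tenant
    }


# Synthetic, skewed data: one large tenant and ten small ones.
rows = [{"tenant_id": "big", "user_id": i % 500} for i in range(9000)]
for t in range(10):
    rows += [{"tenant_id": f"small-{t}", "user_id": i % 50} for i in range(100)]

by_tenant = partition_sizes(rows, ["tenant_id"])
by_tenant_user = partition_sizes(rows, ["tenant_id", "user_id"])

print("tenant_id only:      ", by_tenant)
print("tenant_id + user_id: ", by_tenant_user)

# Locality cost: the large tenant's rows are now spread over several partitions.
print("partitions holding 'big' (tenant_id only):",
      len(partitions_for_tenant(rows, ["tenant_id"], "big")))
print("partitions holding 'big' (tenant_id + user_id):",
      len(partitions_for_tenant(rows, ["tenant_id", "user_id"], "big")))
```

With `tenant_id` alone, every row of the large tenant falls into one partition, which is the skew described above. Adding the second column evens out the sizes, but a filter on `tenant_id` then has to touch several partitions, which is the locality cost mentioned in the comment.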
