jihoonson commented on issue #8061: Native parallel batch indexing with shuffle
URL: 
https://github.com/apache/incubator-druid/issues/8061#issuecomment-544644224
 
 
   Ah, the partitionsSpec you used selects [hash-based 
partitioning](https://druid.apache.org/docs/latest/ingestion/hadoop.html#hash-based-partitioning).
 To use [single-dimension range 
partitioning](https://druid.apache.org/docs/latest/ingestion/hadoop.html#single-dimension-range-partitioning),
 the `type` of the partitionsSpec should be `single_dim` instead of `hashed`. 
Single-dimension range partitioning is currently supported only by the Hadoop 
task; I believe the native parallel indexing task will support it in the next 
release. 
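
   As a rough illustration of the difference, here is a minimal sketch that 
builds the two partitionsSpec variants as JSON. Only the `type` values come 
from this comment; the field names `partitionDimensions` and 
`partitionDimension` are assumptions based on the linked Hadoop ingestion 
docs, and size-related fields are omitted, so check them against your Druid 
version.

```python
import json

# Hash-based partitioning: rows are assigned to segments by the hash of the
# listed dimensions. The field name is an assumption; verify it in your docs.
hashed_spec = {
    "type": "hashed",
    "partitionDimensions": ["tenant_id"],
}

# Single-dimension range partitioning: each segment covers a contiguous range
# of values of one dimension. The field name is again an assumption.
single_dim_spec = {
    "type": "single_dim",
    "partitionDimension": "tenant_id",
}

for spec in (hashed_spec, single_dim_spec):
    print(json.dumps({"partitionsSpec": spec}, indent=2))
```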
   
   > Hi @jihoonson, I split data into different segments by the single 
dimension tenant_id for a multitenancy scenario. In this way, I get higher 
performance for queries that filter on the tenant_id dimension.
   
   I'm pretty surprised by this and wonder how big the performance gain was 
in your case. Unfortunately, Druid does not yet support segment pruning in 
brokers for hash-based partitioning (it is supported only for single-dimension 
range partitioning). That means that even though your segments are partitioned 
by the hash of `tenant_id`, the broker sends queries to every historical that 
holds any segment overlapping the query interval, regardless of hash value. My 
guess is that any improvement you see when filtering on `tenant_id` comes from 
fewer branch mispredictions. Could you share your benchmark results?
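
   To make the pruning difference concrete, here is a toy model of the 
behavior described above. It is not Druid code: `RangeSegment`, 
`prune_range`, and `prune_hashed` are hypothetical names, and the segment 
boundaries are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RangeSegment:
    """Hypothetical stand-in for a range-partitioned segment."""
    name: str
    start: Optional[str]  # inclusive lower bound; None means unbounded
    end: Optional[str]    # exclusive upper bound; None means unbounded


def prune_range(segments: List[RangeSegment], value: str) -> List[RangeSegment]:
    # With range partitioning, a filter on the partition dimension lets the
    # broker skip segments whose range cannot contain the value.
    return [
        s for s in segments
        if (s.start is None or s.start <= value)
        and (s.end is None or value < s.end)
    ]


def prune_hashed(segments: List[str], value: str) -> List[str]:
    # Models the behavior described in this comment: no broker-side pruning
    # for hash partitioning, so every segment in the interval is queried.
    return list(segments)


range_segments = [
    RangeSegment("seg0", None, "tenant_100"),
    RangeSegment("seg1", "tenant_100", "tenant_200"),
    RangeSegment("seg2", "tenant_200", None),
]
hashed_segments = ["hash0", "hash1", "hash2"]

print([s.name for s in prune_range(range_segments, "tenant_150")])
print(prune_hashed(hashed_segments, "tenant_150"))
```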
   
   > But the tenant data was skewed, so segment sizes were uneven. For 
example, the largest segment was nearly 18GB while the smallest was 5MB. 
Queries on the 18GB segment were then much slower than on segments partitioned 
by all dimensions.
   
   One common way to mitigate data skew is to add other columns to the 
partition key so that segments are better balanced. This weakens data 
locality, though, so you may need to find a good combination of columns for 
the partition key.
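
   The trade-off can be sketched with a small simulation. This is only an 
illustration under assumed data: the rows, the `user_id`-style second column, 
the partition count, and the MD5-based `bucket` helper are all hypothetical 
and are not how Druid hashes rows.

```python
import hashlib
from collections import Counter


def bucket(key_parts, num_partitions):
    # Deterministic hash of the partition key (illustrative only).
    digest = hashlib.md5("|".join(key_parts).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def partition_sizes(rows, key_fn, num_partitions=4):
    # Count how many rows land in each partition for a given key choice.
    counts = Counter(bucket(key_fn(r), num_partitions) for r in rows)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Skewed data: one tenant owns 90% of the rows.
rows = [("big_tenant", f"user_{i}") for i in range(9000)]
rows += [(f"small_{t}", f"user_{i}") for t in range(10) for i in range(100)]

by_tenant = partition_sizes(rows, lambda r: (r[0],))
by_tenant_user = partition_sizes(rows, lambda r: (r[0], r[1]))

print("tenant_id only:       ", by_tenant)
print("tenant_id + 2nd column:", by_tenant_user)
```

   With `tenant_id` alone, all rows of the large tenant fall into one 
partition. Adding a second column spreads them out, at the cost that a 
single tenant's rows are no longer co-located.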
