[GitHub] [incubator-druid] jihoonson commented on issue #8061: Native parallel batch indexing with shuffle

GitBox Mon, 21 Oct 2019 11:25:08 -0700

jihoonson commented on issue #8061: Native parallel batch indexing with shuffle
URL:
https://github.com/apache/incubator-druid/issues/8061#issuecomment-544644224

Ah, the partitionsSpec you used is for [hash-based
partitioning](https://druid.apache.org/docs/latest/ingestion/hadoop.html#hash-based-partitioning).
To use [range
partitioning](https://druid.apache.org/docs/latest/ingestion/hadoop.html#single-dimension-range-partitioning),
the `type` of the partitionsSpec should be `single_dim` instead of `hashed`.
For now, single-dimension range partitioning is supported only by the Hadoop
task, and I believe the native parallel indexing task will support it in the
next release.
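
As a rough illustration of the difference, here is a minimal Python sketch that builds the two partitionsSpec variants and prints the range-partitioned one as JSON. Only the `type` values (`hashed`, `single_dim`) come from the comment above; the other field names (`targetPartitionSize`, `partitionDimension`, `partitionDimensions`), the enclosing `tuningConfig` shape, and the helper `tuning_config` are assumptions that should be checked against the Hadoop ingestion docs for your Druid version.

```python
import json

# Only the `type` values below are taken from the comment above.
# The remaining field names and values are assumptions; verify them
# against the Hadoop ingestion docs for your Druid version.
hashed_spec = {
    "type": "hashed",
    "targetPartitionSize": 5000000,        # assumed field name and value
    "partitionDimensions": ["tenant_id"],  # assumed field name
}

single_dim_spec = {
    "type": "single_dim",
    "targetPartitionSize": 5000000,        # assumed field name and value
    "partitionDimension": "tenant_id",     # assumed field name
}


def tuning_config(partitions_spec):
    """Hypothetical helper: wrap a partitionsSpec in a minimal tuningConfig fragment."""
    return {"type": "hadoop", "partitionsSpec": partitions_spec}


# Print the range-partitioned variant as it might appear in an ingestion spec.
print(json.dumps(tuning_config(single_dim_spec), indent=2))
```

Swapping `single_dim_spec` for `hashed_spec` in the last line gives the hash-based variant for comparison.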

> Hi @jihoonson, I split data into different segments by single dimension
tenant_id for the multitenancy scene. In this way, I can get higher query
performance that filters on the tenant_id dimension.

I'm pretty surprised by this and wonder how big the performance gain was
in your case. Sadly, Druid doesn't yet support segment pruning in brokers for
hash-based partitioning (it is supported only for single-dimension
range partitioning). That means that even though your segments are partitioned
by the hash value of `tenant_id`, the broker will send queries to all
historicals holding any segment that overlaps the query interval, no matter
what its hash value is. My guess is that any improvement you see when
filtering on `tenant_id` comes from fewer branch mispredictions. Could you
share your benchmark results?
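
To make the pruning point concrete, here is a toy Python model (not Druid code) of what a broker could do in each case. The `Segment` class, the function names, and the inclusive-start/exclusive-end range convention are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Toy stand-in for a range-partitioned segment (hypothetical, not a Druid class)."""
    name: str
    start: Optional[str]  # assumed inclusive lower bound; None means unbounded
    end: Optional[str]    # assumed exclusive upper bound; None means unbounded


def may_contain(seg: Segment, value: str) -> bool:
    """Return True if the segment's range could hold rows with this value."""
    lower_ok = seg.start is None or seg.start <= value
    upper_ok = seg.end is None or value < seg.end
    return lower_ok and upper_ok


def prune_range(segments: List[Segment], value: str) -> List[Segment]:
    """Range partitioning: keep only segments whose range covers the filter value."""
    return [s for s in segments if may_contain(s, value)]


def prune_hashed(segments: List[Segment], value: str) -> List[Segment]:
    """Hash partitioning as described above: no pruning, every segment is queried."""
    return list(segments)


segments = [
    Segment("p0", None, "tenant_g"),
    Segment("p1", "tenant_g", "tenant_p"),
    Segment("p2", "tenant_p", None),
]

print([s.name for s in prune_range(segments, "tenant_k")])
print([s.name for s in prune_hashed(segments, "tenant_k")])
```

In this model a filter on a single `tenant_id` touches one segment under range partitioning, while the hashed case still fans out to every segment in the interval.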

> But tenant data was skewed, so the segment size was not perfect. For
example, the max segment size was nearly 18GB but the min segment size was 5MB.
Queries on the 18GB segment were then much slower than on segments partitioned
by all dimensions.

One popular way to mitigate data skew is to add other columns to the
partition key, so that segment partitioning is better balanced. This
reduces data locality, though, so I guess you may need to find a good
combination of columns for the partition key.
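
The trade-off can be sketched with a small Python simulation. It is a toy model under stated assumptions, not Druid's partitioning logic: rows are assigned to a fixed number of partitions by a CRC32 hash of the key, and the `user_id` column, the tenant names, and the row counts are all made up for illustration.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # arbitrary choice for this toy example


def partition_of(key_parts):
    """Assign a row to a partition by hashing its partition-key columns."""
    key = "|".join(key_parts).encode("utf-8")
    return zlib.crc32(key) % NUM_PARTITIONS


# Synthetic, skewed data: one large tenant and ten small ones.
# Each row is (tenant_id, user_id); user_id is a hypothetical extra column.
rows = [("big_tenant", f"user_{i}") for i in range(9000)]
for t in range(10):
    rows += [(f"small_tenant_{t}", f"user_{i}") for i in range(100)]

# Partition key = tenant_id only: all rows of a tenant land in one partition.
by_tenant = Counter(partition_of([tenant]) for tenant, _ in rows)

# Partition key = (tenant_id, user_id): a tenant's rows spread across partitions.
by_tenant_user = Counter(partition_of([tenant, user]) for tenant, user in rows)

# Locality cost: how many partitions now hold rows of the big tenant.
big_tenant_partitions = {
    partition_of([tenant, user]) for tenant, user in rows if tenant == "big_tenant"
}

print("largest partition, key=tenant_id:", max(by_tenant.values()))
print("largest partition, key=(tenant_id, user_id):", max(by_tenant_user.values()))
print("partitions holding big_tenant rows:", len(big_tenant_partitions))
```

The wider key evens out partition sizes, but a query for the big tenant now has to read several partitions instead of one, which is the locality loss mentioned above.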


