takaaki7 commented on issue #15140:
URL: https://github.com/apache/druid/issues/15140#issuecomment-1774015604

   "Normal sampling" referred to the general table-sampling feature found in 
other databases. For instance, BigQuery offers the following functionality:
   https://cloud.google.com/bigquery/docs/table-sampling
   
   Druid does not yet have such a feature.
   
   I would like the typical sampling features found in systems like BigQuery, 
but what I especially want is sampling that targets specific shards, because it 
is useful for `COUNT(DISTINCT user_id)` (`user_id` is just an example of a 
sharding key).
   
   
   Consider a method for calculating the number of unique users (UU) from a 
table that accumulates user behavior logs.
   
   Assume the table contains 10 users and each user has 10 rows (total rows = 
10 × 10 = 100).
   
   |user_id|event|
   |-------|-----|
   |user1|event1|
   |user1|event2|
   |user1|...|
   |user1|event10|
   |...|...|
   |user10|event10|
   
   Suppose we want to sample rows to improve performance.
   We might consider randomly scanning 10% of the table (10 rows), counting the 
distinct users in the sample, and multiplying that count by 10 to get an 
estimated result.
   However, this method does not give an accurate estimate; it almost always 
overestimates.
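   A quick simulation makes the overestimation visible (a hypothetical Python 
sketch, not Druid code; the table layout of 10 users × 10 rows matches the 
example above):

   ```python
   import random

   # Hypothetical table: 10 users, 10 rows each (100 rows total).
   rows = [f"user{u}" for u in range(1, 11) for _ in range(10)]

   # Random sampling: scan 10% of the rows, count distinct users in the
   # sample, then scale by 10 to estimate the total distinct count.
   random.seed(42)
   estimates = []
   for _ in range(10_000):
       sample = random.sample(rows, 10)        # 10% of the table
       estimates.append(len(set(sample)) * 10)

   print(sum(estimates) / len(estimates))  # average estimate; true answer is 10
   ```

   The average estimate lands far above the true value of 10, because a 
10-row sample almost always touches many different users.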
   
   The only scenario in which the estimate is exact is when all 10 randomly 
chosen rows belong to a single user; multiplying the count by 10 then gives 
the correct result.
   
   But this scenario is extremely rare. Its probability is
   P(uu=1) = 10 / 100C10 = 10 / 17,310,309,456,440
   (`nCm` denotes the binomial coefficient, the number of ways to choose m 
items from n.)
   
   Probabilities for the other distinct-user counts follow by 
inclusion-exclusion (count the 10-row subsets of the chosen users' rows that 
touch every one of those users):
   P(uu=2) = (10C2 * (20C10 - 2C1 * 10C10)) / 100C10
   P(uu=3) = (10C3 * (30C10 - 3C1 * 20C10 + 3C2 * 10C10)) / 100C10
   ...
   
   The probability of having 5 or fewer unique users in the sample is only 
about 10%.
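   These formulas can be checked exactly with `math.comb` (a sketch of the 
inclusion-exclusion above, using the same 100-row, 10-user table):

   ```python
   from math import comb

   TOTAL_ROWS, USERS, ROWS_PER_USER, SAMPLE = 100, 10, 10, 10

   def p_exactly_k_users(k: int) -> float:
       """P(the 10-row sample contains rows from exactly k distinct users)."""
       # Inclusion-exclusion: 10-row subsets of the k users' rows (k*10 rows)
       # that cover all k users.
       covering = sum(
           (-1) ** j * comb(k, j) * comb((k - j) * ROWS_PER_USER, SAMPLE)
           for j in range(k)
       )
       return comb(USERS, k) * covering / comb(TOTAL_ROWS, SAMPLE)

   print(p_exactly_k_users(1))                            # 10 / 100C10
   print(sum(p_exactly_k_users(k) for k in range(1, 6)))  # ≈ 0.10
   ```

   Summing P(uu=k) for k = 1..5 gives roughly 0.10, confirming the ~10% 
figure.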
   In the remaining ~90% of cases the sample contains at least 6 distinct 
users, so the estimate is at least 6 times the true value. Random sampling 
(what I referred to as 'normal sampling') is therefore not useful for counting 
unique users.
   
   So my proposal is: if rows were sharded by `user_id`, then by targeting a 
specific shard for sampling we could obtain a reliable approximation.
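   The idea can be sketched as follows (hypothetical Python, not Druid code; 
`hash_shard` is an assumed stand-in for hash partitioning on the sharding key, 
and 1,000 users are used so the hash split is visible):

   ```python
   import hashlib

   SHARDS = 10
   N_USERS = 1_000  # hypothetical population; true COUNT(DISTINCT user_id)

   def hash_shard(user_id: str) -> int:
       """Assumed stand-in for hash partitioning on the sharding key."""
       return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % SHARDS

   # Hypothetical table: each user has 10 behavior-log rows.
   rows = [f"user{u}" for u in range(N_USERS) for _ in range(10)]

   # Scan only shard 0 (~10% of the rows). Every row of a given user lands
   # in the same shard, so the distinct users seen there are a clean
   # 1-in-SHARDS sample of all users -- unlike row-level random sampling.
   users_in_shard = {uid for uid in rows if hash_shard(uid) == 0}
   estimate = len(users_in_shard) * SHARDS
   print(estimate)  # close to the true value of 1,000
   ```

   Because the sharding key and the `COUNT(DISTINCT ...)` key coincide, 
scaling by the number of shards is unbiased and tightly concentrated, instead 
of systematically overestimating.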


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
