[I] [Proposal] Fixed-shard sampling (druid)

via GitHub Thu, 12 Oct 2023 02:09:48 -0700


takaaki7 opened a new issue, #15140:
URL: https://github.com/apache/druid/issues/15140


   
   I'd like to propose a table sampling feature. Specifically, it would be 
beneficial if we could sample a fixed set of shard numbers. This would 
allow us to estimate distinct counts over the secondary partition key with 
high accuracy.
   
   For instance, consider a table storing user activity data on a website as 
follows:
   ```
   Schema: __time, user_id, event_name, ...
   Partition: day
   Secondary partition: type=hash, key=user_id, shard_num=100
   ```
   
   With ordinary row-level sampling, aggregations like COUNT(DISTINCT user_id) 
cannot be scaled back to the full table. If we were to randomly select 50% of 
the rows, the probability that at least one row with a particular user_id is 
included in the sample is generally not close to 50%; it depends on how many 
rows that user has.
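   As a rough illustration of the point above, here is a small Python sketch. 
It assumes each row is kept independently with the given sampling rate (my 
assumption; this issue does not specify a row-sampling mechanism), and the 
function names are made up for the example.

```python
import random


def inclusion_probability(rows_per_user: int, sample_rate: float) -> float:
    """Chance that at least one of a user's rows survives row-level sampling.

    Assumes every row is kept independently with probability `sample_rate`.
    """
    return 1.0 - (1.0 - sample_rate) ** rows_per_user


def simulate(rows_per_user: int, sample_rate: float,
             users: int = 20_000, seed: int = 0) -> float:
    """Empirical fraction of users that appear in a row-level sample."""
    rng = random.Random(seed)
    seen = 0
    for _ in range(users):
        # A user is "seen" if any one of their rows is sampled.
        if any(rng.random() < sample_rate for _ in range(rows_per_user)):
            seen += 1
    return seen / users
```

   Under that assumption, a user with 3 rows shows up in a 50% row sample with 
probability 0.875, so doubling the sampled distinct count would overestimate 
the true value.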
   
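   However, what if we targeted only shards 0-49 out of the 100 shards? Given 
that the hash function generally distributes keys uniformly, any specific 
user_id would be part of the sample with a probability of roughly 50%. And 
because all rows for a user_id within a day land in the same shard, a sampled 
user is sampled completely.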
   
   Consider the following query, for example:
   ```
   SELECT 
   COUNT(DISTINCT user_id)
   FROM table TABLESAMPLE(percentage=50, type=fixed_shard)
   WHERE '2023-10-01T00:00:00Z' < __time AND __time < '2023-10-02T00:00:00Z'
   ```
   
   In the case where the segment files exist as:
   ```
   <2023-10-01T00:00:00Z>_0
   <2023-10-01T00:00:00Z>_1
   <2023-10-01T00:00:00Z>_...
   <2023-10-01T00:00:00Z>_99
   <2023-10-02T00:00:00Z>_0
   <2023-10-02T00:00:00Z>_1
   <2023-10-02T00:00:00Z>_...
   <2023-10-02T00:00:00Z>_99
   ```
   The idea is to target only `<2023-10-01T00:00:00Z>_<0-49>` and 
`<2023-10-02T00:00:00Z>_<0-49>`.
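
   The selection step could look roughly like this in Python. The `Segment` 
tuple is a simplified, hypothetical stand-in for a segment identifier, not 
Druid's actual class, and `select_fixed_shards` is an illustrative name.

```python
from typing import List, Tuple

# (interval start, partition number); hypothetical stand-in for a segment id
Segment = Tuple[str, int]


def select_fixed_shards(segments: List[Segment], percentage: int,
                        shard_num: int) -> List[Segment]:
    """Keep only segments whose partition number falls in the sampled range."""
    cutoff = shard_num * percentage // 100  # e.g. 50 for 50% of 100 shards
    # The same partition numbers are kept in every time chunk, so a given
    # user_id is consistently in or out of the sample across days.
    return [seg for seg in segments if seg[1] < cutoff]
```

   Using the same shard range for every day is what makes the sample "fixed": 
the set of sampled user_ids does not change from one time chunk to the next, 
provided every chunk uses the same shard_num and hash.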
   
   

