takaaki7 opened a new issue, #15140: URL: https://github.com/apache/druid/issues/15140
I'd like to introduce a table sampling feature. Specifically, it would be beneficial if we could sample a fixed set of shards. This would allow us to calculate the distinct count of data related to the secondary partition key with high accuracy.

For instance, consider a table storing user activity data on a website as follows:

```
Schema: __time, user_id, event_name, ...
Partition: day
Secondary partition: type=hash, key=user_id, shard_num=100
```

With normal row-level sampling, functions like `COUNT(DISTINCT user_id)` don't work. If we were to randomly select 50% of the rows from the table, the probability that at least one row with a particular user_id is included in the sample is not close to 50%, because a user with many rows is almost certain to appear at least once.

However, what if we targeted only shards 0-49 out of the 100 shards? Given that the hash function generally distributes data uniformly, a specific user_id would be part of the sample with a probability of roughly 50%.

Consider the following query, for example:

```
SELECT COUNT(DISTINCT user_id)
FROM table TABLESAMPLE(percentage=50, type=fixed_shard)
WHERE '2023-10-01T00:00:00Z' < __time AND __time < '2023-10-02T00:00:00Z'
```

In the case where the segment files exist as:

```
<2023-10-01T00:00:00Z>_0
<2023-10-01T00:00:00Z>_1
<2023-10-01T00:00:00Z>_...
<2023-10-01T00:00:00Z>_99
<2023-10-02T00:00:00Z>_0
<2023-10-02T00:00:00Z>_1
<2023-10-02T00:00:00Z>_...
<2023-10-02T00:00:00Z>_99
```

The idea is to target only `<2023-10-01T00:00:00Z>_<0-49>` and `<2023-10-02T00:00:00Z>_<0-49>`.
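To illustrate the difference between the two sampling strategies, here is a minimal Python simulation. It is only a sketch under stated assumptions: `shard_of` and `simulate` are hypothetical names, MD5 is used as a stand-in for the secondary-partition hash (it is not Druid's actual hash function), and every user is assumed to have the same number of events.

```python
import hashlib
import random


def shard_of(user_id: str, shard_num: int = 100) -> int:
    """Map a user_id to a shard. MD5 is an assumed stand-in, not Druid's hash."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_num


def simulate(num_users: int = 10_000, events_per_user: int = 20, seed: int = 0):
    """Return (true distinct users, distinct users seen by 50% row sampling,
    distinct users seen by reading only shards 0-49)."""
    rng = random.Random(seed)
    rows = [f"user-{u}" for u in range(num_users) for _ in range(events_per_user)]

    # Row-level sampling: keep each row independently with probability 0.5.
    row_sample = {uid for uid in rows if rng.random() < 0.5}

    # Fixed-shard sampling: keep every row whose shard number is in 0-49.
    shard_sample = {uid for uid in rows if shard_of(uid) < 50}

    return num_users, len(row_sample), len(shard_sample)


total, by_row, by_shard = simulate()
print(f"true distinct users:       {total}")
print(f"seen by 50% row sampling:  {by_row}")
print(f"seen in shards 0-49:       {by_shard} (scaled x2: {by_shard * 2})")
```

Under these assumptions, row sampling sees nearly every user, so its distinct count cannot be scaled by the sampling rate, while the shard-based count is close to half of the true value and can be doubled to estimate it.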
