Hi Lee,
   Thanks for your reply at first.
   As for what I proposed, let's take a look for a simple example

   Question: get the distinct value of that satisfy (vi > 5 && vi < 7) in
the below array
   Input: V: [10,8,7,7,1,4,5,6,8,9]

   First, if we wanna get the distinct value for the whole data through
KMV, we should get the hash value for each value in the array V and keep
the smallest K hashes。
   In this example, let K =3。
   V:           [10 , 8   ,  7 ,  7 , 1   , 4   , 5    ,6    ,8   ,9 ]
   hash:         [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21 ,0.23,0.12 ]
   the smallest 3 hash : [0.12, 0.2, 0.21]
   (Of course, the real hash function will be more random and evenly
distributed)

   Secondly, we estimate the distinct value for the whole data,
    NDV = (3-1)/0.21 ~= 9.5

   Then,if we wanna get the distinct value of that satisfy (vi > 5 && vi <
7) , we can realize it through storing extra 3 origin value for each
smallest hash:
    smallest 3 hash:   [0.12, 0.2, 0.21]
    smallest 3 values: [9   , 10 , 6]
    the value satisfy (vi > 5 && vi < 7) is [6]

   At last, we can estimate the distinct value of that (vi > 5 && vi < 7):
    NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16
  in which 1.3 represents the fraction of the smallest 3 hash, that satisfy
(vi > 5 && vi < 7) .


leerho <[email protected]> 于2021年5月20日周四 上午7:50写道:

> Yiifeger,
> I'm sorry, but I'm having difficulty in understanding your question or
> what you are trying to do.
>
> In our library, all of our Theta Sketches are derivatives of the KMV
> sketch as explained on our website.  And, on our website under *Sketch
> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory* you
> will find several documents that explain the math behind theta sketches. Of
> particular relevance would be the original published paper on the Theta
> Sketch Framework <https://arxiv.org/abs/1510.01455v2>. A simplified
> development of the math can be found in the short paper Theta Sketch
> Equations
> <https://github.com/apache/datasketches-website/blob/master/docs/pdf/ThetaSketchEquations.pdf>,
> also from the website.
>
> I hope this helps,
>
> Lee.
>
> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <[email protected]> wrote:
>
>> Hi all,
>>      I recently learned about the DataSketch project that is so
>> brilliant, but questions occurred when prepared to utilize it.
>>      I want to get the count of distinct values for a range query in my
>> project. After some study about the KMV algorithm according to the
>> introduction in DataSketch project, we propose an adjusted KMV algorithm
>> to solve it.
>>       In origin KMV, it only stores K  hash_values and then computes the
>> NDV through the average density. So what if we store extra origin values
>> for which hash_value contained by the k -Minimum hash_values ?  So we can
>> estimate the distinct value of the range query through
>>
>>>           *  ndv_in_the_range = ( ndv_in_range_for_k_minimum / k)  *
>>> total_ndv*
>>
>>
>>     So if the idea works and the Sketch does not  implement it, could you
>> give some advice
>> on how to implement it in this project (P.s prefer the java version).
>>      Thanks for your help in advance!
>>
>>

-- 
Best Regards
Yifei Wu

Reply via email to