Re: Proposal to get the NDV of a Range query through KMV

leerho Wed, 19 May 2021 16:50:38 -0700

Yiifeger,
I'm sorry, but I'm having difficulty understanding your question and what
you are trying to do.

In our library, all of our Theta Sketches are derivatives of the KMV sketch
as explained on our website.  And, on our website under *Sketch Families /
Distinct Counting / Theta Sketches / Theta Sketch Theory* you will find
several documents that explain the math behind theta sketches. Of
particular relevance would be the original published paper on the Theta
Sketch Framework <https://arxiv.org/abs/1510.01455v2>. A simplified
development of the math can be found in the short paper Theta Sketch
Equations
<https://github.com/apache/datasketches-website/blob/master/docs/pdf/ThetaSketchEquations.pdf>,
also from the website.
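
As a rough illustration of the KMV idea those documents build on, here is a
minimal Java sketch. It is not the library's implementation: the class
ToyKmv and the hash function hash01 are hypothetical names made up for this
example, and the estimator shown is the textbook KMV form, (k - 1) divided by
the k-th smallest hash value, with hashes scaled to [0, 1).

```java
import java.util.TreeSet;

/** Toy KMV sketch for illustration only; not the DataSketches API. */
final class ToyKmv {
    private final int k;
    // The k smallest distinct hash values seen so far, in sorted order.
    private final TreeSet<Double> minHashes = new TreeSet<>();

    ToyKmv(int k) {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        this.k = k;
    }

    /** Hypothetical hash: SplitMix64-style mixing of the key, scaled to [0, 1). */
    static double hash01(long key) {
        long z = key + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        z ^= (z >>> 31);
        return (z >>> 11) * 0x1.0p-53;
    }

    /** Offers one item; duplicates hash identically and are ignored by the set. */
    void update(long key) {
        double h = hash01(key);
        if (minHashes.size() < k) {
            minHashes.add(h);
        } else if (h < minHashes.last() && minHashes.add(h)) {
            minHashes.pollLast(); // evict the largest so only k hashes remain
        }
    }

    /** Exact count while fewer than k hashes are held, else (k - 1) / k-th minimum. */
    double estimate() {
        if (minHashes.size() < k) return minHashes.size();
        return (k - 1) / minHashes.last();
    }

    public static void main(String[] args) {
        ToyKmv sketch = new ToyKmv(1024);
        for (long i = 0; i < 100_000; i++) sketch.update(i);
        System.out.println("estimated NDV: " + sketch.estimate());
    }
}
```

With k retained hashes the relative error of such an estimate is roughly
1/sqrt(k), so larger k trades memory for accuracy.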

I hope this helps,

Lee.

On Wed, May 19, 2021 at 9:39 AM yiifeger wu <[email protected]> wrote:

> Hi all,
>      I recently learned about the DataSketches project, which is brilliant,
> but some questions came up as I prepared to use it.
>      I want to get the count of distinct values for a range query in my
> project. After studying the KMV algorithm as introduced in the
> DataSketches project, we propose an adjusted KMV algorithm to solve it.
>       The original KMV stores only the k minimum hash values and then
> computes the NDV from their average density. What if we also store the
> original values whose hash values are among the k minimum hash values?
> Then we could estimate the number of distinct values in the range query
> through
>
>>     ndv_in_the_range = (ndv_in_range_for_k_minimum / k) * total_ndv
>
>
>     If the idea works and the library does not already implement it, could
> you give some advice on how to implement it in this project? (P.S. I would
> prefer the Java version.)
>      Thanks in advance for your help!
>
>
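
The proposal quoted above can be sketched in Java roughly as follows. This is
only one reading of it, under the assumption that the retained entries behave
like a uniform sample of the distinct values; it is not something the library
provides. The class ToyRangeKmv and its methods are hypothetical names, the
hash is an illustrative stand-in, and values are taken to be long integers so
that a range is simply [lo, hi].

```java
import java.util.TreeMap;

/** Toy KMV variant that keeps the original value next to each retained hash. */
final class ToyRangeKmv {
    private final int k;
    // hash value -> original value, for the k smallest hash values seen so far
    private final TreeMap<Double, Long> retained = new TreeMap<>();

    ToyRangeKmv(int k) {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        this.k = k;
    }

    /** Hypothetical hash: SplitMix64-style mixing of the value, scaled to [0, 1). */
    static double hash01(long value) {
        long z = value + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        z ^= (z >>> 31);
        return (z >>> 11) * 0x1.0p-53;
    }

    void update(long value) {
        double h = hash01(value);
        if (retained.size() < k) {
            retained.put(h, value);
        } else if (h < retained.lastKey() && !retained.containsKey(h)) {
            retained.put(h, value);
            retained.pollLastEntry(); // drop the entry with the largest hash
        }
    }

    /** total_ndv: exact below k entries, else (k - 1) / k-th minimum hash. */
    double totalEstimate() {
        if (retained.size() < k) return retained.size();
        return (k - 1) / retained.lastKey();
    }

    /** ndv_in_the_range = (ndv_in_range_for_k_minimum / k) * total_ndv. */
    double rangeEstimate(long lo, long hi) {
        long inRange = retained.values().stream()
                .filter(v -> v >= lo && v <= hi)
                .count();
        if (retained.size() < k) return inRange; // every distinct value is held
        return ((double) inRange / k) * totalEstimate();
    }

    public static void main(String[] args) {
        ToyRangeKmv sketch = new ToyRangeKmv(2048);
        for (long v = 0; v < 100_000; v++) sketch.update(v);
        System.out.println("estimated NDV in [0, 24999]: "
                + sketch.rangeEstimate(0, 24_999));
    }
}
```

Note that the accuracy of the range estimate depends on how many retained
entries fall inside the range, so narrow ranges that capture few of the k
entries will give noisy results.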
