Re: [E] Re: Proposal to get the NDV of a Range query through KMV

yiifeger wu Thu, 20 May 2021 20:27:12 -0700

Thanks for your prompt reply and it helps a lot !

leerho <[email protected]> 于2021年5月21日周五 上午3:17写道：


> OK, I stand corrected!  Will and Jon are correct.  This can be solved
> using the tuple sketch.  The resulting RSE, however, will be similar to the
> RSE obtained from an intersection.
>
> Lee.
>
>
> On Thu, May 20, 2021 at 11:04 AM leerho <[email protected]> wrote:
>
>> If I understand the use case correctly, perhaps one way to handle this is
>> to filter the values prior to feeding them into a sketch.  If I have
>> millions of input values between 0.0 and 10.0 and only wish to uniquely
>> count the values > 5 and < 7, then do this:
>>
>> if (vi > 5 && vi < 7) { sketch.update(vi); }
>>
>>
>> There is no ordering or distance relationship between the hash of 5 and
>> the hash of 7, nor can one assume that the number of hash values between
>> the hash of 5 and the hash of 7 has any relationship to the number of
>> unique input values between 5 and 7.  So any attempt to perform filtering
>> on the hash values themselves will fail miserably.  And as Jon commented,
>> we strongly discourage users attempting to manipulate the hash values.
>>
>> I think the use-case that Will mentioned using Tuple Sketches is similar
>> to the Engagement Example
>> <https://datasketches.apache.org/docs/Tuple/TupleEngagementExample.html> 
>> (Will,
>> correct me if I'm wrong! ).  This case assumes that the input domain can be
>> neatly divided into a small finite number of regions (in the example, 30
>> days).  It is not clear that Yiifeger's use-case qualifies.  I think what
>> he wants is to be able to sketch the entire domain of continuous
>> numbers between 0.0 and 10.0, and then after the sketch is loaded, somehow
>> apply some query to the sketch to obtain the number of unique values
>> between (pi() and 2*pi()).  I don't think a Tuple sketch would work here,
>> nor can I think of a general solution to this problem.
>>
>> Lee.
>>
>> On Thu, May 20, 2021 at 9:55 AM Jon Malkin <[email protected]> wrote:
>>
>>> Echoing what Will said, you should use a tuple sketch to store the raw
>>> values. You want to avoid any situation where you're storing hash values
>>> yourself.
>>>
>>>   jon
>>>
>>> On Thu, May 20, 2021 at 7:09 AM Will Lauer
>>> <[email protected]> wrote:
>>>
>>>> If you want to do something like this, you should be looking at tuple
>>>> sketches. Tuple sketches use the same KVM algorithm as Theta sketches, but
>>>> allow you to associate extra data with each entry. You can then apply
>>>> operations like filtering based on the tuple value, getting the estimate
>>>> for only those values that satisfy your tuple predicate. We use tuple
>>>> sketches to perform a very similar calculation to the one you are
>>>> describing.
>>>>
>>>> Will
>>>>
>>>> <http://www.verizonmedia.com>
>>>>
>>>> Will Lauer
>>>>
>>>> Senior Principal Architect, Audience & Advertising Reporting
>>>> Data Platforms & Systems Engineering
>>>>
>>>> M 508 561 6427
>>>> 1908 S. First St
>>>> Champaign, IL 61822
>>>>
>>>> <http://www.facebook.com/verizonmedia>
>>>> <http://twitter.com/verizonmedia>
>>>> <https://www.linkedin.com/company/verizon-media/>
>>>> <http://www.instagram.com/verizonmedia>
>>>>
>>>>
>>>>
>>>> On Thu, May 20, 2021 at 6:27 AM yiifeger wu <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Lee,
>>>>>    Thanks for your reply at first.
>>>>>    As for what I proposed, let's take a look for a simple example
>>>>>
>>>>>    Question: get the distinct value of that satisfy (vi > 5 && vi < 7)
>>>>> in the below array
>>>>>    Input: V: [10,8,7,7,1,4,5,6,8,9]
>>>>>
>>>>>    First, if we wanna get the distinct value for the whole data
>>>>> through KMV, we should get the hash value for each value in the array V 
>>>>> and
>>>>> keep the smallest K hashes。
>>>>>    In this example， let K =3。
>>>>>    V:           [10 , 8   ,  7 ,  7 , 1   , 4   , 5    ,6    ,8   ,9 ]
>>>>>    hash:         [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21
>>>>> ,0.23,0.12 ]
>>>>>    the smallest 3 hash : [0.12, 0.2, 0.21]
>>>>>    (Of course, the real hash function will be more random and evenly
>>>>> distributed)
>>>>>
>>>>>    Secondly, we estimate the distinct value for the whole data,
>>>>>     NDV = (3-1)/0.21 ~= 9.5
>>>>>
>>>>>    Then，if we wanna get the distinct value of that satisfy (vi > 5 &&
>>>>> vi < 7) , we can realize it through storing extra 3 origin value for each
>>>>> smallest hash：
>>>>>     smallest 3 hash:   [0.12, 0.2, 0.21]
>>>>>     smallest 3 values: [9   , 10 , 6]
>>>>>     the value satisfy (vi > 5 && vi < 7) is [6]
>>>>>
>>>>>    At last, we can estimate the distinct value of that (vi > 5 && vi <
>>>>> 7):
>>>>>     NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16
>>>>>   in which 1.3 represents the fraction of the smallest 3 hash, that
>>>>> satisfy (vi > 5 && vi < 7) .
>>>>>
>>>>>
>>>>> leerho <[email protected]> 于2021年5月20日周四 上午7:50写道：
>>>>>
>>>>>> Yiifeger,
>>>>>> I'm sorry, but I'm having difficulty in understanding your question
>>>>>> or what you are trying to do.
>>>>>>
>>>>>> In our library, all of our Theta Sketches are derivatives of the KMV
>>>>>> sketch as explained on our website.  And, on our website under *Sketch
>>>>>> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory*
>>>>>> you will find several documents that explain the math behind theta
>>>>>> sketches. Of particular relevance would be the original published paper 
>>>>>> on
>>>>>> the Theta Sketch Framework
>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1510.01455v2&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=0-Tqpi8xjR_17PMhGfORmBlIQZ70P5jrjd8F2ENuFP8&e=>.
>>>>>> A simplified development of the math can be found in the short paper 
>>>>>> Theta
>>>>>> Sketch Equations
>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dwebsite_blob_master_docs_pdf_ThetaSketchEquations.pdf&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=WSAs5sqgluz5ZyA5pEe4doacA9Dcoc0BPD8aRgdhkmY&e=>,
>>>>>> also from the website.
>>>>>>
>>>>>> I hope this helps,
>>>>>>
>>>>>> Lee.
>>>>>>
>>>>>> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>      I recently learned about the DataSketch project that is so
>>>>>>> brilliant, but questions occurred when prepared to utilize it.
>>>>>>>      I want to get the count of distinct values for a range query
>>>>>>> in my project. After some study about the KMV algorithm according to the
>>>>>>> introduction in DataSketch project, we propose an adjusted KMV
>>>>>>> algorithm to solve it.
>>>>>>>       In origin KMV, it only stores K  hash_values and then computes
>>>>>>> the NDV through the average density. So what if we store extra origin
>>>>>>> values for which hash_value contained by the k -Minimum hash_values ?  
>>>>>>> So
>>>>>>> we can estimate the distinct value of the range query through
>>>>>>>
>>>>>>>>           *  ndv_in_the_range = ( ndv_in_range_for_k_minimum / k)
>>>>>>>> *  total_ndv*
>>>>>>>
>>>>>>>
>>>>>>>     So if the idea works and the Sketch does not  implement it,
>>>>>>> could you give some advice
>>>>>>> on how to implement it in this project (P.s prefer the java version).
>>>>>>>      Thanks for your help in advance!
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Best Regards
>>>>> Yifei Wu
>>>>>
>>>>

-- 
Best Regards
Yifei Wu

Re: [E] Re: Proposal to get the NDV of a Range query through KMV

Reply via email to