Thanks for your prompt reply and it helps a lot ! leerho <[email protected]> 于2021年5月21日周五 上午3:17写道:
> OK, I stand corrected! Will and Jon are correct. This can be solved > using the tuple sketch. The resulting RSE, however, will be similar to the > RSE obtained from an intersection. > > Lee. > > > On Thu, May 20, 2021 at 11:04 AM leerho <[email protected]> wrote: > >> If I understand the use case correctly, perhaps one way to handle this is >> to filter the values prior to feeding them into a sketch. If I have >> millions of input values between 0.0 and 10.0 and only wish to uniquely >> count the values > 5 and < 7, then do this: >> >> if (vi > 5 && vi < 7) { sketch.update(vi); } >> >> >> There is no ordering or distance relationship between the hash of 5 and >> the hash of 7, nor can one assume that the number of hash values between >> the hash of 5 and the hash of 7 has any relationship to the number of >> unique input values between 5 and 7. So any attempt to perform filtering >> on the hash values themselves will fail miserably. And as Jon commented, >> we strongly discourage users attempting to manipulate the hash values. >> >> I think the use-case that Will mentioned using Tuple Sketches is similar >> to the Engagement Example >> <https://datasketches.apache.org/docs/Tuple/TupleEngagementExample.html> >> (Will, >> correct me if I'm wrong! ). This case assumes that the input domain can be >> neatly divided into a small finite number of regions (in the example, 30 >> days). It is not clear that Yiifeger's use-case qualifies. I think what >> he wants is to be able to sketch the entire domain of continuous >> numbers between 0.0 and 10.0, and then after the sketch is loaded, somehow >> apply some query to the sketch to obtain the number of unique values >> between (pi() and 2*pi()). I don't think a Tuple sketch would work here, >> nor can I think of a general solution to this problem. >> >> Lee. >> >> On Thu, May 20, 2021 at 9:55 AM Jon Malkin <[email protected]> wrote: >> >>> Echoing what Will said, you should use a tuple sketch to store the raw >>> values. You want to avoid any situation where you're storing hash values >>> yourself. >>> >>> jon >>> >>> On Thu, May 20, 2021 at 7:09 AM Will Lauer >>> <[email protected]> wrote: >>> >>>> If you want to do something like this, you should be looking at tuple >>>> sketches. Tuple sketches use the same KVM algorithm as Theta sketches, but >>>> allow you to associate extra data with each entry. You can then apply >>>> operations like filtering based on the tuple value, getting the estimate >>>> for only those values that satisfy your tuple predicate. We use tuple >>>> sketches to perform a very similar calculation to the one you are >>>> describing. >>>> >>>> Will >>>> >>>> <http://www.verizonmedia.com> >>>> >>>> Will Lauer >>>> >>>> Senior Principal Architect, Audience & Advertising Reporting >>>> Data Platforms & Systems Engineering >>>> >>>> M 508 561 6427 >>>> 1908 S. First St >>>> Champaign, IL 61822 >>>> >>>> <http://www.facebook.com/verizonmedia> >>>> <http://twitter.com/verizonmedia> >>>> <https://www.linkedin.com/company/verizon-media/> >>>> <http://www.instagram.com/verizonmedia> >>>> >>>> >>>> >>>> On Thu, May 20, 2021 at 6:27 AM yiifeger wu <[email protected]> >>>> wrote: >>>> >>>>> Hi Lee, >>>>> Thanks for your reply at first. >>>>> As for what I proposed, let's take a look for a simple example >>>>> >>>>> Question: get the distinct value of that satisfy (vi > 5 && vi < 7) >>>>> in the below array >>>>> Input: V: [10,8,7,7,1,4,5,6,8,9] >>>>> >>>>> First, if we wanna get the distinct value for the whole data >>>>> through KMV, we should get the hash value for each value in the array V >>>>> and >>>>> keep the smallest K hashes。 >>>>> In this example, let K =3。 >>>>> V: [10 , 8 , 7 , 7 , 1 , 4 , 5 ,6 ,8 ,9 ] >>>>> hash: [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21 >>>>> ,0.23,0.12 ] >>>>> the smallest 3 hash : [0.12, 0.2, 0.21] >>>>> (Of course, the real hash function will be more random and evenly >>>>> distributed) >>>>> >>>>> Secondly, we estimate the distinct value for the whole data, >>>>> NDV = (3-1)/0.21 ~= 9.5 >>>>> >>>>> Then,if we wanna get the distinct value of that satisfy (vi > 5 && >>>>> vi < 7) , we can realize it through storing extra 3 origin value for each >>>>> smallest hash: >>>>> smallest 3 hash: [0.12, 0.2, 0.21] >>>>> smallest 3 values: [9 , 10 , 6] >>>>> the value satisfy (vi > 5 && vi < 7) is [6] >>>>> >>>>> At last, we can estimate the distinct value of that (vi > 5 && vi < >>>>> 7): >>>>> NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16 >>>>> in which 1.3 represents the fraction of the smallest 3 hash, that >>>>> satisfy (vi > 5 && vi < 7) . >>>>> >>>>> >>>>> leerho <[email protected]> 于2021年5月20日周四 上午7:50写道: >>>>> >>>>>> Yiifeger, >>>>>> I'm sorry, but I'm having difficulty in understanding your question >>>>>> or what you are trying to do. >>>>>> >>>>>> In our library, all of our Theta Sketches are derivatives of the KMV >>>>>> sketch as explained on our website. And, on our website under *Sketch >>>>>> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory* >>>>>> you will find several documents that explain the math behind theta >>>>>> sketches. Of particular relevance would be the original published paper >>>>>> on >>>>>> the Theta Sketch Framework >>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1510.01455v2&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=0-Tqpi8xjR_17PMhGfORmBlIQZ70P5jrjd8F2ENuFP8&e=>. >>>>>> A simplified development of the math can be found in the short paper >>>>>> Theta >>>>>> Sketch Equations >>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dwebsite_blob_master_docs_pdf_ThetaSketchEquations.pdf&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=WSAs5sqgluz5ZyA5pEe4doacA9Dcoc0BPD8aRgdhkmY&e=>, >>>>>> also from the website. >>>>>> >>>>>> I hope this helps, >>>>>> >>>>>> Lee. >>>>>> >>>>>> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi all, >>>>>>> I recently learned about the DataSketch project that is so >>>>>>> brilliant, but questions occurred when prepared to utilize it. >>>>>>> I want to get the count of distinct values for a range query >>>>>>> in my project. After some study about the KMV algorithm according to the >>>>>>> introduction in DataSketch project, we propose an adjusted KMV >>>>>>> algorithm to solve it. >>>>>>> In origin KMV, it only stores K hash_values and then computes >>>>>>> the NDV through the average density. So what if we store extra origin >>>>>>> values for which hash_value contained by the k -Minimum hash_values ? >>>>>>> So >>>>>>> we can estimate the distinct value of the range query through >>>>>>> >>>>>>>> * ndv_in_the_range = ( ndv_in_range_for_k_minimum / k) >>>>>>>> * total_ndv* >>>>>>> >>>>>>> >>>>>>> So if the idea works and the Sketch does not implement it, >>>>>>> could you give some advice >>>>>>> on how to implement it in this project (P.s prefer the java version). >>>>>>> Thanks for your help in advance! >>>>>>> >>>>>>> >>>>> >>>>> -- >>>>> Best Regards >>>>> Yifei Wu >>>>> >>>> -- Best Regards Yifei Wu
