What we've found in our code (I'm primarily using the Java version of Theta sketches, although I've also been experimenting with the C++ version) is that the merge time depends heavily on the overall size of the sketch. The Theta sketch works by keeping a sample of the values inserted into it, so its size will vary with the number of values you have inserted. While there are fewer values than K, the sketch is in exact mode, retaining ALL the values, but once you exceed K, you have reached the maximum sketch size and sampling starts, putting the sketch into estimation mode. The data set I primarily deal with has a power-law distribution - the majority of my sketches are smaller and in exact mode, some with only a couple of unique values, while a small portion are fully populated and in estimation mode. Unioning the sketches that are in estimation mode is obviously slower than unioning those in exact mode, since an exact-mode sketch typically has less data to process.
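The exact-vs-estimation distinction above can be illustrated with a toy K-minimum-values (KMV) sketch, a simplified cousin of the real Theta sketch. To be clear, this is NOT the Apache DataSketches implementation (which is far more sophisticated), and the class and method names here (`ToyKmvSketch`, `isEstimationMode`, `estimate`) are made up for illustration; it just shows why a sketch is exact below K retained values and becomes a sampler above K:

```java
import java.util.TreeSet;

public class ThetaModeDemo {

    // Toy KMV sketch: keeps the K smallest hash values seen so far.
    // Below K distinct values it is "exact"; once full it samples,
    // which is the "estimation mode" discussed above.
    static class ToyKmvSketch {
        final int k;
        final TreeSet<Long> minHashes = new TreeSet<>();

        ToyKmvSketch(int k) { this.k = k; }

        void update(long item) {
            long h = hash(item);
            if (minHashes.size() < k) {
                minHashes.add(h);                // exact mode: retain everything
            } else if (h < minHashes.last()) {
                minHashes.add(h);                // estimation mode: keep only the
                if (minHashes.size() > k) {      // k smallest hashes
                    minHashes.pollLast();
                }
            }
        }

        boolean isEstimationMode() { return minHashes.size() >= k; }

        double estimate() {
            if (!isEstimationMode()) return minHashes.size();
            // Fraction of the hash space covered by the retained values:
            double theta = (double) minHashes.last() / Long.MAX_VALUE;
            return (k - 1) / theta;              // standard KMV estimator
        }

        // splitmix64 finalizer, masked to stay non-negative
        static long hash(long x) {
            x += 0x9e3779b97f4a7c15L;
            x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
            x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
            return (x ^ (x >>> 31)) & Long.MAX_VALUE;
        }
    }

    public static void main(String[] args) {
        ToyKmvSketch s = new ToyKmvSketch(1 << 10); // lgK = 10, K = 1024
        for (long i = 0; i < 100; i++) s.update(i);
        System.out.println("100 uniques: estimation mode = " + s.isEstimationMode()
                + ", estimate = " + s.estimate());
        for (long i = 0; i < 1_000_000; i++) s.update(i);
        System.out.println("1M uniques:  estimation mode = " + s.isEstimationMode()
                + ", estimate ~ " + Math.round(s.estimate()));
    }
}
```

This also shows why merge cost tracks sketch size: a union has to walk every retained hash, so estimation-mode sketches cost roughly K operations each, while a tiny exact-mode sketch with a handful of values is almost free - and why halving lgK quarters the work for fully populated sketches.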
Unioning sketches with a smaller K is faster, because the overall size of the sketches is smaller. Sketches with logK=10 have a maximum size 4 times smaller than logK=12. If most of your sketches are in estimation mode, this could mean a 4x difference in performance, but the difference will be smaller the more sketches you have in exact mode. The difference between logK=10 and logK=12 should be minimal if your sketches are in exact mode. If they are in estimation mode, they should give similar (but not exactly the same) estimates, but with different error bounds.

I've been doing some experimentation with the C++ version of Theta sketches and another database integration. I'm using logK 14 and 16, and seeing 100k+/sec when doing merge operations, but we definitely haven't finished optimizing that integration and haven't run it on production-quality hardware yet.

Will

<http://www.verizonmedia.com>
Will Lauer
Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering
M 508 561 6427
1908 S. First St, Champaign, IL 61822
<http://www.facebook.com/verizonmedia> <http://twitter.com/verizonmedia> <https://www.linkedin.com/company/verizon-media/> <http://www.instagram.com/verizonmedia>

On Fri, Jul 9, 2021 at 12:32 PM Matthew Farkas <[email protected]> wrote:

> Hi Will,
>
> Thanks for the quick response! For your questions:
>
> 1. Yup, looking at Theta sketches for set operations.
> 2. So I'm creating the initial sketches in dataflow like so, with K=4096 (so lgK=12) right now:
>    UpdateSketch userSketch = UpdateSketch.builder().build(K);
>    userSketch.update(requestValue.userId())
>    // pass to PG using ByteString.copyFrom(userSketch.compact().toByteArray());
> 3. By "sketch size", do you mean the number of uniques in each sketch? If so, there's a good bit of variance in sketch size, as I'm segmenting (by dimensions like demo, geo, etc.) users and saving a sketch for each segment.
> 4. I do not know the proportion that are in direct vs. estimation. (Admittedly, I'm not familiar with the differences there, will check it out.) Is this explicitly set? Or maybe determined based on K & sketch size.
>
> One thing I found interesting was that doing a `THETA_SKETCH_UNION(user_id_sketch, 10)` on all sketches vastly improved query time (70s to 6s), and produced the exact same results. I expected the results to be the same, since lgK=12 when originally creating the sketches, but I'm not sure why that would improve query time.
>
> Thanks again!
>
> *Matthew Z. Farkas*
> Data Science @ Spotify
> MS Northwestern University, BS Georgia Tech
> m: (770) 337-2709
> e: [email protected]
> <https://www.linkedin.com/in/matthewzfarkas>
>
> On Fri, Jul 9, 2021 at 1:13 PM Will Lauer <[email protected]> wrote:
>
>> Welcome Matt!
>>
>> One of the others is probably best qualified to answer your question, but I'll chime in early with a couple of questions. The performance of merging depends on many factors, including type of sketch and sketch size. I'm assuming from the link you posted that you are dealing with Theta sketches, for count unique operations. Can you confirm that? If so, what's the logK you are using? What is the sketch size? Do you happen to know what proportion of your sketches are in estimation mode vs exact mode?
>>
>> Will
>>
>> On Fri, Jul 9, 2021 at 12:02 PM Matthew Farkas <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> My name is Matt and I'm a data engineer at Spotify. I'm testing out trying Data Sketches with Postgres, and running into some performance issues. I'm seeing merge times much slower than what I'm seeing in the docs here <https://datasketches.apache.org/docs/Theta/ThetaMergeSpeed.html> (millions of sketches/sec).
>>>
>>> In my case, I've pre-computed many sketches, inserted them into PG, then I'm running queries in PG and doing the merging there. My hunch is that there's something wrong with my Postgres configs, which I've tried tweaking extensively but haven't been able to improve query time.
>>>
>>> My question is if anyone knows what type of performance can be expected in Postgres and if anyone has any examples/tips in general from their implementations.
>>>
>>> Also, this is my first message to this list, so please let me know if I should be directing it anywhere else!
>>>
>>> Thanks!!
>>> Matt
