People are using all sorts of engines: Druid, Spark, Hive, Pig, Splice Machine, and so on. I would love to know whether there is any Greenplum installation using DataSketches.
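For a rough sense of scale, the timings Matt reported further down the thread can be turned into merge rates. What follows is only back-of-the-envelope arithmetic in plain Java. The class and method names are made up for illustration, and it assumes that "13k" means 13,000 sketches, "13m" means 13,000,000 sketches, and "~2min" means 120 seconds.

```java
// Back-of-the-envelope merge rates from the timings quoted in this thread.
// MergeThroughput and sketchesPerSecond are hypothetical names for this
// illustration; they are not part of any DataSketches API.
final class MergeThroughput {

    /** Sketches merged per second, given a sketch count and elapsed milliseconds. */
    static double sketchesPerSecond(long sketchCount, double elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return sketchCount * 1000.0 / elapsedMillis;
    }

    private static void report(long sketchCount, double elapsedMillis) {
        System.out.printf("%d sketches in %.0f ms: about %.0f per second%n",
                sketchCount, elapsedMillis, sketchesPerSecond(sketchCount, elapsedMillis));
    }

    public static void main(String[] args) {
        report(168, 21);               // 168 sketches unioned in 21 ms
        report(13_000, 1_000);         // "13k" at the fast end of 1-2 s (assumed 13,000)
        report(13_000, 2_000);         // "13k" at the slow end of 1-2 s
        report(13_000_000L, 120_000);  // "13m" in ~2 min (assumed 13,000,000 and 120 s)
    }
}
```

Under those assumptions the rates come out at about 8,000 sketches per second for the 168-sketch case, 6,500 to 13,000 per second for 13k, and roughly 108,000 per second for 13m, all well short of the "millions of sketches/sec" figure from the docs.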
On Mon, Jul 12, 2021 at 10:00 AM Matthew Farkas wrote:

> Ah, thanks, Alexander.
>
> That makes sense. I started digging into CPU usage and noticed that
> queries can only use one CPU in my single-host case.
>
> So it sounds like, to use DataSketches at this scale, everyone is currently
> using Druid (if no one is using Greenplum)?
>
> *Matthew Z. Farkas*
>
> Data Science @ Spotify
> MS Northwestern University, BS Georgia Tech
>
> m: (770) 337-2709
>
> <https://www.linkedin.com/in/matthewzfarkas>
>
> On Mon, Jul 12, 2021 at 12:36 PM Alexander Saydakov wrote:
>
>> Matt,
>> I assume you are running a single-host PostgreSQL. If so, your numbers
>> don't look too bad, I would say. You may want to consider the distributed
>> variant, which is Greenplum. However, I am not aware of any deployment of
>> our extension in such environments.
>>
>> On Fri, Jul 9, 2021 at 2:41 PM Will Lauer wrote:
>>
>>> Matt,
>>>
>>> In my production case, I'm building sketches using Java in an ETL
>>> pipeline and then loading them into a Druid datamart, which aggregates them
>>> together when it receives queries. Queries might aggregate several hundred
>>> sketches all the way to many millions (the average number is probably in
>>> the 100's of thousands), depending on the time frame involved in the query
>>> and the particular dimensions selected. The majority of our queries (95%+)
>>> return in less than 10 seconds. This is running on a cluster with between
>>> 150 and 200 nodes.
>>>
>>> We are investigating implementing this in an alternative database, but
>>> haven't gotten that database working in a performant way yet (due to some
>>> problems with the database's API, not due to sketches), but are working
>>> with the vendor to find some workarounds.
>>>
>>> Will
>>>
>>> <http://www.verizonmedia.com>
>>>
>>> Will Lauer
>>>
>>> Senior Principal Architect, Audience & Advertising Reporting
>>> Data Platforms & Systems Engineering
>>>
>>> M 508 561 6427
>>> 1908 S. First St
>>> Champaign, IL 61822
>>>
>>> On Fri, Jul 9, 2021 at 2:57 PM Matthew Farkas wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm running PG 13.3 and pg-datasketches 1.3.0 (I built from master
>>>> after running into this issue
>>>> <https://github.com/apache/datasketches-postgresql/issues/34>).
>>>>
>>>> So some rough numbers:
>>>> - I have a week-hour table with 168 user_id sketches, all would be
>>>>   estimates and not exact, and that is taking 21ms for unioning those
>>>>   168 sketches.
>>>> - 13k sketches is taking 1-2s
>>>> - 13m sketches was taking ~2min yesterday (I must have updated a config
>>>>   that hurt this, though, I'm cancelling the query after 9mins now)
>>>>
>>>> Will-
>>>> Thanks for the background. So you're combining the sketches in Java-
>>>> are you retrieving them from a db? Also, how many sketches are you
>>>> typically merging?
>>>>
>>>> On Fri, Jul 9, 2021 at 1:53 PM Alexander Saydakov wrote:
>>>>
>>>>> Hi Matt,
>>>>> What version of PostgreSQL and DataSketches are you using?
>>>>> Could you give some numbers? How many sketches? How long does the
>>>>> union take?
>>>>>
>>>>> The graph you are referring to was based on performance in Druid, I
>>>>> believe. So it may or may not be transferable to PostgreSQL. We did not do
>>>>> a large-scale test in PostgreSQL.
>>>>>
>>>>> Also we have a performance improvement in the works, which is supposed
>>>>> to avoid some cost of deserialization of Theta sketches.
>>>>> It might speed things up 10-15% according to some preliminary testing.
>>>>>
>>>>> On Fri, Jul 9, 2021 at 10:32 AM Matthew Farkas wrote:
>>>>>
>>>>>> Hi Will,
>>>>>>
>>>>>> Thanks for the quick response! For your questions:
>>>>>>
>>>>>> 1. Yup, looking at Theta sketches for set operations.
>>>>>> 2. So I'm creating the initial sketches in dataflow like so, with
>>>>>>    K=4096 (so lgK=12) right now:
>>>>>>      UpdateSketch userSketch = UpdateSketch.builder().build(K);
>>>>>>      userSketch.update(requestValue.userId());
>>>>>>      // pass to PG using
>>>>>>      ByteString.copyFrom(userSketch.compact().toByteArray());
>>>>>> 3. By "sketch size", do you mean the number of uniques in each
>>>>>>    sketch? If so, there's a good bit of variance in sketch size, as I'm
>>>>>>    segmenting users (by dimensions like demo, geo, etc.) and saving a
>>>>>>    sketch for each segment.
>>>>>> 4. I do not know the proportion that are in direct vs. estimation.
>>>>>>    (Admittedly, I'm not familiar with the differences there, will check it
>>>>>>    out.) Is this explicitly set? Or maybe determined based on K & sketch
>>>>>>    size.
>>>>>>
>>>>>> One thing I found interesting was that doing a
>>>>>> `THETA_SKETCH_UNION(user_id_sketch, 10)` on all sketches vastly improved
>>>>>> query time (70s to 6s), and produced the exact same results. I expected
>>>>>> the results to be the same, since lgK=12 when originally creating the
>>>>>> sketches, but I'm not sure why that would improve query time.
>>>>>>
>>>>>> Thanks again!
>>>>>>
>>>>>> On Fri, Jul 9, 2021 at 1:13 PM Will Lauer wrote:
>>>>>>
>>>>>>> Welcome Matt!
>>>>>>>
>>>>>>> One of the others is probably best qualified to answer your
>>>>>>> question, but I'll chime in early with a couple of questions. The
>>>>>>> performance of merging depends on many factors, including type of sketch
>>>>>>> and sketch size. I'm assuming from the link you posted that you are
>>>>>>> dealing with Theta sketches, for count unique operations. Can you
>>>>>>> confirm that? If so, what's the logK you are using? What is the sketch
>>>>>>> size? Do you happen to know what proportion of your sketches are in
>>>>>>> estimation mode vs exact mode?
>>>>>>>
>>>>>>> Will
>>>>>>>
>>>>>>> On Fri, Jul 9, 2021 at 12:02 PM Matthew Farkas wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> My name is Matt and I'm a data engineer at Spotify. I'm trying out
>>>>>>>> DataSketches with Postgres, and running into some
>>>>>>>> performance issues.
>>>>>>>> I'm seeing merge times much slower than what I'm seeing
>>>>>>>> in the docs here
>>>>>>>> <https://datasketches.apache.org/docs/Theta/ThetaMergeSpeed.html>
>>>>>>>> (millions of sketches/sec).
>>>>>>>>
>>>>>>>> In my case, I've pre-computed many sketches, inserted them into PG,
>>>>>>>> then I'm running queries in PG and doing the merging there. My hunch is
>>>>>>>> that there's something wrong with my Postgres configs, which I've tried
>>>>>>>> tweaking extensively but haven't been able to improve query time.
>>>>>>>>
>>>>>>>> My question is if anyone knows what type of performance can be
>>>>>>>> expected in Postgres and if anyone has any examples/tips in general
>>>>>>>> from their implementations.
>>>>>>>>
>>>>>>>> Also, this is my first message to this list, so please let me know
>>>>>>>> if I should be directing it anywhere else!
>>>>>>>>
>>>>>>>> Thanks!!
>>>>>>>> Matt
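As a footnote to the K=4096 (lgK=12) and `THETA_SKETCH_UNION(user_id_sketch, 10)` figures mentioned above: the nominal number of entries is k = 2^lgK, and the relative standard error of a Theta sketch estimate is commonly approximated as 1/sqrt(k). The snippet below is a minimal illustration in plain Java. The names (ThetaSizing, nominalEntries, approxRse) are made up, the 1/sqrt(k) rule of thumb is an assumption brought in here rather than something stated in the thread, and treating the union's second argument as lgK follows Matt's reading.

```java
// Relates lgK to nominal entries k and to an approximate relative standard error.
// ThetaSizing, nominalEntries and approxRse are hypothetical names for this
// illustration; they are not part of the DataSketches library.
final class ThetaSizing {

    /** Nominal entries k = 2^lgK. */
    static int nominalEntries(int lgK) {
        if (lgK < 0 || lgK > 30) {
            throw new IllegalArgumentException("lgK must be between 0 and 30");
        }
        return 1 << lgK;
    }

    /** Rule-of-thumb relative standard error, taken here as 1/sqrt(k) (an assumption). */
    static double approxRse(int lgK) {
        return 1.0 / Math.sqrt(nominalEntries(lgK));
    }

    public static void main(String[] args) {
        // lgK=12 is the build-time setting from the thread; 10 is the value
        // passed as the second argument to the union.
        for (int lgK : new int[] {12, 10}) {
            System.out.printf("lgK=%d -> k=%d, approx RSE=%.2f%%%n",
                    lgK, nominalEntries(lgK), 100.0 * approxRse(lgK));
        }
    }
}
```

With these inputs, lgK=12 gives k=4096 and an approximate error of about 1.6%, while lgK=10 gives k=1024 and about 3.1%.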
