What we've found in our code (I'm primarily using the Java version of Theta sketches, although I've also been experimenting with the C++ version) is that the merge time depends heavily on the overall size of the sketch. The Theta sketch works by keeping a sample of the values inserted into it, so its size will vary with the number of values you have inserted. While there are fewer values than K, the sketch is in exact mode, retaining ALL the values, but once you exceed K, you have reached the maximum sketch size and sampling starts, putting the sketch into estimation mode. The data set I primarily deal with has a power-law distribution - the majority of my sketches are smaller and in exact mode, some with only a couple of unique values, while a small portion are fully populated and in estimation mode. Unioning the sketches that are in estimation mode is obviously slower than unioning those in exact mode, since an exact-mode sketch typically has less data to process.
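The exact-vs-estimation distinction above can be illustrated with a toy K-minimum-values (KMV) sketch, a simplified cousin of the real Theta sketch. To be clear, this is NOT the Apache DataSketches implementation (which is far more sophisticated), and the class and method names here (`ToyKmvSketch`, `isEstimationMode`, `estimate`) are made up for illustration; it just shows why a sketch is exact below K retained values and becomes a sampler above K:

```java
import java.util.TreeSet;

public class ThetaModeDemo {

    // Toy KMV sketch: keeps the K smallest hash values seen so far.
    // Below K distinct values it is "exact"; once full it samples,
    // which is the "estimation mode" discussed above.
    static class ToyKmvSketch {
        final int k;
        final TreeSet<Long> minHashes = new TreeSet<>();

        ToyKmvSketch(int k) { this.k = k; }

        void update(long item) {
            long h = hash(item);
            if (minHashes.size() < k) {
                minHashes.add(h);                // exact mode: retain everything
            } else if (h < minHashes.last()) {
                minHashes.add(h);                // estimation mode: keep only the
                if (minHashes.size() > k) {      // k smallest hashes
                    minHashes.pollLast();
                }
            }
        }

        boolean isEstimationMode() { return minHashes.size() >= k; }

        double estimate() {
            if (!isEstimationMode()) return minHashes.size();
            // Fraction of the hash space covered by the retained values:
            double theta = (double) minHashes.last() / Long.MAX_VALUE;
            return (k - 1) / theta;              // standard KMV estimator
        }

        // splitmix64 finalizer, masked to stay non-negative
        static long hash(long x) {
            x += 0x9e3779b97f4a7c15L;
            x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
            x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
            return (x ^ (x >>> 31)) & Long.MAX_VALUE;
        }
    }

    public static void main(String[] args) {
        ToyKmvSketch s = new ToyKmvSketch(1 << 10); // lgK = 10, K = 1024
        for (long i = 0; i < 100; i++) s.update(i);
        System.out.println("100 uniques: estimation mode = " + s.isEstimationMode()
                + ", estimate = " + s.estimate());
        for (long i = 0; i < 1_000_000; i++) s.update(i);
        System.out.println("1M uniques:  estimation mode = " + s.isEstimationMode()
                + ", estimate ~ " + Math.round(s.estimate()));
    }
}
```

This also shows why merge cost tracks sketch size: a union has to walk every retained hash, so estimation-mode sketches cost roughly K operations each, while a tiny exact-mode sketch with a handful of values is almost free - and why halving lgK quarters the work for fully populated sketches.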
Unioning sketches with a smaller K is faster, because the overall size of the sketches is smaller. Sketches with logK=10 have a maximum size 4 times smaller than logK=12. If most of your sketches are in estimation mode, this could mean a 4x difference in performance, but the difference will be smaller the more sketches you have in exact mode. The difference between logK=10 and logK=12 should be minimal if your sketches are in exact mode. If they are in estimation mode, they should give similar (but not exactly the same) estimates, but with different error bounds.

I've been doing some experimentation with the C++ version of Theta sketches and another database integration. I'm using logK 14 and 16, and seeing 100k+/sec when doing merge operations, but we definitely haven't finished optimizing that integration and haven't run it on production-quality hardware yet.

Will

<http://www.verizonmedia.com>
Will Lauer
Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering
M 508 561 6427
1908 S. First St, Champaign, IL 61822
<http://www.facebook.com/verizonmedia> <http://twitter.com/verizonmedia> <https://www.linkedin.com/company/verizon-media/> <http://www.instagram.com/verizonmedia>

On Fri, Jul 9, 2021 at 12:32 PM Matthew Farkas <[email protected]> wrote:

> Hi Will,
>
> Thanks for the quick response! For your questions:
>
> 1. Yup, looking at Theta sketches for set operations.
> 2. So I'm creating the initial sketches in dataflow like so, with K=4096 (so lgK=12) right now:
>    UpdateSketch userSketch = UpdateSketch.builder().build(K);
>    userSketch.update(requestValue.userId())
>    // pass to PG using ByteString.copyFrom(userSketch.compact().toByteArray());
> 3. By "sketch size", do you mean the number of uniques in each sketch? If so, there's a good bit of variance in sketch size, as I'm segmenting (by dimensions like demo, geo, etc.) users and saving a sketch for each segment.
> 4. I do not know the proportion that are in direct vs. estimation. (Admittedly, I'm not familiar with the differences there, will check it out.) Is this explicitly set? Or maybe determined based on K & sketch size.
>
> One thing I found interesting was that doing a `THETA_SKETCH_UNION(user_id_sketch, 10)` on all sketches vastly improved query time (70s to 6s), and produced the exact same results. I expected the results to be the same, since lgK=12 when originally creating the sketches, but I'm not sure why that would improve query time.
>
> Thanks again!
>
> *Matthew Z. Farkas*
> Data Science @ Spotify
> MS Northwestern University, BS Georgia Tech
> m: (770) 337-2709
> e: [email protected]
> <https://www.linkedin.com/in/matthewzfarkas>
>
> On Fri, Jul 9, 2021 at 1:13 PM Will Lauer <[email protected]> wrote:
>
>> Welcome Matt!
>>
>> One of the others is probably best qualified to answer your question, but I'll chime in early with a couple of questions. The performance of merging depends on many factors, including type of sketch and sketch size. I'm assuming from the link you posted that you are dealing with Theta sketches, for count unique operations. Can you confirm that? If so, what's the logK you are using? What is the sketch size? Do you happen to know what proportion of your sketches are in estimation mode vs exact mode?
>>
>> Will
>>
>> On Fri, Jul 9, 2021 at 12:02 PM Matthew Farkas <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> My name is Matt and I'm a data engineer at Spotify. I'm testing out trying Data Sketches with Postgres, and running into some performance issues. I'm seeing merge times much slower than what I'm seeing in the docs here <https://datasketches.apache.org/docs/Theta/ThetaMergeSpeed.html> (millions of sketches/sec).
>>>
>>> In my case, I've pre-computed many sketches, inserted them into PG, then I'm running queries in PG and doing the merging there. My hunch is that there's something wrong with my Postgres configs, which I've tried tweaking extensively but haven't been able to improve query time.
>>>
>>> My question is if anyone knows what type of performance can be expected in Postgres and if anyone has any examples/tips in general from their implementations.
>>>
>>> Also, this is my first message to this list, so please let me know if I should be directing it anywhere else!
>>>
>>> Thanks!!
>>> Matt
