Hi,

I'm running PG 13.3 and pg-datasketches 1.3.0 (I built from master after
running into this issue
<https://github.com/apache/datasketches-postgresql/issues/34>).

Some rough numbers: I have a week-hour table with 168 user_id sketches
(all in estimation mode, none exact), and unioning those 168 sketches takes
21ms.
- 13k sketches takes 1-2s
- 13M sketches was taking ~2min yesterday (I must have changed a config that
hurt this, though; I'm now cancelling the query after 9min)
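
For reference, the union query I'm timing is shaped roughly like this (table
and column names are simplified stand-ins; theta_sketch_union and
theta_sketch_get_estimate are the extension's aggregate/estimate functions, as
I understand them):

```sql
-- union the per-row sketches, then turn the merged sketch into a distinct count
SELECT theta_sketch_get_estimate(theta_sketch_union(user_id_sketch))
FROM week_hour_user_sketches;
```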

Will-
Thanks for the background. So you're combining the sketches in Java- are
you retrieving them from a db? Also, how many sketches are you typically
merging?



*Matthew Z. Farkas*

Data Science @ Spotify
MS Northwestern University, BS Georgia Tech

m: (770) 337-2709
e: [email protected]

<https://www.linkedin.com/in/matthewzfarkas>


On Fri, Jul 9, 2021 at 1:53 PM Alexander Saydakov
<[email protected]> wrote:

> Hi Matt,
> What version of PostgreSQL and DataSketches are you using?
> Could you give some numbers? How many sketches? How long does the union
> take?
>
> The graph you are referring to was based on performance in Druid, I
> believe, so it may or may not be transferable to PostgreSQL. We did not do
> a large-scale test in PostgreSQL.
>
> Also, we have a performance improvement in the works that should avoid some
> of the cost of deserializing Theta sketches. Preliminary testing suggests it
> might speed things up by 10-15%.
>
>
>
> On Fri, Jul 9, 2021 at 10:32 AM Matthew Farkas <[email protected]>
> wrote:
>
>> Hi Will,
>>
>> Thanks for the quick response! For your questions:
>>
>> 1. Yup, looking at Theta sketches for set operations.
>> 2. So I'm creating the initial sketches in dataflow like so, with K=4096
>> (so lgK=12) right now:
>>     UpdateSketch userSketch = UpdateSketch.builder().setNominalEntries(K).build();
>>     userSketch.update(requestValue.userId());
>>     // pass to PG using
>>     ByteString.copyFrom(userSketch.compact().toByteArray());
>> 3. By "sketch size", do you mean the number of uniques in each sketch? If
>> so, there's a good bit of variance, as I'm segmenting users (by dimensions
>> like demo, geo, etc.) and saving a sketch for each segment.
>> 4. I don't know the proportion that are in estimation vs. exact mode.
>> (Admittedly, I'm not familiar with the difference; I'll check it out.) Is
>> this explicitly set, or is it determined by K and the number of uniques?
>>
>> One thing I found interesting: doing a
>> `THETA_SKETCH_UNION(user_id_sketch, 10)` over all sketches vastly improved
>> query time (from 70s to 6s) and produced exactly the same results. I
>> expected the results to be the same, since lgK=12 when the sketches were
>> originally created, but I'm not sure why it would improve query time.
>>
>> Thanks again!
>>
>>
>>
>>
>>
>> On Fri, Jul 9, 2021 at 1:13 PM Will Lauer <[email protected]>
>> wrote:
>>
>>> Welcome Matt!
>>>
>>> One of the others is probably best qualified to answer your question,
>>> but I'll chime in early with a couple of questions. The performance of
>>> merging depends on many factors, including type of sketch and sketch size.
>>> I'm assuming from the link you posted that you are dealing with Theta
>>> sketches, for count unique operations. Can you confirm that? If so, what's
>>> the logK you are using? What is the sketch size? Do you happen to know what
>>> proportion of your sketches are in estimation mode vs exact mode?
>>>
>>> Will
>>>
>>> <http://www.verizonmedia.com>
>>>
>>> Will Lauer
>>>
>>> Senior Principal Architect, Audience & Advertising Reporting
>>> Data Platforms & Systems Engineering
>>>
>>> M 508 561 6427
>>> 1908 S. First St
>>> Champaign, IL 61822
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Jul 9, 2021 at 12:02 PM Matthew Farkas <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> My name is Matt and I'm a data engineer at Spotify. I'm trying out
>>>> DataSketches with Postgres and running into some performance issues. I'm
>>>> seeing merge times much slower than what's shown in the docs here
>>>> <https://datasketches.apache.org/docs/Theta/ThetaMergeSpeed.html>
>>>> (millions of sketches/sec).
>>>>
>>>> In my case, I've pre-computed many sketches, inserted them into PG, and
>>>> I'm running queries in PG and doing the merging there. My hunch is that
>>>> something is wrong with my Postgres configs, which I've tried tweaking
>>>> extensively, but I haven't been able to improve query time.
>>>>
>>>> My question is whether anyone knows what kind of performance to expect
>>>> in Postgres, and whether anyone has examples or general tips from their
>>>> own implementations.
>>>>
>>>> Also, this is my first message to this list, so please let me know if I
>>>> should be directing it anywhere else!
>>>>
>>>> Thanks!!
>>>> Matt
>>>>
>>>>
>>>>
>>>>
>>>
