Hi Will,

Thanks for the quick response! For your questions:

1. Yup, looking at Theta sketches for set operations.
2. So I'm creating the initial sketches in dataflow like so, with K=4096
(so lgK=12) right now:
    UpdateSketch userSketch = UpdateSketch.builder().build(K);
    userSketch.update(requestValue.userId())
    // pass to PG using
    ByteString.copyFrom(userSketch.compact().toByteArray());
3. By "sketch size", do you mean the number of uniques in each sketch? If
so, there's a good bit of variance in sketch size, as I'm segmenting (by
dimensions like demo, geo, etc.) users and saving a sketch for each segment.
4. I do not know the proportion that are in direct vs. estimation.
(Admittedly, I'm not familiar with the differences there, will check it
out.) Is this explicitly set? Or maybe determined based on K & sketch size.

One thing I found interesting was that doing a
`THETA_SKETCH_UNION(user_id_sketch, 10)` on all sketches vastly improved
query time (70s to 6s), and produced the exact same results. I expected the
results to be the same, since lgK=12 when originally creating the sketches,
but I'm not sure why that would improve query time.

Thanks again!



*Matthew Z. Farkas*

Data Science @ Spotify
MS Northwestern University, BS Georgia Tech

m: (770) 337-2709
e: [email protected]

<https://www.linkedin.com/in/matthewzfarkas>


On Fri, Jul 9, 2021 at 1:13 PM Will Lauer <[email protected]>
wrote:

> Welcome Matt!
>
> One of the others is probably best qualified to answer your question, but
> I'll chime in early with a couple of questions. The performance of merging
> depends on many factors, including type of sketch and sketch size. I'm
> assuming from the link you posted that you are dealing with Theta sketches,
> for count unique operations. Can you confirm that? If so, what's the logK
> you are using? What is the sketch size? Do you happen to know what
> proportion of your sketches are in estimation mode vs exact mode?
>
> Will
>
> <http://www.verizonmedia.com>
>
> Will Lauer
>
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
>
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
> <http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
>    <https://www.linkedin.com/company/verizon-media/>
> <http://www.instagram.com/verizonmedia>
>
>
>
> On Fri, Jul 9, 2021 at 12:02 PM Matthew Farkas <[email protected]>
> wrote:
>
>> Hi,
>>
>> My name is Matt and I'm a data engineer at Spotify. I'm testing out
>> trying Data Sketches with Postgres, and running into some
>> performance issues. I'm seeing merge times much slower than what I'm seeing
>> in the docs here
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__datasketches.apache.org_docs_Theta_ThetaMergeSpeed.html&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=wfXanJfFTJqpoX0hDe-0GzEkE5YndUaxQMI4dCAQM3c&s=R8BDffIXwyiZ46IUKowhz2-gQqGfpM3u-KkwplE4Ing&e=>
>>  (millions
>> of sketches/sec).
>>
>> In my case, I've pre-computed many sketches, inserted then into PG, then
>> I'm running queries in PG and doing the merging there. My hunch is that
>> there's something wrong with my Postgres configs, which I've tried tweaking
>> extensively but haven't been able to improve query time.
>>
>> My question is if anyone knows what type of performance can be expected
>> in Postgres and if anyone has any examples/tips in general from their
>> implementations.
>>
>> Also, this is my first message to this list, so please let me know if I
>> should be directing it anywhere else!
>>
>> Thanks!!
>> Matt
>>
>>
>>
>> *Matthew Z. Farkas*
>>
>> Data Science @ Spotify
>> MS Northwestern University, BS Georgia Tech
>>
>> m: (770) 337-2709
>> e: [email protected]
>>
>>
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_in_matthewzfarkas&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=wfXanJfFTJqpoX0hDe-0GzEkE5YndUaxQMI4dCAQM3c&s=WBAi_Zz2AI6QpCCX6AsWbHRrBwTG4JtAMLfzxzllOU4&e=>
>>
>

Reply via email to