Another enhancement to the sketching library might be to introduce generic
encodings for at least the major sketches (HLL, Bloom, Count-Min) that can
be translated to and from the major implementations. From talking with
Kenn, it sounds like zetasketch has the side benefit of using an encoding
compatible with BigQuery, but in general I think it would be nice to let
users store sketch payloads in, say, files, and be confident that they
would still be mergeable even if the underlying implementation of that
sketch changed.
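
As a rough illustration of the idea, here is a minimal sketch of an
implementation-neutral envelope for a dense HLL. Everything in it is
hypothetical: the class name PortableHll, the magic number, and the header
layout are made up for this example and do not correspond to the format
used by zetasketch, Datasketches, or any other library. It also assumes
that both sides hashed their inputs with the same function, which a real
format would have to pin down for merges to be meaningful.

```java
import java.nio.ByteBuffer;

/** Hypothetical implementation-neutral envelope for a dense HyperLogLog sketch. */
public final class PortableHll {
  // Invented constants for illustration only; not any library's wire format.
  private static final int MAGIC = 0x534B4348; // ASCII "SKCH"
  private static final byte KIND_HLL = 1;
  private static final byte FORMAT_VERSION = 1;
  private static final int HEADER_BYTES = 4 + 1 + 1 + 1;

  public final int precision;      // index bits p; the sketch has 2^p registers
  private final byte[] registers;  // one register per byte (max rank seen)

  public PortableHll(int precision, byte[] registers) {
    if (precision < 4 || precision > 18) {
      throw new IllegalArgumentException("precision out of range: " + precision);
    }
    if (registers.length != (1 << precision)) {
      throw new IllegalArgumentException("expected " + (1 << precision) + " registers");
    }
    this.precision = precision;
    this.registers = registers.clone();
  }

  public byte[] registers() {
    return registers.clone();
  }

  /** Writes header (magic, kind, version, precision) followed by the raw registers. */
  public byte[] encode() {
    ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + registers.length);
    buf.putInt(MAGIC).put(KIND_HLL).put(FORMAT_VERSION).put((byte) precision).put(registers);
    return buf.array();
  }

  /** Reads a payload written by encode(), rejecting anything it does not recognize. */
  public static PortableHll decode(byte[] payload) {
    ByteBuffer buf = ByteBuffer.wrap(payload);
    if (buf.remaining() < HEADER_BYTES || buf.getInt() != MAGIC) {
      throw new IllegalArgumentException("not a portable sketch payload");
    }
    if (buf.get() != KIND_HLL) {
      throw new IllegalArgumentException("not an HLL sketch");
    }
    if (buf.get() != FORMAT_VERSION) {
      throw new IllegalArgumentException("unsupported format version");
    }
    int precision = buf.get();
    byte[] registers = new byte[buf.remaining()];
    buf.get(registers);
    return new PortableHll(precision, registers);
  }

  /** Standard HLL merge: element-wise maximum of the registers. */
  public PortableHll merge(PortableHll other) {
    if (precision != other.precision) {
      throw new IllegalArgumentException("precision mismatch");
    }
    byte[] merged = new byte[registers.length];
    for (int i = 0; i < merged.length; i++) {
      merged[i] = (byte) Math.max(registers[i], other.registers[i]);
    }
    return new PortableHll(precision, merged);
  }
}
```

The point is that each implementation would only need a converter to and
from an envelope like this, so a payload written to a file could be decoded
and merged by whichever library Beam happens to use later.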

On Wed, Jan 18, 2023 at 11:50 AM Byron Ellis <byronel...@google.com> wrote:

> Thanks Luke, my plan was to mostly add ones that didn't already exist. I'd
> also add that there are other techniques (Max-Gumbel Reservoir Sampling, for
> example) that, as far as I know, aren't in any common library but that I
> happen to know how to implement, which might bias towards the general
> "sketching" library as you say. I generally agree that the implementation
> used should be a detail and not something relevant to users.
>
> On Wed, Jan 18, 2023 at 11:43 AM Luke Cwik <lc...@google.com> wrote:
>
>> I would suggest adding it to the existing package(s) (either
>> sdks/java/extensions or sdks/java/zetasketch, or both, depending on whether
>> you're replacing existing sketches or adding new ones) since we shouldn't
>> expose the sketching libraries' API surface. We should make the API take
>> all the relevant parameters since this allows us to move between variants
>> and choose the best sketching library.
>>
>> On Wed, Jan 18, 2023 at 11:24 AM Reuven Lax via dev <dev@beam.apache.org>
>> wrote:
>>
>>> I believe that when zetasketch was added, it was also noticeably more
>>> efficient than other sketch implementations. However, this was a number of
>>> years ago, and I don't know whether it still has an advantage or not.
>>>
>>> On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I was looking at adding at least a couple of the sketches from the
>>>> Apache Datasketches library to the Beam Java SDK and I was wondering if
>>>> folks had a preference for adding to the existing "sketching" extension vs
>>>> splitting it out into its own extension?
>>>>
>>>> The reason I ask is that there's some overlap (which already exists in
>>>> zetasketch) between the sketches available in Datasketches vs Beam today,
>>>> particularly HyperLogLog which would have 3 implementations if we were to
>>>> add all of them.
>>>>
>>>> I don't really have a strong opinion, though personally I'd probably
>>>> lean towards a single sketching extension (zetasketch being something of a
>>>> special case as it exists for format compatibility as far as I can tell).
>>>> But I could see how that could be confusing if you had both the Apache
>>>> Datasketches implementation and the existing implementation derived from
>>>> the clearspring implementations.
>>>>
>>>> Any thoughts?
>>>>
>>>> Best,
>>>> B
>>>>
>>>
