Thanks Luke, my plan was mostly to add ones that don't already exist. I'd
also add that there are other techniques (Max-Gumbel Reservoir Sampling, for
example) that, as far as I know, aren't in any common library but that I
happen to know how to implement, which might tilt things towards the general
"sketching" library, as you say. I generally agree that the implementation
used should be a detail and not something relevant to users.

On Wed, Jan 18, 2023 at 11:43 AM Luke Cwik <lc...@google.com> wrote:

> I would suggest adding it to the existing package(s) (either
> sdks/java/extensions or sdks/java/zetasketch or both, depending on whether
> you're replacing existing sketches or adding new ones) since we shouldn't
> expose the sketching libraries' API surface. We should make the API take all
> the relevant parameters since this allows us to move between variants and
> choose the best sketching library.
>
> On Wed, Jan 18, 2023 at 11:24 AM Reuven Lax via dev <dev@beam.apache.org>
> wrote:
>
>> I believe that when zetasketch was added, it was also noticeably more
>> efficient than other sketch implementations. However, this was a number of
>> years ago, and I don't know whether it still has an advantage or not.
>>
>> On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev <dev@beam.apache.org>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I was looking at adding at least a couple of the sketches from the
>>> Apache Datasketches library to the Beam Java SDK and I was wondering if
>>> folks had a preference for adding to the existing "sketching" extension vs
>>> splitting it out into its own extension?
>>>
>>> The reason I ask is that there's some overlap (which already exists in
>>> zetasketch) between the sketches available in Datasketches vs Beam today,
>>> particularly HyperLogLog which would have 3 implementations if we were to
>>> add all of them.
>>>
>>> I don't really have a strong opinion, though personally I'd probably
>>> lean towards a single sketching extension (zetasketch being something of a
>>> special case as it exists for format compatibility as far as I can tell).
>>> But I could see how that could be confusing if you had the Apache
>>> Datasketches implementation and the existing implementation derived from
>>> the clearspring implementations.
>>>
>>> Any thoughts?
>>>
>>> Best,
>>> B
>>>
>>
