My take: it is useful to isolate dependencies. So packages that are built on a specific external project, like Apache Datasketches, benefit from being in their own isolated module in Beam, separate from the Zetasketch-based package.
Having a generalized "sketching" package that abstracts away the details so that we can swap out implementations should be a third thing, independent of the others IMO, and could have some sort of plugin architecture. Doing so at this point would be overengineering. And like Byron brought up, a key aspect of sketches is that their serialized form is compatible, so the user really needs to know exactly which implementation they are using.

Kenn

On Wed, Jan 18, 2023 at 12:22 PM Byron Ellis via dev <dev@beam.apache.org> wrote:

> Another enhancement/modification to the sketching library might be to
> introduce generic encodings for at least the major sketches (HLL, Bloom,
> Count-Min) that can translate into the major implementations. Talking with
> Kenn it sounds like zetasketch has the side benefit of using an encoding
> compatible with BigQuery, but in general I think it would be a nice thing
> to let users store the sketch payload in, say, files that they could then
> be confident would still be mergeable even if the underlying implementation
> of that sketch changed.
>
> On Wed, Jan 18, 2023 at 11:50 AM Byron Ellis <byronel...@google.com>
> wrote:
>
>> Thanks Luke, my plan was to mostly add ones that didn't already exist.
>> I'd also add that there are other techniques (Max-Gumbel Reservoir Sampling
>> for example) that aren't in any common library so far as I know that I
>> happen to know how to implement which might bias towards the general
>> "sketching" library as you say. I generally agree that implementation used
>> should be a detail and not something relevant to users.
>>
>> On Wed, Jan 18, 2023 at 11:43 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> I would suggest adding it to the existing package(s) (either
>>> sdks/java/extensions or sdks/java/zetasketch or both depending on if you're
>>> replacing existing sketches or adding new ones) since we shouldn't expose
>>> sketching libraries' API surface. We should make the API take all the
>>> relevant parameters since this allows us to move between variants and
>>> choose the best sketching library.
>>>
>>> On Wed, Jan 18, 2023 at 11:24 AM Reuven Lax via dev <dev@beam.apache.org>
>>> wrote:
>>>
>>>> I believe that when zetasketch was added, it was also noticeably more
>>>> efficient than other sketch implementations. However this was a number of
>>>> years ago, and I don't know whether it still has an advantage or not.
>>>>
>>>> On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I was looking at adding at least a couple of the sketches from the
>>>>> Apache Datasketches library to the Beam Java SDK and I was wondering if
>>>>> folks had a preference for adding to the existing "sketching" extension vs
>>>>> splitting it out into its own extension?
>>>>>
>>>>> The reason I ask is that there's some overlap (which already exists in
>>>>> zetasketch) between the sketches available in Datasketches vs Beam today,
>>>>> particularly HyperLogLog which would have 3 implementations if we were to
>>>>> add all of them.
>>>>>
>>>>> I don't really have a strong opinion, though personally I'd probably
>>>>> lean towards a single sketching extension (zetasketch being something of a
>>>>> special case as it exists for format compatibility as far as I can tell).
>>>>> But I could see how that could be confusing if you had the Apache
>>>>> Datasketch implementation and the existing implementation derived from the
>>>>> clearspring implementations.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Best,
>>>>> B
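
As a rough illustration of the serialized-form concern raised in this thread, here is a minimal Java sketch of a payload that carries a tag naming the implementation that produced it, so that a merge across incompatible formats is refused instead of silently producing garbage. Everything in it is hypothetical: `SketchFormatDemo`, `TaggedSketch`, `SketchMerger`, `ToyMaxMerger` and the `Format` values are invented for this example and are not part of Beam, Zetasketch, or Datasketches. The one-byte tag is an assumed layout, and the toy merge rule (element-wise maximum, the way HyperLogLog registers are combined) only stands in for a real library's merge.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration only: none of these types exist in Beam
// or in any sketching library.
final class SketchFormatDemo {

  // Hypothetical tags naming the library that produced a payload.
  enum Format { ZETASKETCH, DATASKETCHES, CLEARSPRING }

  // A sketch payload that always travels together with its format tag.
  static final class TaggedSketch {
    final Format format;
    final byte[] payload;

    TaggedSketch(Format format, byte[] payload) {
      this.format = format;
      this.payload = payload.clone();
    }

    // Assumed layout: one tag byte followed by the untouched payload.
    byte[] toBytes() {
      return ByteBuffer.allocate(1 + payload.length)
          .put((byte) format.ordinal())
          .put(payload)
          .array();
    }

    static TaggedSketch fromBytes(byte[] bytes) {
      Format format = Format.values()[bytes[0]];
      return new TaggedSketch(format, Arrays.copyOfRange(bytes, 1, bytes.length));
    }
  }

  // Plugin point: each implementation merges only payloads in its own format.
  interface SketchMerger {
    Format format();

    byte[] merge(byte[] left, byte[] right);
  }

  // Toy stand-in for a real library: merges by element-wise maximum.
  // Assumes small non-negative register values.
  static final class ToyMaxMerger implements SketchMerger {
    private final Format format;

    ToyMaxMerger(Format format) {
      this.format = format;
    }

    @Override
    public Format format() {
      return format;
    }

    @Override
    public byte[] merge(byte[] left, byte[] right) {
      if (left.length != right.length) {
        throw new IllegalArgumentException("register counts differ");
      }
      byte[] out = new byte[left.length];
      for (int i = 0; i < out.length; i++) {
        out[i] = (byte) Math.max(left[i], right[i]);
      }
      return out;
    }
  }

  // Refuses to merge unless both sketches and the merger share one format.
  static TaggedSketch merge(TaggedSketch a, TaggedSketch b, SketchMerger merger) {
    if (a.format != b.format || a.format != merger.format()) {
      throw new IllegalArgumentException(
          "cannot merge " + a.format + " with " + b.format
              + " using a " + merger.format() + " merger");
    }
    return new TaggedSketch(a.format, merger.merge(a.payload, b.payload));
  }

  public static void main(String[] args) {
    SketchMerger zeta = new ToyMaxMerger(Format.ZETASKETCH);
    TaggedSketch a = new TaggedSketch(Format.ZETASKETCH, new byte[] {1, 5, 2});
    // Round-trip the second sketch through its serialized form first.
    TaggedSketch b = TaggedSketch.fromBytes(
        new TaggedSketch(Format.ZETASKETCH, new byte[] {4, 3, 2}).toBytes());
    System.out.println(Arrays.toString(merge(a, b, zeta).payload)); // [4, 5, 2]

    TaggedSketch other = new TaggedSketch(Format.DATASKETCHES, new byte[] {0, 0, 0});
    try {
      merge(a, other, zeta);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The tag does not make formats interchangeable. It only makes the implementation visible at merge time, which is the part users need to know about. A generic encoding of the kind Byron describes would additionally need translators between formats, which this sketch does not attempt.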