Thanks Luke, my plan was mostly to add ones that don't already exist. I'd also add that there are other techniques (Max-Gumbel Reservoir Sampling, for example) that, as far as I know, aren't in any common library but that I happen to know how to implement, which might bias things towards the general "sketching" library, as you say. I generally agree that the implementation used should be a detail and not something relevant to users.
On Wed, Jan 18, 2023 at 11:43 AM Luke Cwik <lc...@google.com> wrote:

> I would suggest adding it to the existing package(s) (either
> sdks/java/extensions or sdks/java/zetasketch or both depending on if you're
> replacing existing sketches or adding new ones) since we shouldn't expose
> sketching libraries API surface. We should make the API take all the
> relevant parameters since this allows us to move between variants and
> choose the best sketching library.
>
> On Wed, Jan 18, 2023 at 11:24 AM Reuven Lax via dev <dev@beam.apache.org>
> wrote:
>
>> I believe that when zetasketch was added, it was also noticeably more
>> efficient than other sketch implementations. However this was a number of
>> years ago, and I don't know whether it still has an advantage or not.
>>
>> On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev <dev@beam.apache.org>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I was looking at adding at least a couple of the sketches from the
>>> Apache Datasketches library to the Beam Java SDK and I was wondering if
>>> folks had a preference for adding to the existing "sketching" extension vs
>>> splitting it out into its own extension?
>>>
>>> The reason I ask is that there's some overlap (which already exists in
>>> zetasketch) between the sketches available in Datasketches vs Beam today,
>>> particularly HyperLogLog which would have 3 implementations if we were to
>>> add all of them.
>>>
>>> I don't really have a strong opinion, though personally I'd probably
>>> lean towards a single sketching extension (zetasketch being something of a
>>> special case as it exists for format compatibility as far as I can tell).
>>> But I could see how that could be confusing if you had the Apache
>>> Datasketch implementation and the existing implementation derived from the
>>> clearspring implementations.
>>>
>>> Any thoughts?
>>>
>>> Best,
>>> B
>>>
>>