> Can you please add this to the design documents webpage.
> https://beam.apache.org/contribute/design-documents/
>
Thanks for the reminder. Done! (https://github.com/apache/beam/pull/8947)

> I am not sure if this feature should go into 'sdks/java/core' because
> it seems quite a specific case, maybe it should go in the sketching
> module so it can be easier to find,

Adding it to a separate module under `extensions` sounds good to me.

> or maybe in its own extension if
> the 'mix' of dependencies may be an issue and then make this
> dependency a requirement for the gcp module since I suppose the
> ultimate goal is to integrate it there.
>
I guess we can shade the dependencies of ZetaSketch if they create a problem when integrated with Beam. But I would not tie it to a gcp module, since I think it will be a useful feature regardless of whether users run it on GCP or not (although if run on GCP, it will get better integration with BigQuery).

On Mon, Jun 24, 2019 at 1:55 PM Ismaël Mejía <[email protected]> wrote:

> Thanks for bringing this up, Robin,
>
> Can you please add this to the design documents webpage.
> https://beam.apache.org/contribute/design-documents/
>
> Left some comments in the doc. It is great that this is finally open
> and even better that it becomes part of Beam.
>
> I am not sure if this feature should go into 'sdks/java/core' because
> it seems quite a specific case, maybe it should go in the sketching
> module so it can be easier to find, or maybe in its own extension if
> the 'mix' of dependencies may be an issue and then make this
> dependency a requirement for the gcp module since I suppose the
> ultimate goal is to integrate it there.
>
> CC [email protected] original author of the sketching
> library who may be interested in this.
>
>
> On Mon, Jun 24, 2019 at 9:31 PM Rui Wang <[email protected]> wrote:
> >
> > Thanks Robin! It would also be interesting if we could offer HLL_COUNT
> > functions in BeamSQL based on your proposal!
> >
> >
> > -Rui
> >
> > On Mon, Jun 24, 2019 at 10:47 AM Robin Qiu <[email protected]> wrote:
> >>
> >> Hi all,
> >>
> >> I have written a doc proposing we integrate the HyperLogLog++ algorithm
> >> into Beam as a new combiner. The algorithm solves the count-distinct
> >> problem, and the intermediate sketch (aggregator) format will be compatible
> >> with sketches computed via the HLL_COUNT functions in Google Cloud BigQuery
> >> (because they will be based on the same implementation: ZetaSketch). The
> >> tracking JIRA issue is BEAM-7013.
> >>
> >> The API design proposed in the doc is subject to change and open to
> >> comments. Please feel free to comment if you have any thoughts.
> >>
> >> Cheers,
> >> Robin
