Quick update: the PR implementing this feature has been sent out (https://github.com/apache/beam/pull/9144). The design doc has also been revamped to reflect the design decisions we have made.

On Tue, Jun 25, 2019 at 2:05 PM Robin Qiu <robi...@google.com> wrote:

>> Can you please add this to the design documents webpage.
>> https://beam.apache.org/contribute/design-documents/
>
> Thanks for the reminder. Done! (https://github.com/apache/beam/pull/8947)
>
>> I am not sure if this feature should go into 'sdks/java/core' because
>> it seems quite a specific case, maybe it should go in the sketching
>> module so it can be easier to find,
>
> Adding it to a separate module under `extensions` sounds good to me.
>
>> or maybe in its own extension if
>> the 'mix' of dependencies may be an issue and then make this
>> dependency a requirement for the gcp module since I suppose the
>> ultimate goal is to integrate it there.
>
> I guess we can shade dependencies of ZetaSketch if it creates a problem
> when integrated with Beam. But I would not relate it to a gcp module since
> I think it will be a useful feature regardless of whether users run it on
> GCP or not (although if run on GCP, it will get better integration with
> BigQuery).
>
> On Mon, Jun 24, 2019 at 1:55 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> Thanks for bringing this up, Robin,
>>
>> Can you please add this to the design documents webpage.
>> https://beam.apache.org/contribute/design-documents/
>>
>> Left some comments in the doc. It is great that this is finally open
>> and even better that it becomes part of Beam.
>>
>> I am not sure if this feature should go into 'sdks/java/core' because
>> it seems quite a specific case, maybe it should go in the sketching
>> module so it can be easier to find, or maybe in its own extension if
>> the 'mix' of dependencies may be an issue and then make this
>> dependency a requirement for the gcp module since I suppose the
>> ultimate goal is to integrate it there.
>>
>> CC +arnaudfournier...@gmail.com original author of the sketching
>> library who may be interested in this.
>>
>> On Mon, Jun 24, 2019 at 9:31 PM Rui Wang <ruw...@google.com> wrote:
>> >
>> > Thanks Robin! It would also be interesting if we could offer HLL_COUNT
>> > functions in BeamSQL based on your proposal!
>> >
>> > -Rui
>> >
>> > On Mon, Jun 24, 2019 at 10:47 AM Robin Qiu <robi...@google.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I have written a doc proposing we integrate the HyperLogLog++
>> >> algorithm into Beam as a new combiner. The algorithm solves the
>> >> count-distinct problem, and the intermediate sketch (aggregator) format
>> >> will be compatible with sketches computed via the HLL_COUNT functions in
>> >> Google Cloud BigQuery (because they will be based on the same
>> >> implementation: ZetaSketch). The tracking JIRA issue is BEAM-7013.
>> >>
>> >> The API design proposed in the doc is subject to change and open to
>> >> comments. Please feel free to comment if you have any thoughts.
>> >>
>> >> Cheers,
>> >> Robin
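
To make the combiner idea in the proposal concrete, here is a minimal sketch in Python of a mergeable count-distinct sketch. It implements plain HyperLogLog with a small-range linear-counting correction, not HyperLogLog++, and it is neither ZetaSketch nor the Beam API, so its register layout is not compatible with BigQuery's HLL_COUNT format. The names create_sketch, add_input, merge_sketches, and extract_count, the precision P = 12, and the SHA-1 based hash are all assumptions chosen for illustration; they only loosely mirror the create/add/merge/extract steps of a combiner.

```python
import hashlib
import math

P = 12        # precision (assumed for illustration): 2**P registers
M = 1 << P    # number of registers


def _hash64(value: str) -> int:
    """Deterministic 64-bit hash (illustrative choice: first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def create_sketch() -> list:
    """Empty accumulator: one small counter per register."""
    return [0] * M


def add_input(sketch: list, value: str) -> list:
    """Fold one element into the sketch."""
    h = _hash64(value)
    idx = h >> (64 - P)                       # top P bits select the register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # leading zeros in `rest`, plus one
    sketch[idx] = max(sketch[idx], rank)
    return sketch


def merge_sketches(a: list, b: list) -> list:
    """Merging is a register-wise max, so it is associative and commutative."""
    return [max(x, y) for x, y in zip(a, b)]


def extract_count(sketch: list) -> int:
    """Estimate the number of distinct elements seen by the sketch."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction: fall back to linear counting.
        return round(M * math.log(M / zeros))
    return round(raw)


# Two workers sketch overlapping shards; only the sketches are combined.
shard_a = create_sketch()
for i in range(0, 6000):
    add_input(shard_a, f"user-{i}")

shard_b = create_sketch()
for i in range(4000, 10000):
    add_input(shard_b, f"user-{i}")

merged = merge_sketches(shard_a, shard_b)
print(extract_count(merged))  # an estimate of the 10000 distinct users
```

Because the merge step never needs the raw elements, partial sketches can be computed independently and combined later, which is the property that makes this kind of algorithm a natural fit for a combiner.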