>
> Can you please add this to the design documents webpage.
> https://beam.apache.org/contribute/design-documents/
>

Thanks for the reminder. Done! (https://github.com/apache/beam/pull/8947)


> I am not sure if this feature should go into 'sdks/java/core' because
> it seems a quite specific case, maybe it should go in the sketching
> module so it can be easier to find,


Adding it to a separate module under `extensions` sounds good to me.


> or maybe in its own extension if
> the 'mix' of dependencies may be an issue and then make this
> dependency a requirement for the gcp module since I suppose the
> ultimate goal is to integrate it there.
>

I guess we can shade ZetaSketch's dependencies if they cause a problem
when integrated with Beam. But I would not tie it to the gcp module, since
I think it will be a useful feature regardless of whether users run it on
GCP or not (although if run on GCP, it will get better integration with
BigQuery).

On Mon, Jun 24, 2019 at 1:55 PM Ismaël Mejía <[email protected]> wrote:

> Thanks for bringing this up, Robin.
>
> Can you please add this to the design documents webpage.
> https://beam.apache.org/contribute/design-documents/
>
> Left some comments in the doc. It is great that this is finally open
> and even better that it becomes part of Beam.
>
> I am not sure if this feature should go into 'sdks/java/core' because
> it seems a quite specific case, maybe it should go in the sketching
> module so it can be easier to find, or maybe in its own extension if
> the 'mix' of dependencies may be an issue and then make this
> dependency a requirement for the gcp module since I suppose the
> ultimate goal is to integrate it there.
>
> CC [email protected], original author of the sketching
> library, who may be interested in this.
>
>
> On Mon, Jun 24, 2019 at 9:31 PM Rui Wang <[email protected]> wrote:
> >
> > Thanks Robin! It would also be interesting if we could offer HLL_COUNT
> > functions in BeamSQL based on your proposal!
> >
> >
> > -Rui
> >
> > On Mon, Jun 24, 2019 at 10:47 AM Robin Qiu <[email protected]> wrote:
> >>
> >> Hi all,
> >>
> >> I have written a doc proposing we integrate the HyperLogLog++ algorithm
> >> into Beam as a new combiner. The algorithm solves the count-distinct
> >> problem, and the intermediate sketch (aggregator) format will be compatible
> >> with sketches computed via the HLL_COUNT functions in Google Cloud BigQuery
> >> (because they will be based on the same implementation: ZetaSketch). The
> >> tracking JIRA issue is BEAM-7013.
> >>
> >> The API design proposed in the doc is subject to change and open to
> >> comments. Please feel free to comment if you have any thoughts.
> >>
> >> Cheers,
> >> Robin
>
