Thanks Robin! It would also be interesting if we could offer HLL_COUNT functions in BeamSQL based on your proposal!
-Rui

On Mon, Jun 24, 2019 at 10:47 AM Robin Qiu <[email protected]> wrote:

> Hi all,
>
> I have written a doc
> <https://docs.google.com/document/d/1U5aXdC9lDSOqT6FPHRulp-EutYiQ9KeHpgu-19CIfEI>
> proposing we integrate the HyperLogLog++ algorithm
> <http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf>
> into Beam as a new combiner. The algorithm solves the count-distinct
> problem <https://en.wikipedia.org/wiki/Count-distinct_problem>, and the
> intermediate sketch (aggregator) format will be compatible with sketches
> computed via the HLL_COUNT functions
> <https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions>
> in Google Cloud BigQuery (because they will be based on the same
> implementation: ZetaSketch <https://github.com/google/zetasketch>). The
> tracking JIRA issue is BEAM-7013
> <https://issues.apache.org/jira/browse/BEAM-7013>.
>
> The API design proposed in the doc is subject to change and open to
> comments. Please feel free to comment if you have any thoughts.
>
> Cheers,
> Robin
