Hi all,

I have written a doc
<https://docs.google.com/document/d/1U5aXdC9lDSOqT6FPHRulp-EutYiQ9KeHpgu-19CIfEI>
proposing we integrate the HyperLogLog++ algorithm
<http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf>
into Beam as a new combiner. The algorithm solves the count-distinct problem
<https://en.wikipedia.org/wiki/Count-distinct_problem>, and the
intermediate sketch (aggregator) format will be compatible with sketches
computed via the HLL_COUNT functions
<https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions>
in Google Cloud BigQuery (because they will be based on the same
implementation: ZetaSketch <https://github.com/google/zetasketch>). The
tracking JIRA issue is BEAM-7013
<https://issues.apache.org/jira/browse/BEAM-7013>.

The API design proposed in the doc is subject to change, so please feel free
to comment on it if you have any thoughts.

Cheers,
Robin
