Hi all,

I have written a doc <https://docs.google.com/document/d/1U5aXdC9lDSOqT6FPHRulp-EutYiQ9KeHpgu-19CIfEI> proposing that we integrate the HyperLogLog++ algorithm <http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf> into Beam as a new combiner. The algorithm solves the count-distinct problem <https://en.wikipedia.org/wiki/Count-distinct_problem>. The intermediate sketch (aggregator) format will be compatible with sketches computed via the HLL_COUNT functions <https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions> in Google Cloud BigQuery, because both will be based on the same implementation, ZetaSketch <https://github.com/google/zetasketch>. The tracking JIRA issue is BEAM-7013 <https://issues.apache.org/jira/browse/BEAM-7013>.

The API design proposed in the doc is subject to change, so please feel free to comment on it if you have any thoughts.

Cheers,
Robin
