Thanks Robin! It would also be interesting if we could offer HLL_COUNT
functions in BeamSQL based on your proposal!


-Rui

On Mon, Jun 24, 2019 at 10:47 AM Robin Qiu <[email protected]> wrote:

> Hi all,
>
> I have written a doc
> <https://docs.google.com/document/d/1U5aXdC9lDSOqT6FPHRulp-EutYiQ9KeHpgu-19CIfEI>
> proposing we integrate the HyperLogLog++ algorithm
> <http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf>
> into Beam as a new combiner. The algorithm solves the count-distinct
> problem <https://en.wikipedia.org/wiki/Count-distinct_problem>, and the
> intermediate sketch (aggregator) format will be compatible with sketches
> computed via the HLL_COUNT functions
> <https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions>
> in Google Cloud BigQuery (because they will be based on the same
> implementation: ZetaSketch <https://github.com/google/zetasketch>). The
> tracking JIRA issue is BEAM-7013
> <https://issues.apache.org/jira/browse/BEAM-7013>.
>
> The API design proposed in the doc is subject to change and open to
> comments. Please feel free to comment if you have any thoughts.
>
> Cheers,
> Robin
>
