Hi everyone;

TL/DR : Discussion on Beam's various Approximate Distinct Count algorithms.

Today there are several options for Approximate Algorithms in Apache Beam
2.16 with HLLCount being the most recently added. Would like to canvas
opinions here on the possibility of rationalizing these API's by removing
obsolete / less efficient implementations.
The current situation:

There are three options available to users: ApproximateUnique.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
ApproximateDistinct.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
and HllCount.java
<https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
A quick summary of these API's as I understand them:

HllCount.java
<https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
Marked as @Experimental

PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data streams
based on the ZetaSketch <https://github.com/google/zetasketch>
implementation.Detailed design of this class, see
https://s.apache.org/hll-in-beam.

ApproximateUnique.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
Not Marked with experimental

This API does not expose the ability to create sketches so it's not
suitable for the OLAP use case that HLL++ is geared towards (do
pre-aggregation into sketches to allow interactive query speeds). It's also
less precise for the same amount of memory used: the error bounds in the
doc comments give :

/* The error is about

{@code 2 * / sqrt(sampleSize)},) */

Compared to the default HLLCount sketch size, its error is 10X larger than
the HLL++ error.

ApproximateDistinct.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
Marked with @Experimental

This is a re-implementation of the HLL++ algorithm, based on the paper
published in 2013. It is exposing sketches via a HyperLogLogPlusCoder. We
have not run any benchmarks to compare this implementation compared to the
HLLCount and we need to be careful to ensure that if we were to change any
of these API's that the binary format of the sketches should never change,
there could be users who have stored previous sketches using
ApproximateDistinct and it will be important to try and ensure they do not
have a bad experience.


Proposals:

There are two classes of users expected for these algorithms:

1) Users who simply use the transform to estimate the size of their data
set in Beam

2) Users who want to create sketches and store them, either for
interoperability with other systems, or as features to be used in further
data processing.



For use case 1, it is possible to make use of naming which does not expose
the implementation, however for use case 2 it is important for the
implementation to be explicit as sketches produced with one implementation
will not work with other implementations.

ApproximateUnique.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
:

This one does not provide sketches and based on the notes above, is not as
efficient as HLLCount. However it does have a very searchable name and is
likely to be the one users will gravitate to when searching for Approximate
unique algorithms but do not need the capabilities of sketches.

Ideally we should think about switching the implementation of this
transform to wrap HLLCount. However this could mean changing things in a
way which is not transparent to the end developer.  Although as a result of
the change they would get a better implementation for free on an upgrade :-)

Another option would be to mark this transform as @Deprecated and create a
new transform ApproximateCountDistinct which would wrap HLLCount. The name
is also much clearer.

ApproximateDistinct.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>

This transform does generate sketches as output and given its marked as
@Experimental, one option we would have is to create a name which includes
the algorithm implementation details, for example
ApproximateCountDistinctClearSpring.



HllCount.java
<https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
.

Again we have a few options here, as the name does not include search words
like Approximate, we can create a wrapper which has a name similar to
ApproximateCountDistinctZetaSketch.



Thoughts?



Cheers



Reza





-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.

Reply via email to