Monica Song created BEAM-10824:
----------------------------------
Summary: Hash in stats.ApproximateUniqueCombineFn NON-deterministic
Key: BEAM-10824
URL: https://issues.apache.org/jira/browse/BEAM-10824
Project: Beam
Issue Type: Bug
Components: beam-model
Reporter: Monica Song
Python's built-in hash() is not deterministic across processes: for str and
bytes values it is salted with a random per-process seed unless PYTHONHASHSEED
is fixed. As a result, different workers map identical values to different
hashes, which leads to overestimating the number of unique values in a
distributed processing model (by several orders of magnitude; roughly 1000x in
my experience).
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L218]
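
The behaviour described above can be reproduced outside Beam. The following is a minimal sketch, not the Beam code or its fix: it simulates two workers by starting fresh interpreters with different PYTHONHASHSEED values (real workers would normally each pick a random seed), and compares hash() with a digest-based alternative. The helper names builtin_hash_in_subprocess and stable_hash are hypothetical, and the choice of SHA-256 truncated to 64 bits is an assumption made for illustration.

```python
import hashlib
import os
import subprocess
import sys


def builtin_hash_in_subprocess(value: str, seed: str) -> int:
    """Return hash(value) as computed by a fresh interpreter.

    Hypothetical helper: each call stands in for a separate worker process
    whose string hashing is salted with the given PYTHONHASHSEED.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", f"print(hash({value!r}))"], env=env)
    return int(out)


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (assumption: first 8 bytes of SHA-256).

    Depends only on the bytes of the value, so every process agrees on it.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


if __name__ == "__main__":
    worker_a = builtin_hash_in_subprocess("beam", "1")
    worker_b = builtin_hash_in_subprocess("beam", "2")
    # The two simulated workers disagree on the hash of the same string.
    print("hash() on worker A:", worker_a)
    print("hash() on worker B:", worker_b)
    # The digest-based hash is the same wherever it is computed.
    print("stable_hash:", stable_hash("beam"))
```

Because the estimator treats each distinct hash as evidence of a distinct element, one value hashed on N workers with N different seeds can look like up to N different values, which is consistent with the overestimation reported here.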