Monica Song created BEAM-10824:
----------------------------------

             Summary: Hash in stats.ApproximateUniqueCombineFn NON-deterministic
                 Key: BEAM-10824
                 URL: https://issues.apache.org/jira/browse/BEAM-10824
             Project: Beam
          Issue Type: Bug
          Components: beam-model
            Reporter: Monica Song


Python's built-in hash() function is not deterministic across processes: for 
str and bytes values it is salted per interpreter process unless PYTHONHASHSEED 
is fixed. As a result, different workers map identical values to different 
hashes. In a distributed processing model this leads to overestimating the 
number of unique values by several orders of magnitude (roughly 1000x in my 
experience).
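
A minimal sketch of the behavior described above, assuming CPython 3 with hash randomization enabled (the default). The helper names builtin_hash_in_new_process and stable_hash are hypothetical and are not part of Beam. The digest-based alternative is only one possible deterministic replacement, not necessarily the fix Beam adopts.

```python
import hashlib
import os
import subprocess
import sys


def builtin_hash_in_new_process(value: str, seed: str) -> int:
    """Return hash(value) as computed by a fresh interpreter.

    Each call simulates a separate worker process whose string hashing
    is salted with the given PYTHONHASHSEED.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", value],
        env=env,
    )
    return int(out)


def stable_hash(value: str) -> int:
    """Hypothetical deterministic alternative: first 8 bytes of a SHA-256 digest.

    The result depends only on the input bytes, so every worker agrees on it.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


if __name__ == "__main__":
    # Two "workers" with different seeds disagree on the built-in hash ...
    print(builtin_hash_in_new_process("beam", "1"))
    print(builtin_hash_in_new_process("beam", "2"))
    # ... but agree on the digest-based hash.
    print(stable_hash("beam"))
```

With the built-in hash, the same element seen on two workers looks like two distinct values to the estimator, which is consistent with the overestimation reported here.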

[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L218]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
