[ 
https://issues.apache.org/jira/browse/BEAM-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186731#comment-17186731
 ] 

Valentyn Tymofieiev commented on BEAM-10824:
--------------------------------------------

Related: https://issues.apache.org/jira/browse/BEAM-7525

We originally used mmh3, but reverted to the default hash function without 
realizing the consequences for distributed execution; see 
https://github.com/apache/beam/pull/8799/.

AFAIK the mmh3 dependency did not install cleanly on some Windows machines. We 
can check whether this is still the case now that precommit tests run on 
Windows for every PR.

We can also pick a different hash function that is deterministic. 
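
As a rough illustration of what "a different hash function that is
deterministic" could look like, here is a minimal sketch built only on the
standard library's hashlib. The helper name stable_hash64 and the choice of
BLAKE2b truncated to 64 bits are assumptions made for this example; this is
not what stats.py does, and a real fix would also need to settle how
arbitrary elements are encoded to bytes.

```python
import hashlib


def stable_hash64(value: str) -> int:
    """Hypothetical helper: map a string to a 64-bit integer.

    Unlike the built-in hash(), the result does not depend on
    PYTHONHASHSEED, so every worker process computes the same value.
    """
    # Encode explicitly so the input bytes are identical on every worker.
    data = value.encode("utf-8")
    # BLAKE2b supports short digests directly; 8 bytes gives 64 bits.
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big")


if __name__ == "__main__":
    print(stable_hash64("beam"))
```

Because the digest depends only on the input bytes, the value is the same
across processes, machines, and operating systems, and it needs no extra
dependency to install.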

> Hash in stats.ApproximateUniqueCombineFn NON-deterministic
> ----------------------------------------------------------
>
>                 Key: BEAM-10824
>                 URL: https://issues.apache.org/jira/browse/BEAM-10824
>             Project: Beam
>          Issue Type: Bug
>          Components: beam-model
>            Reporter: Monica Song
>            Priority: P1
>              Labels: hash
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Python's built-in hash() function is not deterministic across processes: for 
> str and bytes it is salted per interpreter process unless PYTHONHASHSEED is 
> fixed. As a result, different workers map identical values to different 
> hashes. This leads to overestimating the number of unique values in a 
> distributed processing model (by several orders of magnitude; in my 
> experience, roughly 1000x). 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L218]
>  
>  
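
The behaviour described in the issue can be reproduced outside Beam with a
small sketch that evaluates hash() in separate interpreter processes. The
helper name hash_in_fresh_process is made up for this example, and the seeds
are set explicitly through PYTHONHASHSEED only to make the difference
reproducible; real workers normally each pick a random seed.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(value: str, seed: str) -> int:
    """Hypothetical helper: return hash(value) as computed by a new interpreter.

    `seed` is passed through PYTHONHASHSEED, standing in for the random
    per-process salt that each worker would otherwise choose on startup.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Same string, two simulated workers with different salts.
    print(hash_in_fresh_process("beam", "1"))
    print(hash_in_fresh_process("beam", "2"))
```

Two processes with different salts report different hashes for the same
string, which is why per-worker samples of "smallest hashes" cannot be merged
meaningfully.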



--
This message was sent by Atlassian Jira
(v8.3.4#803005)