[
https://issues.apache.org/jira/browse/BEAM-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186731#comment-17186731
]
Valentyn Tymofieiev commented on BEAM-10824:
--------------------------------------------
Related: https://issues.apache.org/jira/browse/BEAM-7525
We originally used mmh3, but reverted to the default hash function without
realizing the consequences for distributed execution:
https://github.com/apache/beam/pull/8799/.
AFAIK the mmh3 dependency did not install cleanly on some Windows machines.
We can check whether this is still the case now that we have precommit tests
on Windows running on every PR.
We can also pick a different hash function that is deterministic.
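A minimal sketch of what such a deterministic hash could look like, using only
the standard library so there is no extra dependency to install. The name
stable_hash64 is hypothetical (not an existing Beam function), and the sketch
assumes elements have already been reduced to str or bytes; SHA-256 is just
one possible choice here, not necessarily what the fix should use.

```python
import hashlib


def stable_hash64(value):
    """Return a 64-bit hash that is the same in every Python process.

    Hypothetical helper for illustration only. Unlike the built-in hash(),
    the result does not depend on per-process hash randomization.
    """
    if isinstance(value, str):
        data = value.encode("utf-8")
    elif isinstance(value, bytes):
        data = value
    else:
        # Assumption: callers encode other element types to bytes first.
        raise TypeError("stable_hash64 expects str or bytes")
    digest = hashlib.sha256(data).digest()
    # Keep the first 8 bytes so the result fits in an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")
```

Because the result depends only on the input bytes, two workers hashing the
same element get the same value, which is what the estimator needs.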
> Hash in stats.ApproximateUniqueCombineFn NON-deterministic
> ----------------------------------------------------------
>
> Key: BEAM-10824
> URL: https://issues.apache.org/jira/browse/BEAM-10824
> Project: Beam
> Issue Type: Bug
> Components: beam-model
> Reporter: Monica Song
> Priority: P1
> Labels: hash
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Python's built-in hash() function is not deterministic across processes
> (str and bytes hashes are salted per interpreter). As a result, different
> workers map identical values to different hashes. In a distributed
> processing model, this leads to overestimating the number of unique values
> by several orders of magnitude (roughly 1000x in my experience).
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L218]
>
>