[
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=242897&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242897
]
ASF GitHub Bot logged work on BEAM-6693:
----------------------------------------
Author: ASF GitHub Bot
Created on: 15/May/19 22:15
Start Date: 15/May/19 22:15
Worklog Time Spent: 10m
Work Description: Hannah-Jiang commented on pull request #8535:
[BEAM-6693] ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#discussion_r284472831
##########
File path: sdks/python/setup.py
##########
@@ -125,6 +125,7 @@ def get_version():
'pyvcf>=0.6.8,<0.7.0; python_version < "3.0"',
'pyyaml>=3.12,<4.0.0',
'typing>=3.6.0,<3.7.0; python_version < "3.5.0"',
+ 'mmh3>=2.5.1; python_version >= "2.7"',
Review comment:
Thanks for asking this question. I was able to dig into it more and found an
unexpected reason.
Starting with Python 3.4, the built-in hash uses the SipHash algorithm; before
that it used FNV. Although SipHash reduces collisions, I don't think that
difference is noticeable in my tests.
So I looked at how the hash values are distributed. The graphs at
https://docs.google.com/spreadsheets/d/1QpNAorUhUY1Nq3b4QRMfbB9Rupq3ZC9FshunIyRRfH0/edit#gid=0
show how the hash values are distributed for the different hash algorithms
given exactly the same data set. The built-in hash function in Python 2 is not
as uniformly distributed as the others (the Python 3 built-in hash and mmh3).
Although the graphs don't make it obvious that mmh3 hash values are more
uniformly distributed than those from the Python 3 built-in hash, the test
results show that mmh3 produces the more uniform distribution.
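As a rough illustration (my own sketch, not code from the PR or the
spreadsheet), the kind of comparison shown in those graphs can be reproduced by
bucketing normalized hash values from the built-in hash and from mmh3 over the
same data set:

    import collections
    import mmh3

    data = [str(i) for i in range(10000)]
    num_buckets = 20

    def bucket_counts(hash_fn):
      counts = collections.Counter()
      for value in data:
        # Normalize the lower 32 bits of the hash to [0, 1) and bucket it.
        normalized = (hash_fn(value) & 0xFFFFFFFF) / float(2 ** 32)
        counts[int(normalized * num_buckets)] += 1
      return [counts[b] for b in range(num_buckets)]

    print(bucket_counts(hash))       # built-in hash
    print(bucket_counts(mmh3.hash))  # MurmurHash3 via the mmh3 package

A flatter set of bucket counts corresponds to a more uniform distribution.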
Why does this matter for approximate unique counting? We build a sample space
from the hash values and then estimate the population size from that sample
space. More uniformly distributed hash values give a more accurate sample
space, and therefore a more accurate population estimate. The difference
becomes more obvious as the input contains more duplicate records.
One test case generates 10000 elements in the range [0, 1000]. It consistently
fails with the built-in hash function on Python 2.7, fails intermittently with
the built-in hash function on Python 3, and has not failed with mmh3.
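For context, here is a minimal sketch of the sampling idea described above (a
generic k-minimum-values style estimator, not the actual implementation in this
PR): hash each element, keep the sample_size smallest normalized hash values,
and estimate the distinct count from how densely they fill [0, 1).

    import heapq
    import mmh3

    def estimate_distinct(elements, sample_size=64):
      heap = []        # max-heap (negated values) of the sample_size smallest hashes
      sampled = set()  # hash values currently held in the heap
      for element in elements:
        # Map the signed 32-bit mmh3 hash to a float in [0, 1).
        h = (mmh3.hash(str(element)) & 0xFFFFFFFF) / float(2 ** 32)
        if h in sampled:
          continue
        if len(heap) < sample_size:
          heapq.heappush(heap, -h)
          sampled.add(h)
        elif h < -heap[0]:
          sampled.discard(-heapq.heappushpop(heap, -h))
          sampled.add(h)
      if len(heap) < sample_size:
        # Fewer distinct hashes than the sample size: the count is exact.
        return len(heap)
      # The k-th smallest of N uniform values on [0, 1) lands near k / N,
      # so N is roughly sample_size divided by the largest retained hash.
      return int(sample_size / -heap[0])

    print(estimate_distinct(i % 1000 for i in range(10000)))

If the hash values are skewed rather than uniform, the largest retained hash
over- or under-shoots k / N and the estimate drifts, which is what the failing
test above is detecting.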
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 242897)
Time Spent: 5h 50m (was: 5h 40m)
> ApproximateUnique transform for Python SDK
> ------------------------------------------
>
> Key: BEAM-6693
> URL: https://issues.apache.org/jira/browse/BEAM-6693
> Project: Beam
> Issue Type: New Feature
> Components: sdk-py-core
> Reporter: Ahmet Altay
> Assignee: Hannah Jiang
> Priority: Minor
> Time Spent: 5h 50m
> Remaining Estimate: 0h
>
> Add a PTransform for estimating the number of distinct elements in a
> PCollection, and the number of distinct values associated with each key in a
> PCollection of KVs.
> It should offer the same API as its Java counterpart:
> https://github.com/apache/beam/blob/11a977b8b26eff2274d706541127c19dc93131a2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java
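> A hypothetical usage sketch (names here are illustrative and assume the
> Python API mirrors the Java transform; the final API is defined by the PR):
>
>   import apache_beam as beam
>
>   with beam.Pipeline() as p:
>     _ = (p
>          | beam.Create(['a', 'b', 'a', 'c', 'b'])
>          | beam.ApproximateUnique.Globally(size=16))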
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)