[ 
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=242897&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242897
 ]

ASF GitHub Bot logged work on BEAM-6693:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/May/19 22:15
            Start Date: 15/May/19 22:15
    Worklog Time Spent: 10m 
      Work Description: Hannah-Jiang commented on pull request #8535: 
[BEAM-6693] ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#discussion_r284472831
 
 

 ##########
 File path: sdks/python/setup.py
 ##########
 @@ -125,6 +125,7 @@ def get_version():
     'pyvcf>=0.6.8,<0.7.0; python_version < "3.0"',
     'pyyaml>=3.12,<4.0.0',
     'typing>=3.6.0,<3.7.0; python_version < "3.5.0"',
+    'mmh3>=2.5.1; python_version >= "2.7"',
 
 Review comment:
   Thanks for asking this question. I was able to dig into more and found an 
unexpected reason.
   From 3.4, Python used SpiHash algorithm, before that, it used FNV algorithm. 
Though SpiHash reduced collision, I don't think it's noticeable from my test. 
So I saw how hash values are distributed. Graphs at 
https://docs.google.com/spreadsheets/d/1QpNAorUhUY1Nq3b4QRMfbB9Rupq3ZC9FshunIyRRfH0/edit#gid=0
 shows how hash values are distributed with different hash algorithm given 
exact the same data set. Builtin hash function with Python 2 is not as 
uniformly distributed as other hash functions (builtin hash function with 
Python 3 and mmh3). Though it's not obvious from the graphs that hash values 
from mmh3 algorithms are more uniformed distributed than the ones from builtin 
hash function with python 3, test results shows that mmh3 has more uniformly 
distributed hash values.
   
   Why does it matter to approximate unique count? We are calculating sample 
space with hash values and calculate population space from the sample space. 
More uniformly distributed hash values give us more accurate sample space thus 
we can calculate more accurate population space. The difference is more obvious 
when we have more duplicate records with input.
   
   A test case is generating 10000 elements within range of [0, 1000]. This 
test consistently fails if we use builtin hash function with py2.7, and fails 
sometimes with builtin hash function with py3 and hasn't fail with mmh3.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 242897)
    Time Spent: 5h 50m  (was: 5h 40m)

> ApproximateUnique transform for Python SDK
> ------------------------------------------
>
>                 Key: BEAM-6693
>                 URL: https://issues.apache.org/jira/browse/BEAM-6693
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-py-core
>            Reporter: Ahmet Altay
>            Assignee: Hannah Jiang
>            Priority: Minor
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> Add a PTransform for estimating the number of distinct elements in a 
> PCollection and the number of distinct values associated with each key in a 
> PCollection KVs.
> it should offer the same API as its Java counterpart: 
> https://github.com/apache/beam/blob/11a977b8b26eff2274d706541127c19dc93131a2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to