Lovasoa created SPARK-21057:
-------------------------------

             Summary: Do not use a PascalDistribution in countApprox
                 Key: SPARK-21057
                 URL: https://issues.apache.org/jira/browse/SPARK-21057
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.1
            Reporter: Lovasoa


I was reading the Spark source and found this:
https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72

This is the function that estimates the probability distribution of the total 
count of elements in an RDD, given the counts from only some of its partitions.

This function does something strange: when the number of elements counted so 
far is less than 10,000, it models the number of uncounted elements with a 
negative binomial (Pascal) distribution; otherwise, it uses a Poisson 
distribution.
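
For context, here is a condensed sketch of that logic (a paraphrase of the 
linked file, not a verbatim copy; the distribution classes are from Apache 
commons-math3, and the exact threshold comparison is in the linked source):

{code:scala}
import org.apache.commons.math3.distribution.{PascalDistribution, PoissonDistribution}

// Condensed paraphrase of CountEvaluator.bound (not a verbatim copy).
// `sum` is the number of elements counted so far (assumed > 0) and
// `p` is the fraction of partitions that have been counted.
def bound(confidence: Double, sum: Long, p: Double): (Long, Long) = {
  val dist =
    if (sum < 10000) { // threshold as described above; see the linked file
      // Negative binomial (Pascal): the uncounted elements are modeled as the
      // failures observed before the `sum`-th success in Bernoulli(p) trials.
      new PascalDistribution(sum.toInt, p)
    } else {
      // Poisson with mean equal to the expected number of uncounted elements.
      new PoissonDistribution(sum * (1 - p) / p)
    }
  // Confidence interval on the total count = counted + uncounted.
  val low = dist.inverseCumulativeProbability((1 - confidence) / 2)
  val high = dist.inverseCumulativeProbability((1 + confidence) / 2)
  (sum + low, sum + high)
}
{code}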

Modeling the number of uncounted elements with a negative binomial 
distribution amounts to saying that we ran over the elements, counting only 
some of them, and stopped after having counted a fixed number of elements.
But that is not what actually happened: our counting was limited in time, not 
in the number of elements counted.
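
Concretely, if K denotes the number of uncounted elements and s the number 
counted so far, the Pascal model assumes

P(K = k) = \binom{k + s - 1}{k} p^s (1 - p)^k,

which is exactly the distribution of the number of failures before the s-th 
success in independent Bernoulli(p) trials: a stopping rule on the number of 
counted elements, not on time.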

I propose to use the Poisson distribution in every case. It can be justified 
under the hypothesis that the numbers of elements in the partitions are 
independent and each follows a Poisson distribution, so the total over the 
uncounted partitions is Poisson as well.
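
A minimal sketch of the proposed change, under the assumptions above (the 
parameter names are mine; sum * (1 - p) / p is the natural estimate of the 
expected number of uncounted elements):

{code:scala}
import org.apache.commons.math3.distribution.PoissonDistribution

// Proposed: always model the number of uncounted elements as Poisson.
// If each partition's element count is an independent Poisson variable,
// the total over the uncounted partitions is Poisson as well, with mean
// estimated from the counted fraction as sum * (1 - p) / p.
def bound(confidence: Double, sum: Long, p: Double): (Long, Long) = {
  require(sum > 0 && p < 1, "commons-math3 requires a strictly positive Poisson mean")
  val dist = new PoissonDistribution(sum * (1 - p) / p)
  val low = dist.inverseCumulativeProbability((1 - confidence) / 2)
  val high = dist.inverseCumulativeProbability((1 + confidence) / 2)
  (sum + low, sum + high)
}
{code}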


