Lovasoa created SPARK-21057:
-------------------------------
Summary: Do not use a PascalDistribution in countApprox
Key: SPARK-21057
URL: https://issues.apache.org/jira/browse/SPARK-21057
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.1
Reporter: Lovasoa
I was reading the source of Spark, and found this:
https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72
This is the function that estimates the probability distribution of the total
count of elements in an RDD given counts from only some of its partitions.
This function does something strange: when the number of elements counted so
far is at most 10,000, it models the number of still-uncounted elements with a
negative binomial (Pascal) law; otherwise, it models it with a Poisson law.
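For readers who don't want to click through, the branch looks roughly like this
(a condensed paraphrase of the linked lines, not a verbatim copy; `sum` is the
number of elements counted so far and `p` the fraction of partitions already
counted):

{code:scala}
import org.apache.commons.math3.distribution.{PascalDistribution, PoissonDistribution}

// The returned distribution models the uncounted remainder, which the caller
// then shifts by `sum` to bound the total count.
def remainderDistribution(sum: Long, p: Double) =
  if (sum <= 10000) {
    // Pascal / negative binomial: the remainder is the number of "failures"
    // observed before `sum` successes, each success having probability p.
    new PascalDistribution(sum.toInt, p)
  } else {
    // Larger counts: approximate the remainder with a Poisson law whose mean
    // is the expected number of elements in the uncounted fraction.
    new PoissonDistribution(sum * (1 - p) / p)
  }
{code}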
Modeling the number of uncounted elements with a negative binomial law amounts
to saying that we ran over the elements one by one, counting each with some
probability, and stopped once a given number of elements had been counted.
But that is not what actually happened: our counting was limited in time, not
by a target number of counted elements.
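To make the mismatch concrete (my notation, not from the source): with r = sum
successes required and success probability p, the Pascal law used here assigns
to the number k of uncounted elements the mass

{noformat}
P(K = k) = \binom{k + r - 1}{k} \, p^r \, (1 - p)^k,   k = 0, 1, 2, ...
{noformat}

which is the distribution of the number of failures observed before the r-th
success, i.e. a process stopped by a quota on counted elements, not by a
timeout.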
I propose to use the Poisson distribution in all cases, as it can be justified
under the hypothesis that the per-partition counts are independent and each
follows a Poisson law.
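A minimal sketch of what I have in mind (the method shape follows the existing
evaluator, but this is my proposal, not committed code; it assumes sum > 0,
since PoissonDistribution requires a strictly positive mean):

{code:scala}
import org.apache.commons.math3.distribution.PoissonDistribution
import org.apache.spark.partial.BoundedDouble

// Always model the uncounted remainder as Poisson: if per-partition counts
// are i.i.d. Poisson and a fraction p of the partitions has been counted,
// totalling `sum` elements, the remainder is Poisson with mean sum*(1-p)/p.
def bound(confidence: Double, sum: Long, p: Double): BoundedDouble = {
  val dist = new PoissonDistribution(sum * (1 - p) / p)
  val low = dist.inverseCumulativeProbability((1 - confidence) / 2)
  val high = dist.inverseCumulativeProbability((1 + confidence) / 2)
  // The law describes the uncounted remainder, so shift everything by `sum`.
  new BoundedDouble(sum + dist.getNumericalMean, confidence, sum + low, sum + high)
}
{code}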