[ 
https://issues.apache.org/jira/browse/BEAM-12181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470050#comment-17470050
 ] 

Brian Hulette edited comment on BEAM-12181 at 1/6/22, 5:04 PM:
---------------------------------------------------------------

I looked into this approach a little bit; it's discussed in the Wikipedia 
article: https://en.wikipedia.org/wiki/Mode_(statistics)#Mode_of_a_sample

{quote}In order to estimate the mode of the underlying distribution, the usual 
practice is to discretize the data by assigning frequency values to intervals 
of equal distance, as for making a histogram, effectively replacing the values 
by the midpoints of the intervals they are assigned to. The mode is then the 
value where the histogram reaches its peak. *For small or middle-sized samples 
the outcome of this procedure is sensitive to the choice of interval width if 
chosen too narrow or too wide*; typically one should have a sizable fraction of 
the data concentrated in a relatively small number of intervals (5 to 10), 
while the fraction of the data falling outside these intervals is also sizable. 
An alternate approach is kernel density estimation, which essentially blurs 
point samples to produce a continuous estimate of the probability density 
function which can provide an estimate of the mode.{quote}

(emphasis mine)

I think we could do this discretization in a distributed way, but how would we 
select the number of bins to use to minimize the error? It might be worth 
looking into the kernel density estimation approach.
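To illustrate why the histogram approach parallelizes well: per-partition bin counts are associative, so they can be computed independently and merged with a simple sum. This is just a plain-Python sketch (not Beam code), and it assumes the global min/max of the data is known up front, which in practice would need its own distributed pass; the bin-count question from above is left as a parameter.

```python
from collections import Counter

def bin_counts(values, lo, hi, num_bins):
    # Per-partition step: map each value to an equal-width bin index.
    # Counts are associative, so partial results merge by summation.
    width = (hi - lo) / num_bins
    counts = Counter()
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

def approximate_mode(partitions, lo, hi, num_bins=10):
    # Merge step: sum the partial histograms, then return the midpoint
    # of the peak bin as the mode estimate.
    total = Counter()
    for p in partitions:
        total.update(bin_counts(p, lo, hi, num_bins))
    width = (hi - lo) / num_bins
    peak = max(total, key=total.get)
    return lo + (peak + 0.5) * width
```

The error of the estimate is bounded by half the bin width, which is exactly where the sensitivity to bin count noted in the quote comes from.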


> Implement parallelized (approximate) mode
> -----------------------------------------
>
>                 Key: BEAM-12181
>                 URL: https://issues.apache.org/jira/browse/BEAM-12181
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe, sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Svetak Vihaan Sundhar
>            Priority: P3
>              Labels: dataframe-api
>
> Currently we require Singleton partitioning to compute mode(). We should 
> provide an option to compute approximate mode() which can be parallelized.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
