[
https://issues.apache.org/jira/browse/BEAM-12181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498380#comment-17498380
]
Brian Hulette commented on BEAM-12181:
--------------------------------------
> IIUC, we can calculate the KDE for partitions of our dataset then combine all
> the kernel estimator values.
Yes, agreed! This sounds like the "parallel KDE at sample level" approach.
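A minimal sketch of what "parallel KDE at sample level" could look like, assuming we fit a kernel estimator per partition and combine them as a mixture weighted by partition size (the partitioning and combining here are stand-ins for what Beam would do, not actual Beam API):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=1000)

# Stand-in for distributed partitions of the dataset.
partitions = np.array_split(data, 4)

# Fit one kernel estimator per partition.
kdes = [gaussian_kde(p) for p in partitions]
weights = np.array([len(p) for p in partitions], dtype=float)
weights /= weights.sum()

def combined_density(xs):
    # Combine per-partition estimators as a size-weighted mixture.
    return sum(w * kde(xs) for w, kde in zip(weights, kdes))

# Approximate mode: argmax of the combined density over a grid.
grid = np.linspace(data.min(), data.max(), 512)
approx_mode = grid[np.argmax(combined_density(grid))]
```

With data drawn from N(5, 1), `approx_mode` should land near 5. Note each `gaussian_kde` here still picks its own bandwidth from its partition's size, which is exactly the bandwidth-selection question below.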
The other bit we'd need to figure out is bandwidth selection. The paper
discusses a parallel implementation of least-squares cross-validation (LSCV),
which is mentioned in the [scipy
implementation|https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde],
but scipy doesn't seem to actually implement it. Scipy just supports Scott's
rule and Silverman's rule, which are both simple closed-form functions of the
number of elements, n. It's probably easiest to just use one of those for now,
since measuring n is something we can do easily.
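For reference, the Scott and Silverman bandwidth factors used by scipy.stats.gaussian_kde depend only on the sample size n and the dimensionality d, which is why a count of elements is all a distributed implementation would need (function names here are illustrative, matching the formulas in the scipy docs):

```python
def scott_factor(n, d=1):
    # Scott's rule: n ** (-1 / (d + 4)), per the scipy gaussian_kde docs.
    return n ** (-1.0 / (d + 4))

def silverman_factor(n, d=1):
    # Silverman's rule: (n * (d + 2) / 4) ** (-1 / (d + 4)).
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))
```

Both shrink as n grows, so the only distributed computation required is summing partition sizes to get n.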
> Implement parallelized (approximate) mode
> -----------------------------------------
>
> Key: BEAM-12181
> URL: https://issues.apache.org/jira/browse/BEAM-12181
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe, sdk-py-core
> Reporter: Brian Hulette
> Priority: P3
> Labels: dataframe-api
>
> Currently we require Singleton partitioning to compute mode(). We should
> provide an option to compute approximate mode() which can be parallelized.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)