[ https://issues.apache.org/jira/browse/BEAM-12181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498380#comment-17498380 ]

Brian Hulette edited comment on BEAM-12181 at 2/26/22, 2:04 AM:
----------------------------------------------------------------

> IIUC, we can calculate the KDE for partitions of our dataset then combine all 
> the kernel estimator values.

Yes, agreed! This sounds like the "parallel KDE at sample level" approach.
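
Roughly what I have in mind for the combination step (a rough sketch in plain 
scipy, not Beam code; the partition arrays are placeholders): fit a 
gaussian_kde per partition with a shared bandwidth factor, then combine them 
as a mixture weighted by partition size:

{code:python}
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder per-partition samples; in Beam these would live on
# different workers.
partitions = [np.random.randn(1000), np.random.randn(800) + 3.0]
total_n = sum(len(p) for p in partitions)

# Use one shared bandwidth factor (Scott's rule computed from the
# *global* n, 1-D case) so the per-partition estimates are combinable.
# gaussian_kde still scales by each partition's own covariance, so this
# only approximates a single KDE over all of the data.
shared_factor = total_n ** (-1.0 / 5.0)
kdes = [(len(p), gaussian_kde(p, bw_method=shared_factor))
        for p in partitions]

def combined_pdf(x):
    # Mixture of per-partition estimates, weighted by partition size.
    return sum(n_p * kde(x) for n_p, kde in kdes) / total_n
{code}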

The other bit we'd need to figure out is bandwidth selection. The paper 
discusses a parallel implementation of least-squares cross-validation (LSCV), 
which is mentioned in the [scipy 
implementation|https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde],
 but scipy doesn't seem to actually implement it. Scipy just supports 
"Scott's method" and "Silverman's method", both of which are simple 
closed-form functions of the number of elements, n (and the dimension, d). 
It's probably easiest to just use one of those for now, since measuring n is 
something we can do easily.
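
For reference, both factors are simple closed forms in n and d (per my 
reading of the scipy source, so treat this as a sketch), which means a 
distributed count is all the bandwidth selection needs:

{code:python}
def scotts_factor(n, d=1):
    # Scott's rule, as used by scipy.stats.gaussian_kde.
    return n ** (-1.0 / (d + 4))

def silverman_factor(n, d=1):
    # Silverman's rule, as used by scipy.stats.gaussian_kde.
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))
{code}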

Also, for a first draft we could just punt on bandwidth selection and only 
support specifying it as a constant, which scipy also allows.
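
e.g. scipy accepts a scalar bw_method and uses it directly as the bandwidth 
factor:

{code:python}
import numpy as np
from scipy.stats import gaussian_kde

data = np.random.randn(1000)  # placeholder sample
kde = gaussian_kde(data, bw_method=0.25)  # scalar is used as kde.factor
{code}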

Also also, we might consider providing our own kde() method that does the 
kernel density estimation (and building approximate mode() on top of that). 
Pandas doesn't have this, but it does have Series.plot.kde(). Pandas probably 
doesn't have it just because it's easy enough for their users to use the 
scipy one.
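
If we did build our own kde(), approximate mode() could just be an argmax of 
the density over a grid. A minimal sketch, with a single gaussian_kde 
standing in for the combined per-partition estimate:

{code:python}
import numpy as np
from scipy.stats import gaussian_kde

data = np.concatenate([np.random.randn(1000), np.random.randn(800) + 3.0])
kde = gaussian_kde(data)  # stand-in for the combined estimate

# Evaluate the density on a grid and take the argmax; the grid
# resolution bounds the error of the approximate mode.
grid = np.linspace(data.min(), data.max(), 1024)
approx_mode = grid[np.argmax(kde(grid))]
{code}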



> Implement parallelized (approximate) mode
> -----------------------------------------
>
>                 Key: BEAM-12181
>                 URL: https://issues.apache.org/jira/browse/BEAM-12181
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe, sdk-py-core
>            Reporter: Brian Hulette
>            Priority: P3
>              Labels: dataframe-api
>
> Currently we require Singleton partitioning to compute mode(). We should 
> provide an option to compute approximate mode() which can be parallelized.


