[
https://issues.apache.org/jira/browse/BEAM-12181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498380#comment-17498380
]
Brian Hulette commented on BEAM-12181:
--------------------------------------
> IIUC, we can calculate the KDE for partitions of our dataset then combine all
> the kernel estimator values.
Yes, agreed! This sounds like the "parallel KDE at sample level" approach.
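A minimal sketch of what "parallel KDE at sample level" could look like, assuming we fit a kernel estimator per partition and combine them as a mixture weighted by partition size (the partitioning and combining here are stand-ins for what Beam would do, not actual Beam API):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=1000)

# Stand-in for distributed partitions of the dataset.
partitions = np.array_split(data, 4)

# Fit one kernel estimator per partition.
kdes = [gaussian_kde(p) for p in partitions]
weights = np.array([len(p) for p in partitions], dtype=float)
weights /= weights.sum()

def combined_density(xs):
    # Combine per-partition estimators as a size-weighted mixture.
    return sum(w * kde(xs) for w, kde in zip(weights, kdes))

# Approximate mode: argmax of the combined density over a grid.
grid = np.linspace(data.min(), data.max(), 512)
approx_mode = grid[np.argmax(combined_density(grid))]
```

With data drawn from N(5, 1), `approx_mode` should land near 5. Note each `gaussian_kde` here still picks its own bandwidth from its partition's size, which is exactly the bandwidth-selection question below.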
The other bit we'd need to figure out is bandwidth selection. The paper
discusses a parallel implementation of least-squares cross-validation (LSCV),
which is mentioned in the [scipy
implementation|https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde],
but scipy doesn't seem to actually implement it. Scipy just supports Scott's
rule and Silverman's rule, which are both simple closed-form functions of the
number of elements, n. It's probably easiest to just use one of those for now,
since measuring n is something we can do easily.
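For reference, the Scott and Silverman bandwidth factors used by scipy.stats.gaussian_kde depend only on the sample size n and the dimensionality d, which is why a count of elements is all a distributed implementation would need (function names here are illustrative, matching the formulas in the scipy docs):

```python
def scott_factor(n, d=1):
    # Scott's rule: n ** (-1 / (d + 4)), per the scipy gaussian_kde docs.
    return n ** (-1.0 / (d + 4))

def silverman_factor(n, d=1):
    # Silverman's rule: (n * (d + 2) / 4) ** (-1 / (d + 4)).
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))
```

Both shrink as n grows, so the only distributed computation required is summing partition sizes to get n.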
> Implement parallelized (approximate) mode
> -----------------------------------------
>
> Key: BEAM-12181
> URL: https://issues.apache.org/jira/browse/BEAM-12181
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe, sdk-py-core
> Reporter: Brian Hulette
> Priority: P3
> Labels: dataframe-api
>
> Currently we require Singleton partitioning to compute mode(). We should
> provide an option to compute approximate mode() which can be parallelized.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)