Repository: spark
Updated Branches:
  refs/heads/branch-1.5 18b3d11f7 -> 5de0ffbd0


[SPARK-7707] User guide and example code for KernelDensity

Author: Sandy Ryza <[email protected]>

Closes #8230 from sryza/sandy-spark-7707.

(cherry picked from commit f9d1a92aa1bac4494022d78559b871149579e6e8)
Signed-off-by: Xiangrui Meng <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5de0ffbd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5de0ffbd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5de0ffbd

Branch: refs/heads/branch-1.5
Commit: 5de0ffbd0e0aef170171cec8808eb4ec1ba79b0f
Parents: 18b3d11
Author: Sandy Ryza <[email protected]>
Authored: Mon Aug 17 17:57:51 2015 -0700
Committer: Xiangrui Meng <[email protected]>
Committed: Mon Aug 17 17:58:06 2015 -0700

----------------------------------------------------------------------
 docs/mllib-statistics.md | 77 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/5de0ffbd/docs/mllib-statistics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index be04d0b..80a9d06 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -528,5 +528,82 @@ u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
 v = u.map(lambda x: 1.0 + 2.0 * x)
 {% endhighlight %}
 </div>
 
+</div>
+
+## Kernel density estimation
+
+[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
+useful for visualizing empirical probability distributions without requiring assumptions about the
+particular distribution that the observed samples are drawn from. It computes an estimate of the
+probability density function of a random variable, evaluated at a given set of points. It achieves
+this estimate by expressing the PDF of the empirical distribution at a particular point as the
+mean of PDFs of normal distributions centered around each of the samples.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight scala %}
+import org.apache.spark.mllib.stat.KernelDensity
+import org.apache.spark.rdd.RDD
+
+val data: RDD[Double] = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+val kd = new KernelDensity()
+  .setSample(data)
+  .setBandwidth(3.0)
+
+// Find density estimates for the given values
+val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight java %}
+import org.apache.spark.mllib.stat.KernelDensity;
+import org.apache.spark.rdd.RDD;
+
+RDD<Double> data = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+KernelDensity kd = new KernelDensity()
+  .setSample(data)
+  .setBandwidth(3.0);
+
+// Find density estimates for the given values
+double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight python %}
+from pyspark.mllib.stat import KernelDensity
+
+data = ... # an RDD of sample data
+
+# Construct the density estimator with the sample data and a standard deviation for the Gaussian
+# kernels
+kd = KernelDensity()
+kd.setSample(data)
+kd.setBandwidth(3.0)
+
+# Find density estimates for the given values
+densities = kd.estimate([-1.0, 2.0, 5.0])
+{% endhighlight %}
+</div>
 </div>
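
The guide text in this patch describes the estimate as the mean of normal PDFs centered on each sample, with the bandwidth acting as the standard deviation of each Gaussian kernel. The following plain-Python sketch spells that formula out on a local list. It is an illustration only, not Spark's implementation; the function name `gaussian_kde` and the sample values are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth, points):
    """Estimate the density at each point as the mean of Normal(sample, bandwidth) PDFs."""
    # Normalizing constant of a Gaussian PDF with standard deviation `bandwidth`.
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    estimates = []
    for x in points:
        # Sum the unnormalized kernel contribution of every sample at x.
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        estimates.append(norm * total / len(samples))
    return estimates


# Hypothetical local sample data standing in for the RDD used in the guide.
samples = [0.0, 1.0, 2.0, 4.0, 7.0]
densities = gaussian_kde(samples, 3.0, [-1.0, 2.0, 5.0])
```

A larger bandwidth spreads each sample's contribution over a wider range and gives a smoother estimate, while a smaller one follows the individual samples more closely.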
