Repository: spark
Updated Branches:
  refs/heads/branch-1.5 18b3d11f7 -> 5de0ffbd0


[SPARK-7707] User guide and example code for KernelDensity

Author: Sandy Ryza <[email protected]>

Closes #8230 from sryza/sandy-spark-7707.

(cherry picked from commit f9d1a92aa1bac4494022d78559b871149579e6e8)
Signed-off-by: Xiangrui Meng <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5de0ffbd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5de0ffbd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5de0ffbd

Branch: refs/heads/branch-1.5
Commit: 5de0ffbd0e0aef170171cec8808eb4ec1ba79b0f
Parents: 18b3d11
Author: Sandy Ryza <[email protected]>
Authored: Mon Aug 17 17:57:51 2015 -0700
Committer: Xiangrui Meng <[email protected]>
Committed: Mon Aug 17 17:58:06 2015 -0700

----------------------------------------------------------------------
 docs/mllib-statistics.md | 77 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/5de0ffbd/docs/mllib-statistics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index be04d0b..80a9d06 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -528,5 +528,82 @@ u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
 v = u.map(lambda x: 1.0 + 2.0 * x)
 {% endhighlight %}
 </div>
+</div>
+
+## Kernel density estimation
+
+[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
+useful for visualizing empirical probability distributions without requiring assumptions about the
+particular distribution that the observed samples are drawn from. It computes an estimate of the
+probability density function of a random variable, evaluated at a given set of points. It achieves
+this estimate by expressing the PDF of the empirical distribution at a particular point as the
+mean of PDFs of normal distributions centered around each of the samples.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight scala %}
+import org.apache.spark.mllib.stat.KernelDensity
+import org.apache.spark.rdd.RDD
+
+val data: RDD[Double] = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+val kd = new KernelDensity()
+  .setSample(data)
+  .setBandwidth(3.0)
+
+// Find density estimates for the given values
+val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight java %}
+import org.apache.spark.mllib.stat.KernelDensity;
+import org.apache.spark.rdd.RDD;
+
+RDD<Double> data = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+KernelDensity kd = new KernelDensity()
+  .setSample(data)
+  .setBandwidth(3.0);
+
+// Find density estimates for the given values
+double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight python %}
+from pyspark.mllib.stat import KernelDensity
+
+data = ... # an RDD of sample data
+
+# Construct the density estimator with the sample data and a standard deviation for the Gaussian
+# kernels
+kd = KernelDensity()
+kd.setSample(data)
+kd.setBandwidth(3.0)
+
+# Find density estimates for the given values
+densities = kd.estimate([-1.0, 2.0, 5.0])
+{% endhighlight %}
+</div>
 
 </div>
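
For readers of this patch, the estimator described in the new guide section (the mean of normal PDFs centered on each sample, with the bandwidth acting as the kernels' standard deviation) can be sketched in a few lines of plain Python. This is only an illustrative local computation, not Spark's implementation: the function name `gaussian_kde` and the sample values are hypothetical, and it works on an in-memory list rather than an RDD.

```python
import math


def gaussian_kde(samples, bandwidth, points):
    """Estimate a density at each of `points` as the mean of Gaussian PDFs.

    Hypothetical helper for illustration only: each sample contributes a
    normal PDF centered on it, with standard deviation `bandwidth`.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")

    # Normalizing constant of a normal PDF with standard deviation `bandwidth`.
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))

    estimates = []
    for x in points:
        # Sum the unnormalized kernel values over all samples.
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        # Average over the samples and apply the normalizing constant.
        estimates.append(norm * total / len(samples))
    return estimates


# Made-up sample data; the evaluation points and bandwidth match the guide's examples.
samples = [0.0, 1.0, 4.0]
densities = gaussian_kde(samples, 3.0, [-1.0, 2.0, 5.0])
```

A larger bandwidth spreads each kernel more widely and gives a smoother estimate, while a smaller one follows the individual samples more closely.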


