spark git commit: [SPARK-5879][MLLIB] update PIC user guide and add a Java example

meng Wed, 18 Feb 2015 16:30:03 -0800

Repository: spark
Updated Branches:
  refs/heads/master aa8f10e82 -> d12d2ad76



[SPARK-5879][MLLIB] update PIC user guide and add a Java example

Updated PIC user guide to reflect API changes and added a simple Java example. 
The API is still not very Java-friendly. I created SPARK-5990 for this issue.

Author: Xiangrui Meng <m...@databricks.com>

Closes #4680 from mengxr/SPARK-5897 and squashes the following commits:

847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d12d2ad7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d12d2ad7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d12d2ad7

Branch: refs/heads/master
Commit: d12d2ad76ee673b819c92dd8093ba0a560847761
Parents: aa8f10e
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Feb 18 16:29:32 2015 -0800
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Feb 18 16:29:32 2015 -0800

----------------------------------------------------------------------
 .../PIClusteringFiveCirclesInputsAndOutputs.png | Bin 249245 -> 0 bytes
 docs/mllib-clustering.md                        |  95 ++++++++++++++++---
 .../JavaPowerIterationClusteringExample.java    |  58 +++++++++++
 .../clustering/PowerIterationClustering.scala   |   9 ++
 4 files changed, 149 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/d12d2ad7/docs/img/PIClusteringFiveCirclesInputsAndOutputs.png
----------------------------------------------------------------------
diff --git a/docs/img/PIClusteringFiveCirclesInputsAndOutputs.png 
b/docs/img/PIClusteringFiveCirclesInputsAndOutputs.png
deleted file mode 100644
index ed9adad..0000000
Binary files a/docs/img/PIClusteringFiveCirclesInputsAndOutputs.png and 
/dev/null differ

http://git-wip-us.apache.org/repos/asf/spark/blob/d12d2ad7/docs/mllib-clustering.md
----------------------------------------------------------------------
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 09b5657..6e46a47 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -270,23 +270,92 @@ for i in range(2):
 
 ## Power iteration clustering (PIC)
 
-Power iteration clustering (PIC) is a scalable and efficient algorithm for 
clustering points given pointwise mutual affinity values.  Internally the 
algorithm:
+Power iteration clustering (PIC) is a scalable and efficient algorithm for 
clustering vertices of a
+graph given pairwise similarties as edge properties,
+described in [Lin and Cohen, Power Iteration 
Clustering](http://www.icml2010.org/papers/387.pdf).
+It computes a pseudo-eigenvector of the normalized affinity matrix of the 
graph via
+[power iteration](http://en.wikipedia.org/wiki/Power_iteration)  and uses it 
to cluster vertices.
+MLlib includes an implementation of PIC using GraphX as its backend.
+It takes an `RDD` of `(srcId, dstId, similarity)` tuples and outputs a model 
with the clustering assignments.
+The similarities must be nonnegative.
+PIC assumes that the similarity measure is symmetric.
+A pair `(srcId, dstId)` regardless of the ordering should appear at most once 
in the input data.
+If a pair is missing from input, their similarity is treated as zero.
+MLlib's PIC implementation takes the following (hyper-)parameters:
+
+* `k`: number of clusters
+* `maxIterations`: maximum number of power iterations
+* `initializationMode`: initialization model. This can be either "random", 
which is the default,
+  to use a random vector as vertex properties, or "degree" to use normalized 
sum similarities.
 
-* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that 
represents a  normalized pairwise affinity between all input points.
-* calculates the principal eigenvalue and eigenvector
-* Clusters each of the input points according to their principal eigenvector 
component value
+**Examples**
+
+In the following, we show code snippets to demonstrate how to use PIC in MLlib.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`PowerIterationClustering`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering)
 
+implements the PIC algorithm.
+It takes an `RDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples 
representing the
+affinity matrix.
+Calling `PowerIterationClustering.run` returns a
+[`PowerIterationClusteringModel`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel),
+which contains the computed clustering assignments.
 
-Details of this algorithm are found within [Power Iteration Clustering, Lin 
and Cohen]{www.icml2010.org/papers/387.pdf}
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.mllib.linalg.Vectors
 
-Example outputs for a dataset inspired by the paper - but with five clusters 
instead of three- have he following output from our implementation:
+val similarities: RDD[(Long, Long, Double)] = ...
+
+val pic = new PowerIteartionClustering()
+  .setK(3)
+  .setMaxIterations(20)
+val model = pic.run(similarities)
+
+model.assignments.foreach { case (vertexId, clusterId) =>
+  println(s"$vertexId -> $clusterId")
+}
+{% endhighlight %}
+
+A full example that produces the experiment described in the PIC paper can be 
found under
+[`examples/`](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala).
+
+</div>
 
-<p style="text-align: center;">
-  <img src="img/PIClusteringFiveCirclesInputsAndOutputs.png"
-       title="The Property Graph"
-       alt="The Property Graph"
-       width="50%" />
-  <!-- Images are downsized intentionally to improve quality on retina 
displays -->
-</p>
+<div data-lang="java" markdown="1">
+
+[`PowerIterationClustering`](api/java/org/apache/spark/mllib/clustering/PowerIterationClustering.html)
+implements the PIC algorithm.
+It takes an `JavaRDD` of `(srcId: Long, dstId: Long, similarity: Double)` 
tuples representing the
+affinity matrix.
+Calling `PowerIterationClustering.run` returns a
+[`PowerIterationClusteringModel`](api/java/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html)
+which contains the computed clustering assignments.
+
+{% highlight java %}
+import scala.Tuple2;
+import scala.Tuple3;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.clustering.PowerIterationClustering;
+import org.apache.spark.mllib.clustering.PowerIterationClusteringModel;
+
+JavaRDD<Tuple3<Long, Long, Double>> similarities = ...
+
+PowerIterationClustering pic = new PowerIterationClustering()
+  .setK(2)
+  .setMaxIterations(10);
+PowerIterationClusteringModel model = pic.run(similarities);
+
+for (Tuple2<Object, Object> assignment: 
model.assignments().toJavaRDD().collect()) {
+  System.out.println(assignment._1() + " -> " + assignment._2());
+}
+{% endhighlight %}
+</div>
+
+</div>
 
 ## Latent Dirichlet allocation (LDA)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d12d2ad7/examples/src/main/java/org/apache/spark/examples/mllib/JavaPowerIterationClusteringExample.java
----------------------------------------------------------------------
diff --git 
a/examples/src/main/java/org/apache/spark/examples/mllib/JavaPowerIterationClusteringExample.java
 
b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPowerIterationClusteringExample.java
new file mode 100644
index 0000000..e9371de
--- /dev/null
+++ 
b/examples/src/main/java/org/apache/spark/examples/mllib/JavaPowerIterationClusteringExample.java
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib;
+
+import scala.Tuple2;
+import scala.Tuple3;
+
+import com.google.common.collect.Lists;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.clustering.PowerIterationClustering;
+import org.apache.spark.mllib.clustering.PowerIterationClusteringModel;
+
+/**
+ * Java example for graph clustering using power iteration clustering (PIC).
+ */
+public class JavaPowerIterationClusteringExample {
+  public static void main(String[] args) {
+    SparkConf sparkConf = new 
SparkConf().setAppName("JavaPowerIterationClusteringExample");
+    JavaSparkContext sc = new JavaSparkContext(sparkConf);
+
+    @SuppressWarnings("unchecked")
+    JavaRDD<Tuple3<Long, Long, Double>> similarities = 
sc.parallelize(Lists.newArrayList(
+      new Tuple3<Long, Long, Double>(0L, 1L, 0.9),
+      new Tuple3<Long, Long, Double>(1L, 2L, 0.9),
+      new Tuple3<Long, Long, Double>(2L, 3L, 0.9),
+      new Tuple3<Long, Long, Double>(3L, 4L, 0.1),
+      new Tuple3<Long, Long, Double>(4L, 5L, 0.9)));
+
+    PowerIterationClustering pic = new PowerIterationClustering()
+      .setK(2)
+      .setMaxIterations(10);
+    PowerIterationClusteringModel model = pic.run(similarities);
+
+    for (Tuple2<Object, Object> assignment: 
model.assignments().toJavaRDD().collect()) {
+      System.out.println(assignment._1() + " -> " + assignment._2());
+    }
+
+    sc.stop();
+  }
+}

http://git-wip-us.apache.org/repos/asf/spark/blob/d12d2ad7/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 
b/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
index 3b1caf0..63d0334 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
@@ -17,6 +17,7 @@
 
 package org.apache.spark.mllib.clustering
 
+import org.apache.spark.api.java.JavaRDD
 import org.apache.spark.{Logging, SparkException}
 import org.apache.spark.annotation.Experimental
 import org.apache.spark.graphx._
@@ -116,6 +117,14 @@ class PowerIterationClustering private[clustering] (
   }
 
   /**
+   * A Java-friendly version of [[PowerIterationClustering.run]].
+   */
+  def run(similarities: JavaRDD[(java.lang.Long, java.lang.Long, 
java.lang.Double)])
+    : PowerIterationClusteringModel = {
+    run(similarities.rdd.asInstanceOf[RDD[(Long, Long, Double)]])
+  }
+
+  /**
    * Runs the PIC algorithm.
    *
    * @param w The normalized affinity matrix, which is the matrix W in the PIC 
paper with


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-5879][MLLIB] update PIC user guide and add a Java example

Reply via email to