Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21513#discussion_r194244698

--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1157,213 @@ def getKeepLastCheckpoint(self):
         return self.getOrDefault(self.keepLastCheckpoint)


+@inherit_doc
+class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable,
+                               JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
+    `Lin and Cohen <http://www.icml2010.org/papers/387.pdf>`_. From the abstract:
+    PIC finds a very low-dimensional embedding of a dataset using truncated power
+    iteration on a normalized pair-wise similarity matrix of the data.
+
+    This class is not yet an Estimator/Transformer; use the :py:func:`assignClusters`
+    method to run the PowerIterationClustering algorithm.
+
+    .. seealso:: `Wikipedia on Spectral clustering \
+        <http://en.wikipedia.org/wiki/Spectral_clustering>`_
+
+    >>> data = [(1, 0, 0.5), \
+                (2, 0, 0.5), \
+                (2, 1, 0.7), \
+                (3, 0, 0.5), \
+                (3, 1, 0.7), \
+                (3, 2, 0.9), \
+                (4, 0, 0.5), \
+                (4, 1, 0.7), \
+                (4, 2, 0.9), \
+                (4, 3, 1.1), \
+                (5, 0, 0.5), \
+                (5, 1, 0.7), \
+                (5, 2, 0.9), \
+                (5, 3, 1.1), \
+                (5, 4, 1.3)]
+    >>> df = spark.createDataFrame(data).toDF("src", "dst", "weight")
+    >>> pic = PowerIterationClustering()
+    >>> assignments = pic.setK(2).setMaxIter(40).setWeightCol("weight").assignClusters(df)
+    >>> assignments.sort(assignments.id).show(truncate=False)
+    +---+-------+
+    |id |cluster|
+    +---+-------+
+    |0  |1      |
+    |1  |1      |
+    |2  |1      |
+    |3  |1      |
+    |4  |1      |
+    |5  |0      |
+    +---+-------+
+    ...
+    >>> pic_path = temp_path + "/pic"
+    >>> pic.save(pic_path)
+    >>> pic2 = PowerIterationClustering.load(pic_path)
+    >>> pic2.getK()
+    2
+    >>> pic2.getMaxIter()
+    40
+    >>> assignments2 = pic2.assignClusters(df)
--- End diff --

    The `assignments2` check (and likewise `pic3`) seems unnecessary to me as a doctest. Doctests are mainly meant to provide examples, not to serve as a full suite of unit tests.
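To illustrate the suggestion, the persistence portion of the doctest could be trimmed to just the save/load round trip and parameter checks, with the re-clustering of the loaded instance moved into the unit-test suite instead. A sketch, assuming the same `pic` and `temp_path` fixtures as the quoted diff:

```
>>> pic_path = temp_path + "/pic"
>>> pic.save(pic_path)
>>> pic2 = PowerIterationClustering.load(pic_path)
>>> pic2.getK()
2
>>> pic2.getMaxIter()
40
```

This keeps the docstring focused on demonstrating usage, while exhaustive round-trip verification (re-running `assignClusters` on `pic2`, the `pic3` copy, etc.) lives in the regular test files.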