Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21513#discussion_r194244592

    --- Diff: python/pyspark/ml/clustering.py ---
    @@ -1156,6 +1157,213 @@ def getKeepLastCheckpoint(self):
             return self.getOrDefault(self.keepLastCheckpoint)


    +@inherit_doc
    +class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable,
    +                               JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
    +    <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
    +    PIC finds a very low-dimensional embedding of a dataset using truncated power
    +    iteration on a normalized pair-wise similarity matrix of the data.
    +
    +    This class is not yet an Estimator/Transformer, use :py:func:`assignClusters` method
    +    to run the PowerIterationClustering algorithm.
    +
    +    .. seealso:: `Wikipedia on Spectral clustering \
    +        <http://en.wikipedia.org/wiki/Spectral_clustering>`_
    +
    +    >>> data = [((long)(1), (long)(0), 0.5), \
    --- End diff --

    Users do not know that we make `long` an alias of `int` in PySpark under Python 3. I think in both Py2 and Py3, PySpark infers a Python int/long as long type in DataFrames. Could you help verify? If that is the case, we can drop `(long)(...)` here. If not, we can cast the columns to long type after creating the DataFrame.