Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21513#discussion_r194244592

    --- Diff: python/pyspark/ml/clustering.py ---
    @@ -1156,6 +1157,213 @@ def getKeepLastCheckpoint(self):
             return self.getOrDefault(self.keepLastCheckpoint)


    +@inherit_doc
    +class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable,
    +                               JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
    +    <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
    +    PIC finds a very low-dimensional embedding of a dataset using truncated power
    +    iteration on a normalized pair-wise similarity matrix of the data.
    +
    +    This class is not yet an Estimator/Transformer, use :py:func:`assignClusters` method
    +    to run the PowerIterationClustering algorithm.
    +
    +    .. seealso:: `Wikipedia on Spectral clustering \
    +        <http://en.wikipedia.org/wiki/Spectral_clustering>`_
    +
    +    >>> data = [((long)(1), (long)(0), 0.5), \
    --- End diff --

    Users do not know that we make `long` an alias of `int` in PySpark under Python 3. I think in both Py2 and Py3, PySpark infers a Python int/long as long type in DataFrames. Could you help verify? If that is the case, we can drop `(long)(...)` here. If not, we can cast the columns to long type after creating the DataFrame.