Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21513#discussion_r194244698

--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1157,213 @@ def getKeepLastCheckpoint(self):
         return self.getOrDefault(self.keepLastCheckpoint)


+@inherit_doc
+class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable,
+                               JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
+    `Lin and Cohen <http://www.icml2010.org/papers/387.pdf>`_. From the abstract:
+    PIC finds a very low-dimensional embedding of a dataset using truncated power
+    iteration on a normalized pair-wise similarity matrix of the data.
+
+    This class is not yet an Estimator/Transformer; use the :py:func:`assignClusters`
+    method to run the PowerIterationClustering algorithm.
+
+    .. seealso:: `Wikipedia on Spectral clustering \
+        <http://en.wikipedia.org/wiki/Spectral_clustering>`_
+
+    >>> data = [(1, 0, 0.5), \
+                (2, 0, 0.5), \
+                (2, 1, 0.7), \
+                (3, 0, 0.5), \
+                (3, 1, 0.7), \
+                (3, 2, 0.9), \
+                (4, 0, 0.5), \
+                (4, 1, 0.7), \
+                (4, 2, 0.9), \
+                (4, 3, 1.1), \
+                (5, 0, 0.5), \
+                (5, 1, 0.7), \
+                (5, 2, 0.9), \
+                (5, 3, 1.1), \
+                (5, 4, 1.3)]
+    >>> df = spark.createDataFrame(data).toDF("src", "dst", "weight")
+    >>> pic = PowerIterationClustering()
+    >>> assignments = pic.setK(2).setMaxIter(40).setWeightCol("weight").assignClusters(df)
+    >>> assignments.sort(assignments.id).show(truncate=False)
+    +---+-------+
+    |id |cluster|
+    +---+-------+
+    |0  |1      |
+    |1  |1      |
+    |2  |1      |
+    |3  |1      |
+    |4  |1      |
+    |5  |0      |
+    +---+-------+
+    ...
+    >>> pic_path = temp_path + "/pic"
+    >>> pic.save(pic_path)
+    >>> pic2 = PowerIterationClustering.load(pic_path)
+    >>> pic2.getK()
+    2
+    >>> pic2.getMaxIter()
+    40
+    >>> assignments2 = pic2.assignClusters(df)
--- End diff --

    The `assignments2` check (and likewise `pic3`) seems unnecessary to me as a doctest. Doctests are mainly meant to provide examples, not to serve as a full suite of unit tests.
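To illustrate the suggestion, the persistence portion of the doctest could be trimmed to just the save/load round trip and parameter checks, with the re-clustering of the loaded instance moved into the unit-test suite instead. A sketch, assuming the same `pic` and `temp_path` fixtures as the quoted diff:

```
>>> pic_path = temp_path + "/pic"
>>> pic.save(pic_path)
>>> pic2 = PowerIterationClustering.load(pic_path)
>>> pic2.getK()
2
>>> pic2.getMaxIter()
40
```

This keeps the docstring focused on demonstrating usage, while exhaustive round-trip verification (re-running `assignClusters` on `pic2`, the `pic3` copy, etc.) lives in the regular test files.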