[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

jkbradley Tue, 12 Jan 2016 10:36:18 -0800

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10150#discussion_r49494004
  
    --- Diff: python/pyspark/mllib/clustering.py ---
    @@ -38,13 +38,129 @@
     from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
     from pyspark.streaming import DStream
     
    -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
    -           'PowerIterationClusteringModel', 'PowerIterationClustering',
    -           'StreamingKMeans', 'StreamingKMeansModel',
    +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
    +           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
    +           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
                'LDA', 'LDAModel']
     
     
     @inherit_doc
    +class BisectingKMeansModel(JavaModelWrapper):
    +    """
    +    .. note:: Experimental
    +
    +    A clustering model derived from the bisecting k-means method.
    +
    +    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
    +    >>> bskm = BisectingKMeans()
    +    >>> model = bskm.train(sc.parallelize(data, 2), k=4)
    +    >>> p = array([0.0, 0.0])
    +    >>> model.predict(p)
    +    0
    +    >>> model.k
    +    4
    +    >>> model.computeCost(p)
    +    0.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    def __init__(self, java_model):
    +        super(BisectingKMeansModel, self).__init__(java_model)
    +        self.centers = [c.toArray() for c in self.call("clusterCenters")]
    +
    +    @property
    +    @since('2.0.0')
    +    def clusterCenters(self):
    +        """Get the cluster centers, represented as a list of NumPy
    +        arrays."""
    +        return self.centers
    +
    +    @property
    +    @since('2.0.0')
    +    def k(self):
    +        """Get the number of clusters"""
    +        return self.call("k")
    +
    +    @since('2.0.0')
    +    def predict(self, x):
    +        """
    +        Find the cluster that each of the points belongs to in this
    +        model.
    +
    +        :param x: the point (or RDD of points) to compute the
    +          clusters for.
    +        """
    +        if isinstance(x, RDD):
    +            vecs = x.map(_convert_to_vector)
    +            return self.call("predict", vecs)
    +
    +        x = _convert_to_vector(x)
    +        return self.call("predict", x)
    +
    +    @since('2.0.0')
    +    def computeCost(self, x):
    +        """
    +        Return the Bisecting K-means cost (sum of squared distances of
    +        points to their nearest center) for this model on the given
    +        data. If provided with an RDD of points returns the sum.
    +
    +        :param x: the point or RDD of points to compute the cost(s) for.
    +        """
    +        if isinstance(x, RDD):
    +            vecs = x.map(_convert_to_vector)
    +            return self.call("computeCost", vecs)
    +
    +        return self.call("computeCost", _convert_to_vector(x))
    +
    +
    +class BisectingKMeans(object):
    +    """
    +    .. note:: Experimental
    +
    +    A bisecting k-means algorithm based on the paper "A comparison of
    +    document clustering techniques" by Steinbach, Karypis, and Kumar,
    +    with modification to fit Spark.
    +    The algorithm starts from a single cluster that contains all points.
    +    Iteratively it finds divisible clusters on the bottom level and
    +    bisects each of them using k-means, until there are `k` leaf
    +    clusters in total or no leaf clusters are divisible.
    +    The bisecting steps of clusters on the same level are grouped
    +    together to increase parallelism. If bisecting all divisible
    +    clusters on the bottom level would result in more than `k` leaf
    +    clusters, larger clusters get higher priority.
    +
    +    Based on U{http://bit.ly/1OTnFP1} Steinbach, Karypis, and Kumar, A
    --- End diff --
    
    I'd prefer to keep the original link.  Bitly links might not make people happy, since it's less clear what they link to.
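
For readers following the thread, the level-by-level procedure described in the docstring above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark implementation: the helper names (`sq_dist`, `mean`, `two_means`, `bisecting_kmeans`) are hypothetical, the two-way split uses a deterministic "first point / farthest point" initialization chosen for reproducibility, and everything runs locally rather than on an RDD.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def two_means(points, iters=20):
    """Bisect one cluster with k-means (k=2).

    Assumption: deterministic init (first point and the point farthest
    from it); the caller only passes clusters with >1 distinct point.
    """
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        new_c0, new_c1 = mean(left), mean(right)
        if (new_c0, new_c1) == (c0, c1):  # assignments have converged
            break
        c0, c1 = new_c0, new_c1
    return left, right


def bisecting_kmeans(points, k, iters=20):
    """Return a list of leaf clusters (each a list of points)."""
    done = []                              # leaves above the bottom level
    bottom = [[tuple(p) for p in points]]  # start: one cluster with all points
    while len(done) + len(bottom) < k:
        # A cluster is divisible if it holds more than one distinct point.
        divisible = sorted((c for c in bottom if len(set(c)) > 1),
                           key=len, reverse=True)
        # Each bisection adds one leaf, so split at most this many clusters;
        # sorting by size gives larger clusters the higher priority.
        budget = k - (len(done) + len(bottom))
        chosen = divisible[:budget]
        if not chosen:                     # no leaf cluster is divisible
            break
        chosen_ids = {id(c) for c in chosen}
        done.extend(c for c in bottom if id(c) not in chosen_ids)
        bottom = []
        for cluster in chosen:
            bottom.extend(two_means(cluster, iters))
    return done + bottom


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
    clusters = bisecting_kmeans(data, k=2)
    print([mean(c) for c in clusters])  # [(0.5, 0.5), (8.5, 8.5)]
```

With the four points from the doctest and `k=4`, the sketch ends with four single-point leaves; asking for more clusters than there are distinct points simply stops early, matching the "no leaf clusters are divisible" condition in the docstring.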


