[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

mengxr Tue, 28 Oct 2014 10:57:07 -0700

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2942#discussion_r19490241
  
    --- Diff: docs/mllib-clustering.md ---
    @@ -153,3 +153,75 @@ provided in the [Self-Contained 
Applications](quick-start.html#self-contained-ap
     section of the Spark
     Quick Start guide. Be sure to also include *spark-mllib* to your build 
file as
     a dependency.
    +
    +## Streaming clustering
    +
    +When data arrive in a stream, we may want to estimate clusters 
dynamically, updating them as new data arrive. MLlib provides support for 
streaming KMeans clustering, with parameters to control the decay (or 
"forgetfulness") of the estimates. The algorithm uses a generalization of the 
mini-batch KMeans update rule. For each batch of data, we assign all points to 
their nearest cluster, compute new cluster centers, then update each cluster 
using:
    +
    +`\begin{equation}
    +    c_{t+1} = \frac{c_tn_t\alpha + x_tm_t}{n_t\alpha+m_t}
    +\end{equation}`
    +`\begin{equation}
    +    n_{t+1} = n_t + m_t  
    +\end{equation}`
    +
    +Where `$c_t$` is the previous center for the cluster, `$n_t$` is the 
number of points assigned to the cluster thus far, `$x_t$` is the new cluster 
center from the current batch, and `$m_t$` is the number of points added to the 
cluster in the current batch. The decay factor `$\alpha$` can be used to ignore 
the past: with `$\alpha$=1` all data will be used from the beginning; with 
`$\alpha$=0` only the most recent data will be used. This is analogous to an 
exponentially-weighted moving average.
    --- End diff --
    
    1. This line is too wide; please wrap it.
    2. `alpha` is a constant, independent of `n_t` and `m_t`, so we can treat 
either `batch` or `point` as the time unit. Using `point` as the time unit is 
not mentioned here. It is okay to put a link to the generated Scala doc.
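
For readers following the thread, the update rule quoted in the diff can be sketched in a few lines of Python. This is only an illustration of the two formulas as written above, with `batch` as the time unit; it is not the MLlib implementation. The function name `update_center` and the choice to leave a cluster unchanged when the batch assigns it no points are assumptions made for this sketch.

```python
def update_center(c_t, n_t, x_t, m_t, alpha):
    """One streaming k-means update for a single cluster.

    c_t   -- previous center (list of floats)
    n_t   -- number of points assigned to the cluster so far
    x_t   -- center of the points assigned in the current batch
    m_t   -- number of points assigned in the current batch
    alpha -- decay factor applied once per batch
    """
    # Assumption for this sketch: an empty batch leaves the cluster as it is,
    # which also avoids a zero denominator when alpha == 0.
    if m_t == 0:
        return list(c_t), n_t

    denom = n_t * alpha + m_t
    # c_{t+1} = (c_t * n_t * alpha + x_t * m_t) / (n_t * alpha + m_t)
    c_next = [(c * n_t * alpha + x * m_t) / denom for c, x in zip(c_t, x_t)]
    # n_{t+1} = n_t + m_t
    n_next = n_t + m_t
    return c_next, n_next
```

With `alpha = 1.0`, `update_center([0.0], 10, [1.0], 10, 1.0)` weights old and new points equally and returns `([0.5], 20)`; with `alpha = 0.0` the same call returns `([1.0], 20)`, i.e. only the current batch counts.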




