Kam Kasravi created GEARPUMP-55:
-----------------------------------

             Summary: Add kmeans example
                 Key: GEARPUMP-55
                 URL: https://issues.apache.org/jira/browse/GEARPUMP-55
             Project: Apache Gearpump
          Issue Type: New Feature
          Components: examples
    Affects Versions: 0.8.0
            Reporter: Kam Kasravi
            Priority: Minor
             Fix For: 0.8.1


>From [pangolulu|https://github.com/pangolulu]

There is a document about streaming kmeans in Spark 
(https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html),
 I think we can try to implement it on Gearpump. Here is my processor topology 
on Gearpump:
!https://cloud.githubusercontent.com/assets/5796671/14097520/93a2b498-f5a4-11e5-8df8-ef2b62c3b5ff.PNG!

The `Source Processor` will produce points by time, then broadcast the point to 
the `Distribution Processor`. The number of tasks of the `Distribution 
Processor` is k, where each task save one center and the corresponding points. 
When `Distribution Processor` receives a point from `Source Processor`, it will 
calculate the distance of this point to its center, and then send the distance 
along with the point and its `taskId` to the `Collection Processor`. When 
`Collection Processor` receives the distance from `Distribution Processor`, it 
will accumulate the number of current points, determine if it's time to update 
center, choose the smallest distance and then send the point along with its 
corresponding `Distribution Processor` taskId by broadcast partitioner. When 
`Distribution Processor` receives the result message, task with the 
corresponding `taskId` will accumulate the point. If `Distribution Processor` 
receives that it's time to update center, then all the tasks will update its 
corresponding center.

This procedure is streaming and the center of cluster will change by time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to