Kam Kasravi created GEARPUMP-55:
-----------------------------------
Summary: Add kmeans example
Key: GEARPUMP-55
URL: https://issues.apache.org/jira/browse/GEARPUMP-55
Project: Apache Gearpump
Issue Type: New Feature
Components: examples
Affects Versions: 0.8.0
Reporter: Kam Kasravi
Priority: Minor
Fix For: 0.8.1
>From [pangolulu|https://github.com/pangolulu]
There is a document about streaming kmeans in Spark
(https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html),
I think we can try to implement it on Gearpump. Here is my processor topology
on Gearpump:
!https://cloud.githubusercontent.com/assets/5796671/14097520/93a2b498-f5a4-11e5-8df8-ef2b62c3b5ff.PNG!
The `Source Processor` will produce points by time, then broadcast the point to
the `Distribution Processor`. The number of tasks of the `Distribution
Processor` is k, where each task save one center and the corresponding points.
When `Distribution Processor` receives a point from `Source Processor`, it will
calculate the distance of this point to its center, and then send the distance
along with the point and its `taskId` to the `Collection Processor`. When
`Collection Processor` receives the distance from `Distribution Processor`, it
will accumulate the number of current points, determine if it's time to update
center, choose the smallest distance and then send the point along with its
corresponding `Distribution Processor` taskId by broadcast partitioner. When
`Distribution Processor` receives the result message, task with the
corresponding `taskId` will accumulate the point. If `Distribution Processor`
receives that it's time to update center, then all the tasks will update its
corresponding center.
This procedure is streaming and the center of cluster will change by time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)