[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering

Xiangrui Meng (JIRA) Tue, 25 Nov 2014 00:15:13 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224188#comment-14224188
 ]


Xiangrui Meng commented on SPARK-3588:
--------------------------------------

Since [~tgaloppo] has already submitted a PR, we should try to avoid duplicate 
work unless there are major differences in the design. It would be great if you 
could help review his PR: https://github.com/apache/spark/pull/3022 . If its 
design is similar, let's work together on that implementation; e.g., you can 
make comments or send PRs to his branch. Otherwise, we can discuss the design 
further.

> Gaussian Mixture Model clustering
> ---------------------------------
>
>                 Key: SPARK-3588
>                 URL: https://issues.apache.org/jira/browse/SPARK-3588
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib, PySpark
>            Reporter: Meethu Mathew
>            Assignee: Meethu Mathew
>         Attachments: GMMSpark.py
>
>
> The Gaussian Mixture Model (GMM) is a popular technique for soft clustering. 
> A GMM models the entire data set as a finite mixture of Gaussian 
> distributions, each parameterized by a mean vector µ, a covariance matrix Σ, 
> and a mixture weight π. In this technique, the probability that each point 
> belongs to each cluster is computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark, 
> where the parameters are estimated using the Expectation-Maximization 
> algorithm. Our current implementation uses a diagonal covariance matrix for 
> each component.
> We ran an initial benchmark study on a 2-node Spark standalone cluster, where 
> each node has 8 cores and 8 GB of RAM; the Spark version used is 1.0.0. 
> We also evaluated the Python version of k-means available in Spark on the 
> same datasets.
> Below are the results of this benchmark study. The reported stats are 
> averages over 10 runs. Tests were done on multiple datasets with varying 
> numbers of features and instances.
> ||Instances||Dimensions||GMM: avg time per iteration||GMM: time for 100 iterations||K-means (Python): avg time per iteration||K-means (Python): time for 100 iterations||
> |0.7 million|13|7 s|12 min|13 s|26 min|
> |1.8 million|11|17 s|29 min|33 s|53 min|
> |10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
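
For readers following the issue, the Expectation-Maximization updates for a GMM with diagonal covariance matrices can be sketched as below. This is a minimal single-machine NumPy illustration, not the attached GMMSpark.py or the implementation in the referenced PR; the function names (`log_gaussian_diag`, `em_step`, `fit_gmm_diag`), the variance floor `min_var`, and the initialization scheme are all assumptions made for the example.

```python
import numpy as np


def log_gaussian_diag(X, means, variances):
    """Log density of every point under every diagonal-covariance Gaussian.

    X: (n, d), means: (k, d), variances: (k, d). Returns an (n, k) array.
    """
    diff = X[:, None, :] - means[None, :, :]                   # (n, k, d)
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (k,)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (n, k)
    return -0.5 * (log_det[None, :] + mahal)


def em_step(X, weights, means, variances, min_var=1e-6):
    """One EM iteration; min_var is a hypothetical floor to keep variances positive."""
    # E-step: soft assignments (responsibilities), computed in log space.
    log_p = np.log(weights)[None, :] + log_gaussian_diag(X, means, variances)
    log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
    resp = np.exp(log_p - log_norm)                            # (n, k)

    # M-step: re-estimate parameters from responsibility-weighted sums.
    nk = resp.sum(axis=0) + 1e-10                              # soft counts per component
    new_weights = nk / nk.sum()
    new_means = (resp.T @ X) / nk[:, None]
    second_moment = (resp.T @ (X ** 2)) / nk[:, None]
    new_variances = np.maximum(second_moment - new_means ** 2, min_var)

    # Log-likelihood of the data under the parameters passed in.
    return new_weights, new_means, new_variances, resp, float(log_norm.sum())


def fit_gmm_diag(X, k, n_iter=50, seed=0):
    """Run a fixed number of EM iterations from a simple random initialization."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    means = X[rng.choice(n, size=k, replace=False)]
    variances = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)
    resp = None
    for _ in range(n_iter):
        weights, means, variances, resp, _ = em_step(X, weights, means, variances)
    return weights, means, variances, resp
```

Calling `fit_gmm_diag(X, k=3)` on an `(n, d)` array returns the mixture weights, the per-component means and variances, and the `(n, k)` matrix of soft assignments. The M-step depends only on three sums over the data (`resp.sum(axis=0)`, `resp.T @ X`, and `resp.T @ (X ** 2)`), so a distributed version could plausibly compute these per partition and add them up; that is an assumption about how such a design might look, not a description of either submitted implementation.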



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

