[
https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224188#comment-14224188
]
Xiangrui Meng commented on SPARK-3588:
--------------------------------------
Since [~tgaloppo] has already submitted a PR, we should try to avoid duplicate work
unless there are major differences in the design. It would be great if you could
help review his PR: https://github.com/apache/spark/pull/3022 . If its design
is similar, let's work together on that implementation, e.g., you can make
comments or send PRs to his branch. Or we can discuss the design further.
> Gaussian Mixture Model clustering
> ---------------------------------
>
> Key: SPARK-3588
> URL: https://issues.apache.org/jira/browse/SPARK-3588
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, PySpark
> Reporter: Meethu Mathew
> Assignee: Meethu Mathew
> Attachments: GMMSpark.py
>
>
> Gaussian Mixture Models (GMM) are a popular technique for soft clustering. A GMM
> models the entire data set as a finite mixture of Gaussian distributions, each
> parameterized by a mean vector µ, a covariance matrix Σ, and a mixture weight
> π. In this technique, the probability that each point belongs to each cluster is
> computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark
> where the parameters are estimated using the Expectation-Maximization
> algorithm. Our current implementation considers a diagonal covariance matrix for
> each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster
> where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0.
> We also evaluated the Python version of k-means available in Spark on the same
> datasets.
> Below are the results from this benchmark study. The reported stats are
> averages over 10 runs. Tests were done on multiple datasets with varying numbers
> of features and instances.
> || Instances || Dimensions || GMM: avg time per iteration || GMM: time for 100 iterations || K-means (Python): avg time per iteration || K-means (Python): time for 100 iterations ||
> | 0.7 million | 13 | 7 s | 12 min | 13 s | 26 min |
> | 1.8 million | 11 | 17 s | 29 min | 33 s | 53 min |
> | 10 million | 16 | 1.6 min | 2.7 hr | 1.2 min | 2 hr |
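
For readers unfamiliar with the approach, the EM procedure for a diagonal-covariance GMM described in the issue can be sketched roughly as follows. This is a minimal single-machine NumPy illustration, not the attached GMMSpark.py and not the code in PR 3022. The function name em_diag_gmm, its parameters, the random initialisation, and the eps smoothing term are all assumptions made for the example. A distributed version would presumably compute the per-point responsibilities in a map step and aggregate the sufficient statistics in a reduce step.

```python
import numpy as np


def em_diag_gmm(X, k, n_iter=50, seed=0, eps=1e-6):
    """Fit a k-component GMM with diagonal covariances by EM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumed initialisation: k distinct data points as means,
    # the global per-feature variance for every component, uniform weights.
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    variances = np.tile(X.var(axis=0) + eps, (k, 1))
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: log(weight_j * N(x_i | mean_j, diag(var_j))) for every pair (i, j).
        diff = X[:, None, :] - means[None, :, :]                      # (n, k, d)
        log_pdf = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                          + np.sum(np.log(2 * np.pi * variances), axis=1))
        log_joint = np.log(weights) + log_pdf                         # (n, k)
        # Normalise in log space so tiny densities do not underflow.
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_norm)                           # soft assignments

        # M-step: re-estimate weights, means and per-feature variances.
        nk = resp.sum(axis=0) + eps                                   # effective counts
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        diff = X[:, None, :] - means[None, :, :]
        variances = np.einsum("nk,nkd->kd", resp, diff ** 2) / nk[:, None] + eps

    # resp holds the responsibilities from the final E-step.
    return weights, means, variances, resp
```

Each row of resp is the soft cluster membership of one point, which is what distinguishes this from the hard assignments of k-means. Restricting each component to a diagonal covariance keeps the per-component cost linear in the number of features.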