[jira] [Commented] (SPARK-6137) G-Means clustering algorithm implementation

Joseph K. Bradley (JIRA) Wed, 08 Apr 2015 14:26:08 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-6137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486085#comment-14486085
 ]


Joseph K. Bradley commented on SPARK-6137:
------------------------------------------

[~rumshenoy] Thanks for your interest!  If you're new to Spark contributions, 
I'd strongly recommend starting with smaller patches before working on a new 
algorithm.  This helps you to get used to Spark's review process, coding style, 
etc., and it helps reviewers and committers get to know you (since they need to 
allocate their time for reviewing carefully).  I'd recommend finding a JIRA to 
work on by browsing topics of interest to you and finding ones which sound 
smaller.  Here's some more info on contributing:
[https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark]

Once you're acclimated, you can request to work on a task like this again.  A 
task like this will require:
* Understanding the literature: Is this the best algorithm, and is it commonly 
used enough to be in Spark (as opposed to a package)?  (Others can help out 
here.)
* API design, implementation design, testing, documentation
* Scalability testing: Make sure the distributed implementation is efficient

> G-Means clustering algorithm implementation
> -------------------------------------------
>
>                 Key: SPARK-6137
>                 URL: https://issues.apache.org/jira/browse/SPARK-6137
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Denis Dus
>            Priority: Minor
>              Labels: clustering
>
> Would it be useful to implement the G-Means clustering algorithm, based on 
> K-Means? G-means is a powerful extension of k-means that uses a normality 
> test on each cluster's data to decide whether the cluster should be split 
> in two. 
> Its complexity relative to k-means is O(K), where K is the maximum 
> number of clusters. 
> The original paper is by Greg Hamerly and Charles Elkan from the University 
> of California:
> [http://papers.nips.cc/paper/2526-learning-the-k-in-k-means.pdf]
> I also have a small prototype of this algorithm written in R (if anyone is 
> interested in it).
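
The split decision described above can be sketched roughly as follows. This is 
a minimal illustration, not the reporter's R prototype and not a Spark API: the 
function names (anderson_darling, project, should_split) are hypothetical, the 
two child centers are assumed to come from a 2-means run on the cluster, and 
the default critical value is an assumed placeholder for a very strict 
significance level that should be tuned. The points are projected onto the line 
joining the child centers and an Anderson-Darling normality test is applied to 
the projection.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def anderson_darling(xs):
    """Corrected Anderson-Darling statistic for normality of a 1-D sample.

    Mean and standard deviation are estimated from the sample itself.
    Assumes at least two values that are not all identical.
    """
    n = len(xs)
    mu, sigma = mean(xs), stdev(xs)
    std_normal = NormalDist()
    eps = 1e-12  # clamp CDF values so that log() never sees 0
    cdf = [
        min(max(std_normal.cdf((x - mu) / sigma), eps), 1 - eps)
        for x in sorted(xs)
    ]
    s = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - s / n
    # Small-sample correction for the case where mean and variance are estimated.
    return a2 * (1 + 4 / n - 25 / n ** 2)


def project(points, c1, c2):
    """Project each point onto the vector joining the two child centers."""
    v = [a - b for a, b in zip(c1, c2)]
    norm2 = sum(c * c for c in v)
    return [sum(p_i * v_i for p_i, v_i in zip(p, v)) / norm2 for p in points]


def should_split(points, c1, c2, critical=1.8692):
    """Return True if the projected data looks non-Gaussian (split the cluster).

    `critical` is an assumed threshold, not a value prescribed by Spark.
    """
    return anderson_darling(project(points, c1, c2)) > critical


# Synthetic data: one Gaussian blob versus two well-separated blobs.
rng = random.Random(0)
one_blob = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
two_blobs = (
    [(rng.gauss(-5, 1), rng.gauss(0, 1)) for _ in range(250)]
    + [(rng.gauss(5, 1), rng.gauss(0, 1)) for _ in range(250)]
)
print(should_split(one_blob, (-1, 0), (1, 0)))
print(should_split(two_blobs, (-5, 0), (5, 0)))
```

In a full G-means loop, each cluster that fails the test would be replaced by 
its two children and k-means re-run, repeating until no cluster is split or the 
maximum number of clusters is reached. A distributed version would also need 
to compute the projection and the test statistic per cluster without 
collecting all points on the driver.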



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

