[jira] [Commented] (MAHOUT-1344) Self-Organizing Map algorithm (batch version)

Isabel Drost-Fromm (JIRA) Thu, 10 Oct 2013 02:24:22 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791347#comment-13791347
 ]


Isabel Drost-Fromm commented on MAHOUT-1344:
--------------------------------------------

Can you provide some runtime statistics for your implementation? How many data 
points are you able to cluster with it? How many features are feasible? How 
many clusters are possible? I have not looked too closely at the code yet: how 
does the implementation behave when increasing the number of cores in a machine 
or the number of machines?

I didn't find a design doc in your patch - maybe I just overlooked it. What are 
the main APIs users would use? What does the command line look like? Do you have 
any user documentation? (See 
https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression and 
linked issue IDs for roughly what I mean here). Information on when to use 
your implementation and when not to use it would also be beneficial. What kind 
of data does it perform well on?

Lastly - but in my view most importantly: Given that ultimately we want to be 
useful to people who lack a deep machine learning background but want to 
integrate its features quickly into their production systems: What would be 
your argument for why SOMs are needed? Where do they outperform what is 
already available? Where are their limitations, where are their strengths? As 
we are currently looking to drastically minimise the number of alternatives but 
still be useful to downstream users, another question to answer would be whether 
there are other algorithms not yet in Mahout that outperform SOMs, so that we 
should rather look into supporting those. What's your take on that?

If it turns out that the target user base of your implementation would be rather 
limited, there's also the option of hosting your project on Apache Extras or 
Github and mentioning Apache Mahout as the basis you build upon. Personally, I 
would also be happy to list it as one extension in our docs.

> Self-Organizing Map algorithm (batch version)
> ---------------------------------------------
>
>                 Key: MAHOUT-1344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1344
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Álvaro Pérez Alarcón
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1344.patch
>
>
> Good morning.
> As part of my final year project, I have implemented a new module for Apache 
> Mahout, implementing Kohonen's self-organizing map algorithm, in its batch 
> version.
> The work is already done, and I will proceed to submit a patch ASAP. It was 
> developed over Mahout 0.8.
> The patch includes unit tests and the algorithm was successfully used in a 
> Hadoop cluster to cluster two big datasets. Results can be seen in [this 
> image gallery|http://imgur.com/a/DlgRT].
> The implementation uses the generic clustering algorithms implemented in the 
> ClusterIterator class. Minor changes were made to this and other related 
> classes to support some of the features, without affecting the execution of 
> other algorithms.
> The algorithm supports convergence and the ability to resume work at a 
> given iteration (mainly, in order to initialize KohonenBatchClusteringPolicy 
> with a given iteration number, although it also affects the names of the 
> output directories).



--
This message was sent by Atlassian JIRA
(v6.1#6144)
