[
https://issues.apache.org/jira/browse/MAHOUT-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791347#comment-13791347
]
Isabel Drost-Fromm commented on MAHOUT-1344:
--------------------------------------------
Can you provide some runtime statistics for your implementation? How many data
points are you able to cluster with it? How many features are feasible? How
many clusters are possible? Not having looked too closely at the code - how
does the implementation behave when increasing the number of cores in a machine
or the number of machines?
I didn't find a design doc in your patch - maybe I just overlooked it. What are
the main APIs users would use? What does the command line look like? Do you have
any user documentation? (See
https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression and
linked issue IDs for roughly what I mean here). Also information on when to use
your implementation and when not to use it would be beneficial. What kind of
data does it perform well on?
Lastly, but in my view most importantly: given that ultimately we want to be
useful to people who lack a deep machine learning background but want to
integrate its features quickly into their production systems, what would be
your argument for why SOMs are needed? Where do they perform better than what
is already available? Where are their limitations, where are their strengths? As
we are currently looking to drastically reduce the number of alternatives while
still being useful to downstream users, another question to answer is whether
there are other algorithms not yet in Mahout that outperform SOMs, in which
case we would do better to look into supporting those. What's your take on that?
If it turns out that the target audience of your implementation is rather
limited, there's also the option of hosting your project on Apache Extras or
GitHub and mentioning Apache Mahout as the basis you build upon. Personally, I
would also be happy to list it as an extension in our docs.
> Self-Organizing Map algorithm (batch version)
> ---------------------------------------------
>
> Key: MAHOUT-1344
> URL: https://issues.apache.org/jira/browse/MAHOUT-1344
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.8
> Reporter: Álvaro Pérez Alarcón
> Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1344.patch
>
>
> Good morning.
> As part of my final year project, I have implemented a new module for Apache
> Mahout, implementing Kohonen's self-organizing map algorithm, in its batch
> version.
> The work is already done, and I will proceed to submit a patch ASAP. It was
> developed against Mahout 0.8.
> The patch includes unit tests and the algorithm was successfully used in a
> Hadoop cluster to cluster two big datasets. Results can be seen in [this
> image gallery|http://imgur.com/a/DlgRT].
> The implementation uses the generic clustering algorithms implemented in the
> ClusterIterator class. Minor changes were made to this and other related
> classes to support some of the features, without affecting the execution of
> other algorithms.
> The algorithm supports convergence and the ability to resume a run at a
> given iteration (mainly, in order to initialize KohonenBatchClusteringPolicy
> with a given iteration number, although it also affects the names of the
> output directories).