[
https://issues.apache.org/jira/browse/MAHOUT-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788037#comment-13788037
]
Álvaro Pérez Alarcón commented on MAHOUT-1344:
----------------------------------------------
I agree that engaging with the community from the beginning would have produced
better results. I didn't because, since it was a final project, engaging early
would have made it depend on the community rather than just on me, so I chose
to wait until I actually had something to provide. This was probably unwise,
since engaging earlier would have been easier for everyone. That's also why I
used the stable releases rather than snapshots or development versions.
I used the old clustering framework because when I started developing the
algorithm the current version was 0.7, and when 0.8 was released I just ported
the code. I wouldn't mind taking a look at the new clustering framework to see
whether I can provide an implementation of the SOM based on it. It shouldn't be
hard now that I'm familiar with Mahout's code. Where can I find it, though?
I've been looking through SVN and can't find anything that looks like a
clustering framework different from the one I used, and that one has no changes
newer than the 0.8 release except for one class, which wouldn't affect my
patch.
I do have the project's writeup, but it's in Spanish. I don't have a tutorial,
although I could write one. What information should it contain? Format of the
input files, usage of the driver and command line arguments, format and names
of the output files... anything else?
As for the algorithm itself, the SOM is a clustering algorithm whose results
are easy to visualize. Because the clusters are arranged on a map with a
neighborhood relation, neighboring clusters have similar centers, so similar
input data ends up in clusters from the same region of the map. This projects a
high-dimensional input space onto a low-dimensional one (for instance, a 2D
grid), so big datasets can be analyzed easily by an expert using graphical
tools. The batch version was implemented because it is faster than the classic
(online) version, its behavior is deterministic for a given cluster
initialization, and it is simpler to parallelize than the classic version.
http://www.scholarpedia.org/article/Self-organizing_map
http://en.wikipedia.org/wiki/Self_organizing_map
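For illustration, here is a minimal sketch of what one form of the batch update
can look like. It is written in plain Python rather than Mahout's Java and is
not the code in the attached patch. The Gaussian neighborhood, the linearly
shrinking radius, and the names batch_som, grid and sigma0 are all assumptions
made for this example.

```python
import math


def batch_som(data, centers, grid, iterations=10, sigma0=1.0):
    """Hypothetical batch SOM sketch.

    data    -- list of input vectors (lists of floats)
    centers -- initial cluster centers, one per map cell
    grid    -- (row, col) map coordinates, one per center
    """
    dim = len(data[0])
    for t in range(iterations):
        # Assumed schedule: shrink the neighborhood radius linearly.
        sigma = max(sigma0 * (1.0 - t / iterations), 0.1)
        # Per-cell accumulators. In a parallel setting each worker could
        # fill its own copy and the copies could then be summed.
        num = [[0.0] * dim for _ in centers]
        den = [0.0] * len(centers)
        for x in data:
            # Best matching unit: the center closest to x.
            bmu = min(
                range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])),
            )
            for j, (row, col) in enumerate(grid):
                d2 = (row - grid[bmu][0]) ** 2 + (col - grid[bmu][1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                den[j] += h
                for k in range(dim):
                    num[j][k] += h * x[k]
        # Each center becomes the neighborhood-weighted mean of all inputs.
        centers = [
            [n / den[j] for n in num[j]] if den[j] > 0 else centers[j]
            for j in range(len(centers))
        ]
    return centers
```

Since all centers are recomputed together from sums over the whole dataset,
there is no random presentation order involved, which is why the result depends
only on the initialization. The per-cell sums are also what makes it simple to
split the work across machines.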
This algorithm is used, among other things, in scientific research. For
instance, one of the motivations for the project was the analysis of the
observations that will be obtained by the ESA Gaia mission, whose purpose is to
compile a large census of celestial bodies. A large number of them are expected
to fail classification using supervised methods, so unsupervised classification
is needed. My project supervisor works in a research group that works on the
design of the analysis of those objects, and they're using the SOM algorithm.
I think this algorithm would fit well into Mahout, since it's a widely known
clustering algorithm with useful results, but it's not up to me to decide that,
of course. Should I take this discussion to the mailing list? As I said
earlier, I'm willing to provide an implementation for the new clustering
framework if it's within my ability. I'm also willing to stick around and help
with any issues related to my implementation.
> Self-Organizing Map algorithm (batch version)
> ---------------------------------------------
>
> Key: MAHOUT-1344
> URL: https://issues.apache.org/jira/browse/MAHOUT-1344
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.8
> Reporter: Álvaro Pérez Alarcón
> Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1344.patch
>
>
> Good morning.
> As part of my final year project, I have implemented a new module for Apache
> Mahout, implementing Kohonen's self-organizing map algorithm, in its batch
> version.
> The work is already done, and I will proceed to submit a patch ASAP. It was
> developed against Mahout 0.8.
> The patch includes unit tests, and the algorithm was successfully used on a
> Hadoop cluster to cluster two big datasets. Results can be seen in [this
> image gallery|http://imgur.com/a/DlgRT].
> The implementation uses the generic clustering algorithms implemented in the
> ClusterIterator class. Minor changes were made to this and other related
> classes to support some of the features, without affecting the execution of
> other algorithms.
> The algorithm supports convergence and the ability to resume a run at a
> given iteration (mainly in order to initialize KohonenBatchClusteringPolicy
> with a given iteration number, although it also affects the names of the
> output directories).
--
This message was sent by Atlassian JIRA
(v6.1#6144)