[ 
https://issues.apache.org/jira/browse/MAHOUT-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788037#comment-13788037
 ] 

Álvaro Pérez Alarcón commented on MAHOUT-1344:
----------------------------------------------

I agree that engaging with the community from the beginning would have produced 
better results. I didn't because this was my final year project, and I didn't 
want its completion to depend on the community rather than just on me, so I 
chose to wait until I actually had something to provide. In hindsight this was 
probably unwise, since early engagement would have made things easier for 
everyone. That's also why I used the stable releases rather than snapshots or 
development versions.

I used the old clustering framework because the current version was 0.7 when I 
started development, and when 0.8 was released I just ported the code. I 
wouldn't mind taking a look at the new clustering framework and seeing whether 
I can provide an implementation of the SOM based on it. It shouldn't be hard 
now that I'm familiar with Mahout's code. Where can I find it, though? I've 
been looking at the SVN repository and I can't find anything that looks like a 
clustering framework different from the one I used. That one has no changes 
newer than the 0.8 release, except for one class that wouldn't affect my 
patch.

I do have the project's writeup, but it's in Spanish. I don't have a tutorial, 
although I could write one. What information should it contain? Format of the 
input files, usage of the driver and command line arguments, format and names 
of the output files... anything else?

As for the algorithm itself, the SOM is a clustering algorithm whose results 
are easy to visualize graphically. Because the clusters are arranged in an 
ordered map, neighboring clusters have similar centers, so similar inputs are 
assigned to clusters in the same region of the map. This projects a 
high-dimensional input space onto a low-dimensional one (for instance, a 2D 
matrix), which lets an expert analyze big datasets with graphical tools. I 
implemented the batch version because it runs faster than the classic version, 
its behavior is deterministic for a given cluster initialization, and it is 
simpler to parallelize than the classic version.

http://www.scholarpedia.org/article/Self-organizing_map
http://en.wikipedia.org/wiki/Self_organizing_map
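
To make the batch update described above concrete, here is a minimal sketch in 
plain Python. It is not the code from the patch and it does not use any Mahout 
API: the function names, the Gaussian neighborhood, and the shrinking radius 
schedule are all assumptions chosen for illustration. Each pass computes 
neighborhood-weighted sums per chunk of input and then merges them, which is 
why the batch version splits naturally into a map step and a reduce step.

```python
import math


def bmu(weights, x):
    """Index of the unit whose center is closest to x (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))


def partial_sums(weights, grid, chunk, radius):
    """'Map' step (hypothetical name): weighted sums for one chunk of input."""
    dim = len(weights[0])
    num = [[0.0] * dim for _ in weights]
    den = [0.0] * len(weights)
    for x in chunk:
        br, bc = grid[bmu(weights, x)]
        for j, (r, c) in enumerate(grid):
            # Assumed Gaussian neighborhood over grid distance to the BMU.
            h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * radius ** 2))
            den[j] += h
            for d in range(dim):
                num[j][d] += h * x[d]
    return num, den


def batch_epoch(weights, grid, chunks, radius):
    """'Reduce' step: merge the partial sums and recompute every center."""
    dim = len(weights[0])
    num = [[0.0] * dim for _ in weights]
    den = [0.0] * len(weights)
    for chunk in chunks:
        p_num, p_den = partial_sums(weights, grid, chunk, radius)
        for j in range(len(weights)):
            den[j] += p_den[j]
            for d in range(dim):
                num[j][d] += p_num[j][d]
    # Keep the old center if no input contributed any weight to a unit.
    return [[n / den[j] for n in num[j]] if den[j] > 0.0 else list(weights[j])
            for j in range(len(weights))]


def train(weights, grid, chunks, epochs, start_radius=1.0, end_radius=0.3):
    """Run batch epochs with an (assumed) exponentially shrinking radius."""
    for t in range(epochs):
        frac = t / max(epochs - 1, 1)
        radius = start_radius * (end_radius / start_radius) ** frac
        weights = batch_epoch(weights, grid, chunks, radius)
    return weights


# Toy example: a 1x2 map trained on four one-dimensional points.
data = [[0.0], [0.1], [0.9], [1.0]]
grid = [(0, 0), (0, 1)]
init = [[0.2], [0.8]]
centers = train(init, grid, [data[:2], data[2:]], epochs=5)
```

No randomness is involved once the initial centers are fixed, so repeated runs 
from the same initialization give the same map. Splitting the input into 
different chunks only changes the order of the additions.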

This algorithm is used, among other things, in scientific research. For 
instance, one of the motivations for the project was the analysis of the 
observations that will be obtained by the ESA Gaia mission, whose purpose is 
to compile a large census of celestial bodies. A large number of those objects 
are expected to fail classification by supervised methods, so unsupervised 
classification is needed. My project supervisor works in a research group that 
is designing the analysis of those objects, and they're using the SOM 
algorithm.

I think this algorithm would fit well into Mahout, since it's a widely known 
clustering algorithm with useful results, but it's not up to me to decide that, 
of course. Should I take this discussion to the mailing list? As I said 
earlier, I'm willing to provide an implementation on the new clustering 
framework if it's within my ability. I'm also willing to stick around and help 
with any issues related to my implementation.

> Self-Organizing Map algorithm (batch version)
> ---------------------------------------------
>
>                 Key: MAHOUT-1344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1344
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Álvaro Pérez Alarcón
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1344.patch
>
>
> Good morning.
> As part of my final year project, I have implemented a new module for Apache 
> Mahout, implementing Kohonen's self-organizing map algorithm, in its batch 
> version.
> The work is already done, and I will proceed to submit a patch ASAP. It was 
> developed over Mahout 0.8.
> The patch includes unit tests and the algorithm was successfully used in a 
> Hadoop cluster to cluster two big datasets. Results can be seen in [this 
> image gallery|http://imgur.com/a/DlgRT].
> The implementation uses the generic clustering algorithms implemented in the 
> ClusterIterator class. Minor changes were made to this and other related 
> classes to support some of the features, without affecting the execution of 
> other algorithms.
> The algorithm supports convergence and the ability to resume a run at a 
> given iteration (mainly, in order to initialize KohonenBatchClusteringPolicy 
> with a given iteration number, although it also affects the names of the 
> output directories).



--
This message was sent by Atlassian JIRA
(v6.1#6144)
