[ 
https://issues.apache.org/jira/browse/MAHOUT-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800177#comment-13800177
 ] 

Álvaro Pérez Alarcón commented on MAHOUT-1344:
----------------------------------------------

I don't have statistics on the algorithm, since my work focused more on 
obtaining results than on benchmarking. I also don't have complete user 
documentation yet, but I could write it if the algorithm is deemed a fit for 
the project (otherwise, I'll look into hosting it somewhere else as you 
suggest, and of course include documentation with it). As a quick outline of 
the usage: to run the algorithm from code, the user would call the main() or 
run() methods of KohonenBatchDriver. The command line accepts the following 
options:

--input <path> (-i)
bq. The input vectors

--output <path> (-o)
bq. The output path

--clusters <path> (-c)
bq. The path of the initialization clusters. If an initialization was 
requested, the generated clusters are written there; otherwise the initial 
clusters are read from it.

--maxIter <int> (-x)
bq. Maximum iterations

--overwrite (-ow)
bq. Delete the output folder before starting the algorithm

--method sequential|mapreduce (-m)
bq. The clustering method

--clustering (-cl)
bq. If present, run ClusterClassificationDriver on the input data using the 
trained clusters

--outlierThreshold <double> (-outlierThreshold)
bq. If --clustering was specified, the outlier threshold

--convergenceDelta <double> (-cd)
bq. Convergence delta. If not specified, no convergence checks are made.

--initialization rand|mean (-init)
bq. If present, a cluster initialization will be generated from the input 
data, either by selecting random input vectors (rand) or by adding a small 
random error to the mean of all input vectors (mean).

<Topology>
{quote}
The topology options are structured so that several topologies can be offered 
but only one can be selected, and all the options related to the chosen 
topology must be specified (otherwise the driver will fail to parse the 
command line). There are two topologies available, with the following options:
Matrix topology: --matrixTopology --matrixTopologyWidth <int> 
--matrixTopologyHeight <int> 
Torus topology: --torusTopology --torusTopologyWidth <int> 
--torusTopologyHeight <int>
{quote}

<Neighborhood function>
{quote}
Analogous to the topology, the neighborhood function is selected using one of 
two sets of arguments:
Gauss neighborhood: --gaussNeighborhood --gaussNeighborhoodDecay <double> 
--gaussNeighborhoodStartRadius <int>
Rectangular neighborhood: --rectangularNeighborhood 
--rectangularNeighborhoodDecay <double> --rectangularNeighborhoodStartRadius 
<int>
{quote}

--resume <int>
bq. Start at the nth iteration.
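
As a rough illustration of the options above, the following sketch assembles an argument list for a 30x30 matrix topology with a Gaussian neighborhood. The paths and all numeric values are placeholders made up for the example, not recommended settings, and the actual call to KohonenBatchDriver.main() is left as a comment because it requires the patch on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

class KohonenArgsExample {

  // Builds an argument list using only the options listed in this comment.
  // Every path and number below is a hypothetical placeholder.
  static String[] buildArgs() {
    List<String> args = new ArrayList<>();
    args.add("--input");          args.add("/data/som/vectors");
    args.add("--output");         args.add("/data/som/output");
    args.add("--clusters");       args.add("/data/som/initial-clusters");
    args.add("--initialization"); args.add("rand");
    args.add("--maxIter");        args.add("20");
    args.add("--method");         args.add("mapreduce");
    args.add("--overwrite");

    // Topology: exactly one set, with all of its options present.
    args.add("--matrixTopology");
    args.add("--matrixTopologyWidth");  args.add("30");
    args.add("--matrixTopologyHeight"); args.add("30");

    // Neighborhood function: again one complete set.
    args.add("--gaussNeighborhood");
    args.add("--gaussNeighborhoodDecay");       args.add("0.9");
    args.add("--gaussNeighborhoodStartRadius"); args.add("15");

    return args.toArray(new String[0]);
  }

  public static void main(String[] unused) {
    String[] args = buildArgs();
    System.out.println(String.join(" ", args));
    // With the MAHOUT-1344 patch applied, the arguments would be handed over like this:
    // KohonenBatchDriver.main(args);
  }
}
```

Leaving out any option of the chosen topology or neighborhood set (for example the width) should make the driver reject the command line, as described above.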

I've also uploaded a UML class diagram of the module, in case that helps: 
http://i.imgur.com/iaz3lHh.gif .
Of the features included in the patch, there are two that could be scrapped: 
the possibility of resuming an interrupted job, which I added only because I 
was having problems with the cluster I was using, and the rectangular 
neighborhood, as it's unlikely that one would choose it over the Gaussian 
neighborhood.
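
To make the difference between the two neighborhoods concrete, here is a minimal sketch using the textbook forms of both functions. The exact formulas in the patch are not spelled out in this comment, so the class and method names, as well as the formulas themselves, are assumptions for illustration only.

```java
class NeighborhoodSketch {

  // Textbook Gaussian neighborhood: the weight falls off smoothly with
  // the grid distance to the winning unit. (Assumed form, not taken from the patch.)
  static double gaussian(double distance, double radius) {
    return Math.exp(-(distance * distance) / (2.0 * radius * radius));
  }

  // Textbook rectangular ("bubble") neighborhood: full weight inside the
  // radius, no weight outside. (Assumed form, not taken from the patch.)
  static double rectangular(double distance, double radius) {
    return distance <= radius ? 1.0 : 0.0;
  }

  public static void main(String[] args) {
    double radius = 3.0; // hypothetical radius
    for (int d = 0; d <= 6; d++) {
      System.out.printf("distance=%d gaussian=%.3f rectangular=%.1f%n",
          d, gaussian(d, radius), rectangular(d, radius));
    }
  }
}
```

The rectangular function treats every unit inside the radius identically and cuts off abruptly at the edge, while the Gaussian weights nearby units more than distant ones, which is the usual reason to prefer it.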

As for the advantages of the algorithm, the strengths of the SOM lie in the 
kind of results it produces, as they are easy to visualize and interpret. 
It operates well with any sort of numerical, non-categorical data. Compared 
to other algorithms in Mahout, it's slower, since it's usually used 
with more clusters, and in the first iterations it updates a large number of 
clusters for each input vector (though this number gradually decreases to the 
point that only one cluster is updated per input vector). For instance, a 30x30 
matrix network (which is the one I used the most) implies 900 clusters. 
Therefore, it wouldn't be advisable to use the SOM when the visualization 
isn't going to be exploited.
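
The cost argument can be sketched by counting how many units of a 30x30 network fall inside the neighborhood radius as it shrinks. This is only an illustration: the schedule radius = startRadius * decay^iteration, the Euclidean grid distance, and the values 15 and 0.7 are assumptions, not the patch's actual behavior.

```java
class UpdateCountSketch {

  // Counts the units whose grid distance from the winning unit (wx, wy) is
  // at most radius. With wrap == true the grid is treated as a torus
  // (opposite edges joined), otherwise as a plain matrix.
  static int unitsWithinRadius(int width, int height, int wx, int wy,
                               double radius, boolean wrap) {
    int count = 0;
    for (int x = 0; x < width; x++) {
      for (int y = 0; y < height; y++) {
        int dx = Math.abs(x - wx);
        int dy = Math.abs(y - wy);
        if (wrap) {
          // On a torus each axis takes the shorter way around.
          dx = Math.min(dx, width - dx);
          dy = Math.min(dy, height - dy);
        }
        if (dx * dx + dy * dy <= radius * radius) {
          count++;
        }
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int width = 30;
    int height = 30;
    double startRadius = 15.0; // hypothetical start radius
    double decay = 0.7;        // hypothetical decay factor

    for (int iteration = 0; iteration < 12; iteration++) {
      // Assumed schedule: geometric shrinking of the radius.
      double radius = startRadius * Math.pow(decay, iteration);
      int center = unitsWithinRadius(width, height, 15, 15, radius, false);
      int cornerMatrix = unitsWithinRadius(width, height, 0, 0, radius, false);
      int cornerTorus = unitsWithinRadius(width, height, 0, 0, radius, true);
      System.out.printf("iter=%d radius=%.2f center=%d cornerMatrix=%d cornerTorus=%d%n",
          iteration, radius, center, cornerMatrix, cornerTorus);
    }
  }
}
```

Once the radius drops below 1, only the winning unit itself is counted, which matches the point above that eventually a single cluster is updated per input vector. The corner columns also show why a torus avoids edge effects: a corner unit on the matrix has fewer neighbors than the same unit on the torus.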

I don't know of any other algorithms that produce results similar to those of 
a SOM and outperform it. There are, however, extensions of the SOM algorithm 
that are able to work with categorical values, such as the NCSOM. It's 
probably possible to implement these extensions on top of my implementation 
by changing the way the data is represented and how the pdfs are calculated.

> Self-Organizing Map algorithm (batch version)
> ---------------------------------------------
>
>                 Key: MAHOUT-1344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1344
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Álvaro Pérez Alarcón
>            Priority: Minor
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1344.patch
>
>
> Good morning.
> As part of my final year project, I have implemented a new module for Apache 
> Mahout, implementing Kohonen's self-organizing map algorithm, in its batch 
> version.
> The work is already done, and I will proceed to submit a patch ASAP. It was 
> developed over Mahout 0.8.
> The patch includes unit tests and the algorithm was successfully used in a 
> Hadoop cluster to cluster two big datasets. Results can be seen in [this 
> image gallery|http://imgur.com/a/DlgRT].
> The implementation uses the generic clustering algorithms implemented in the 
> ClusterIterator class. Minor changes were made to this and other related 
> classes to support some of the features, without affecting the execution of 
> other algorithms.
> The algorithm supports convergence and the ability to resume a job at a 
> given iteration (mainly, in order to initialize KohonenBatchClusteringPolicy 
> with a given iteration number, although it also affects the names of the 
> output directories).



--
This message was sent by Atlassian JIRA
(v6.1#6144)
