[
https://issues.apache.org/jira/browse/MAHOUT-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800177#comment-13800177
]
Álvaro Pérez Alarcón commented on MAHOUT-1344:
----------------------------------------------
I don't have statistics on the algorithm, since my work focused more on
obtaining results than on benchmarking it. I also don't have complete user
documentation yet, but I could write it if the algorithm is accepted into the
project (otherwise, I'll look into hosting it somewhere else as you suggest,
and of course include documentation with it). As a quick outline of the usage:
from code, the user would call the main() or run() methods of
KohonenBatchDriver. The command line accepts the following options:
--input <path> (-i)
bq. The input vectors
--output <path> (-o)
bq. The output path
--clusters <path> (-c)
bq. The path of the initialization clusters. If an initialization was
requested, it is generated and written here; otherwise the initial clusters
are read from this path.
--maxIter <int> (-x)
bq. Maximum iterations
--overwrite (-ow)
bq. Delete the output folder before starting the algorithm
--method sequential|mapreduce (-m)
bq. The clustering method
--clustering (-cl)
bq. If present, run ClusterClassificationDriver on the input data using the
trained clusters
--outlierThreshold <double> (-outlierThreshold)
bq. If --clustering was specified, the outlier threshold
--convergenceDelta <double> (-cd)
bq. Convergence delta. If not specified, no convergence checks are made.
--initialization rand|mean (-init)
bq. If present, a cluster initialization will be generated from the input
data, by selecting random input vectors, or adding a small random error to the
mean of all input vectors, respectively.
<Topology>
{quote}
More than one topology is available, but exactly one must be selected, and all
the options belonging to the chosen topology must be specified (otherwise the
driver will fail to parse the command line). There are two topologies
available, with the following options:
Matrix topology: --matrixTopology --matrixTopologyWidth <int>
--matrixTopologyHeight <int>
Torus topology: --torusTopology --torusTopologyWidth <int>
--torusTopologyHeight <int>
{quote}
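To illustrate how the two topologies differ, here is a minimal sketch of a grid distance with and without wrap-around. It is not taken from the patch: the function name grid_distance, the (column, row) coordinates, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math


def grid_distance(a, b, width, height, torus=False):
    """Euclidean distance between two grid cells given as (column, row).

    Hypothetical helper: with torus=True each axis wraps around, so the
    left and right edges (and the top and bottom edges) are neighbors.
    """
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    if torus:
        # On a torus, going "around the edge" may be the shorter path.
        dx = min(dx, width - dx)
        dy = min(dy, height - dy)
    return math.hypot(dx, dy)


# Opposite corners of a 30x30 network.
print(grid_distance((0, 0), (29, 29), 30, 30))              # about 41.0 on a matrix
print(grid_distance((0, 0), (29, 29), 30, 30, torus=True))  # about 1.41 on a torus
```

On a matrix the corner cells have fewer neighbors than the center cells; on a torus every cell has the same number of neighbors, which is the usual reason to pick it.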
<Neighborhood function>
{quote}
Analogously to the topology, the neighborhood function is selected using one of
two sets of arguments:
Gauss neighborhood: --gaussNeighborhood --gaussNeighborhoodDecay <double>
--gaussNeighborhoodStartRadius <int>
Rectangular neighborhood: --rectangularNeighborhood
--rectangularNeighborhoodDecay <double> --rectangularNeighborhoodStartRadius
<int>
{quote}
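As a rough sketch of what the two neighborhood functions could look like, the code below weights a cluster by its grid distance to the winning cluster. The formulas are assumptions for illustration, not the patch's code: in particular, the geometric schedule radius = startRadius * decay^iteration is only one plausible reading of the Decay and StartRadius options.

```python
import math


def radius_at(iteration, start_radius, decay):
    """Assumed schedule: the radius shrinks geometrically per iteration."""
    return start_radius * (decay ** iteration)


def gauss_weight(distance, radius):
    """Gaussian neighborhood: weight falls off smoothly with distance."""
    if radius <= 0:
        # Degenerate case: only the winning cluster itself is updated.
        return 1.0 if distance == 0 else 0.0
    return math.exp(-(distance ** 2) / (2.0 * radius ** 2))


def rectangular_weight(distance, radius):
    """Rectangular neighborhood: full weight inside the radius, none outside."""
    return 1.0 if distance <= radius else 0.0


# Weight of a cluster 3 cells away from the winner, over a few iterations.
for iteration in range(4):
    r = radius_at(iteration, start_radius=10, decay=0.5)
    print(iteration, r, gauss_weight(3, r), rectangular_weight(3, r))
```

The rectangular function switches abruptly from 1 to 0 as the radius drops below the distance, while the Gaussian one decreases gradually.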
--resume <int>
bq. Start at the nth iteration.
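The "exactly one topology, with all of its options" rule can be sketched as a small check over an argument list. The option names are the ones listed above; the check_topology function and the sample paths are hypothetical stand-ins, not the driver's actual parser.

```python
# Option names come from the list above; everything else is illustrative.
TOPOLOGIES = {
    "--matrixTopology": ("--matrixTopologyWidth", "--matrixTopologyHeight"),
    "--torusTopology": ("--torusTopologyWidth", "--torusTopologyHeight"),
}


def check_topology(args):
    """Return the selected topology flag, or raise ValueError."""
    selected = [flag for flag in TOPOLOGIES if flag in args]
    if len(selected) != 1:
        raise ValueError("select exactly one topology, got %d" % len(selected))
    missing = [opt for opt in TOPOLOGIES[selected[0]] if opt not in args]
    if missing:
        raise ValueError("missing options: " + ", ".join(missing))
    return selected[0]


# A sample argument list for a 30x30 torus network (paths are placeholders).
args = ["--input", "vectors", "--output", "som", "--maxIter", "20",
        "--torusTopology", "--torusTopologyWidth", "30",
        "--torusTopologyHeight", "30"]
print(check_topology(args))
```

The same pattern would apply to the two neighborhood option sets.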
I've also uploaded a UML class diagram of the module, in case that helps:
http://i.imgur.com/iaz3lHh.gif .
Of the features included in the patch, there are two that could be scrapped: the
possibility of resuming an interrupted job, which I added only because I was
having problems with the cluster I was using, and the rectangular neighborhood,
as it's unlikely that one would choose it over the Gauss neighborhood.
As for the advantages of the algorithm, the strengths of the SOM lie in the
kind of results it produces, as they are easy to visualize and interpret.
It operates well with any sort of numerical, non-categorical data. In
comparison to other algorithms in Mahout, it's slower, since it's usually used
with more clusters, and in the first iterations it updates a large number of
clusters for each input vector (though this number gradually decreases to the
point that only one cluster is updated per input vector). For instance, a 30x30
matrix network (which is the one I used the most) implies 900 clusters.
Therefore, I wouldn't advise using the SOM unless its results are actually
going to be visualized.
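To give a feel for the cost, this sketch counts how many clusters of a 30x30 matrix network fall inside the neighborhood of a winning cluster as the radius shrinks. It assumes a rectangular neighborhood measured in whole grid steps (Chebyshev distance); the function name nodes_within and the sample radii are made up for the example and are not measurements from the patch.

```python
def nodes_within(center, radius, width, height):
    """Count cells of a width x height matrix grid within `radius` grid
    steps (Chebyshev distance) of `center`, clipping at the borders."""
    cx, cy = center
    cols = min(width - 1, cx + radius) - max(0, cx - radius) + 1
    rows = min(height - 1, cy + radius) - max(0, cy - radius) + 1
    return cols * rows


# Clusters updated per input vector for a winner near the center of the grid.
for radius in (10, 5, 2, 0):
    print(radius, nodes_within((15, 15), radius, 30, 30))
# prints:
# 10 441
# 5 121
# 2 25
# 0 1
```

With a radius of 10, almost half of the 900 clusters are touched for every input vector; once the radius reaches 0, only the winner is updated.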
I don't know of any other algorithms that produce results similar to those of a
SOM and outperform it. There are, however, extensions of the SOM algorithm that
are able to work with categorical values, such as the NCSOM. It's likely
possible to implement these extensions on top of my implementation, changing
the way the data is represented and how the pdfs are calculated.
> Self-Organizing Map algorithm (batch version)
> ---------------------------------------------
>
> Key: MAHOUT-1344
> URL: https://issues.apache.org/jira/browse/MAHOUT-1344
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.8
> Reporter: Álvaro Pérez Alarcón
> Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1344.patch
>
>
> Good morning.
> As part of my final year project, I have implemented a new module for Apache
> Mahout, implementing Kohonen's self-organizing map algorithm, in its batch
> version.
> The work is already done, and I will proceed to submit a patch ASAP. It was
> developed over Mahout 0.8.
> The patch includes unit tests and the algorithm was successfully used in a
> Hadoop cluster to cluster two big datasets. Results can be seen in [this
> image gallery|http://imgur.com/a/DlgRT].
> The implementation uses the generic clustering algorithms implemented in the
> ClusterIterator class. Minor changes were made to this and other related
> classes to support some of the features, without affecting the execution of
> other algorithms.
> The algorithm supports convergence checks and the ability to resume a job at a
> given iteration (mainly in order to initialize KohonenBatchClusteringPolicy
> with a given iteration number, although it also affects the names of the
> output directories).