[ https://issues.apache.org/jira/browse/MAHOUT-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172914#comment-13172914 ]
Paritosh Ranjan commented on MAHOUT-930:
----------------------------------------

Created MAHOUT-933 to implement a MapReduce version of ClusterIterator.

> Refactor Vector Classification out of Clustering - Make Classification abstract
> -------------------------------------------------------------------------------
>
>                 Key: MAHOUT-930
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-930
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>             Fix For: 0.7
>
>
> Right now, each clustering algorithm has its own runClustering ( -cp )
> implementation which produces clusteredPoints. The current design lacks:
> 1) Extensibility - there is no place to plug in new features, such as outlier
> removal during classification.
> 2) Uniformity in design - new algorithms have no pattern to follow.
> 3) Abstraction - clusterData should only be concerned with classifying vectors,
> i.e. assigning vectors to clusters. Currently it lacks a bit of
> abstraction. It should not care about how to classify. That should be the
> work of a separate entity, which can have features like outlier removal.
> The new implementation would factor out and implement an independent entity
> to perform the classification step independently of the various clustering
> implementations. The new design would start with ClusterClassifier,
> ClusteringPolicy and ClusterIterator, whose experimental versions are
> already committed. The currently committed version seems to work for
> all the iterative clustering algorithms.
> The ClusterClassifier provides the probability of any vector belonging to
> each of the available clusters. These probabilities are converted into
> weights by ClusteringPolicy implementations, one for each clustering
> algorithm. This is the place where the outlier removal implementation can
> be plugged in.
> In future, different implementations of ClusteringPolicy can be provided
> (configured) for different types of classification.
> The ClusterClassifier also makes it possible to train the existing
> classifiers (clusters) on the input. This is the place where clustering
> and classification will converge.
> The execution is done by a ClusterIterator for now, which runs a clustering
> policy on the input and tries to classify the vectors into clusters.
> It can simultaneously train the classifiers, as it can run for a given
> number of iterations, and each iteration would improve the quality of the
> classifiers.
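
As a rough illustration of the classifier / policy / iterator split described in the issue, here is a minimal, self-contained Java sketch. It is not the committed Mahout code: every class name, method signature and the 1-D distance-based scoring below are assumptions made up for this example. It only shows the shape of the idea: a classifier yields per-cluster probabilities, a policy turns them into weights (which is where outlier removal can be plugged in), and an iterator loop classifies and retrains for a fixed number of iterations.

```java
import java.util.Arrays;

/** Hypothetical sketch only; names and signatures are NOT Mahout's API. */
public class ClusteringSketch {

    /** Converts per-cluster probabilities into assignment weights. */
    interface Policy {
        double[] select(double[] probabilities);
    }

    /** Hard assignment: all weight goes to the most probable cluster. */
    static final class MostLikelyPolicy implements Policy {
        public double[] select(double[] probabilities) {
            int best = 0;
            for (int i = 1; i < probabilities.length; i++) {
                if (probabilities[i] > probabilities[best]) {
                    best = i;
                }
            }
            double[] weights = new double[probabilities.length];
            weights[best] = 1.0;
            return weights;
        }
    }

    /** Wraps another policy; vectors that no cluster claims strongly get zero weight. */
    static final class OutlierRemovingPolicy implements Policy {
        private final Policy delegate;
        private final double threshold; // assumed cut-off on the best probability

        OutlierRemovingPolicy(Policy delegate, double threshold) {
            this.delegate = delegate;
            this.threshold = threshold;
        }

        public double[] select(double[] probabilities) {
            double max = Arrays.stream(probabilities).max().orElse(0.0);
            if (max < threshold) {
                return new double[probabilities.length]; // outlier: all weights zero
            }
            return delegate.select(probabilities);
        }
    }

    /** Toy classifier over 1-D cluster centers (an assumption for brevity). */
    static final class Classifier {
        final double[] centers;

        Classifier(double... centers) {
            this.centers = centers.clone();
        }

        /** Returns scores 1 / (1 + distance), normalized to sum to 1. */
        double[] classify(double x) {
            double[] p = new double[centers.length];
            double sum = 0.0;
            for (int i = 0; i < centers.length; i++) {
                p[i] = 1.0 / (1.0 + Math.abs(x - centers[i]));
                sum += p[i];
            }
            for (int i = 0; i < p.length; i++) {
                p[i] /= sum;
            }
            return p;
        }
    }

    /** Each iteration classifies every point, then retrains the centers from the weights. */
    static Classifier iterate(double[] points, Classifier prior, Policy policy, int iterations) {
        Classifier current = prior;
        for (int it = 0; it < iterations; it++) {
            double[] weightedSum = new double[current.centers.length];
            double[] totalWeight = new double[current.centers.length];
            for (double x : points) {
                double[] w = policy.select(current.classify(x));
                for (int i = 0; i < w.length; i++) {
                    weightedSum[i] += w[i] * x;
                    totalWeight[i] += w[i];
                }
            }
            double[] next = current.centers.clone();
            for (int i = 0; i < next.length; i++) {
                if (totalWeight[i] > 0) { // keep the old center if nothing was assigned
                    next[i] = weightedSum[i] / totalWeight[i];
                }
            }
            current = new Classifier(next);
        }
        return current;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.2, 8.8, 5.0};
        Policy policy = new OutlierRemovingPolicy(new MostLikelyPolicy(), 0.6);
        Classifier trained = iterate(points, new Classifier(0.0, 10.0), policy, 5);
        System.out.println(Arrays.toString(trained.centers));
    }
}
```

In this sketch the outlier handling lives entirely in the policy wrapper, so the classifier and the iteration loop stay unchanged when a different policy is configured. The point 5.0 sits halfway between the two centers, never reaches the assumed 0.6 threshold, and therefore does not pull either center toward it.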