[ 
https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166950#comment-13166950
 ] 

Jeff Eastman commented on MAHOUT-843:
-------------------------------------

The CLI is an important interface, since it enables scripted use of Mahout. 
As Windows users don't do much command-line execution, I can see that it 
might take some study. Cygwin provides a pretty decent Unix-like environment 
and should prove usable. I will work on developing a CLI example using the 
post processor too so we can compare notes.

I agree with your assessment that your initial ClusterConfigs and 
ClusterExecutors code is now unnecessary. I think it was a useful experiment: 
it resulted in your factoring out the cluster output post processor, and the 
result is modular and clean.

In terms of our "to be implemented" clustering code I have an idea: The outlier 
pruning in MAHOUT-825 could be factored out of Canopy into another post 
processor instead of extending all of the other clustering algorithms with this 
capability. This should be low-hanging fruit for you after your top-down post 
processor work and would be a place where more sophisticated outlier rejection 
algorithms could be embedded later.
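
The post-processor idea above can be sketched roughly as follows. This is a 
minimal, hypothetical illustration in Java, not code from MAHOUT-825 or from 
Mahout: the names ClusteredPoint, prune and fartherThan are invented here, and 
the distance-threshold rule is only an assumed stand-in for whatever pruning 
criterion Canopy actually applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from Mahout or MAHOUT-825.
final class OutlierPruningSketch {

  /** A point that an upstream clustering run has already assigned to a cluster. */
  static final class ClusteredPoint {
    final int clusterId;
    final double distanceToCenter;

    ClusteredPoint(int clusterId, double distanceToCenter) {
      this.clusterId = clusterId;
      this.distanceToCenter = distanceToCenter;
    }
  }

  /** Post-processing step: drops every point the supplied rule flags as an outlier. */
  static List<ClusteredPoint> prune(List<ClusteredPoint> points,
                                    Predicate<ClusteredPoint> isOutlier) {
    List<ClusteredPoint> kept = new ArrayList<>();
    for (ClusteredPoint p : points) {
      if (!isOutlier.test(p)) {
        kept.add(p);
      }
    }
    return kept;
  }

  /** Simplest rule (an assumption): the point is too far from its assigned center. */
  static Predicate<ClusteredPoint> fartherThan(double maxDistance) {
    return p -> p.distanceToCenter > maxDistance;
  }

  public static void main(String[] args) {
    List<ClusteredPoint> clustered = Arrays.asList(
        new ClusteredPoint(0, 0.4),
        new ClusteredPoint(0, 7.5),
        new ClusteredPoint(1, 1.1));
    List<ClusteredPoint> kept = prune(clustered, fartherThan(2.0));
    System.out.println(kept.size() + " of " + clustered.size() + " points kept");
  }
}
```

Because pruning runs over already-clustered output, the rule passed to prune 
can be swapped for a more sophisticated one without touching any of the 
clustering algorithms themselves.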

We are targeting a 0.6 code freeze at the end of December and any testing 
of the clustering code you can do in the interim would be beneficial. Heading 
into 0.7, I want to return to the classification/clustering convergence which 
has not gotten many of my cycles in quite a while. Take a look at 
ClusterClassifier, ClusteringPolicy and ClusterIterator. They use a pluggable 
framework to converge all of the iterative clustering algorithms (those which 
process all the input vectors in each iteration and which write their state in 
clusters-n) with the classification APIs. Given your work with ClusterConfigs 
and ClusterExecutors you might find this interesting.
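
To make the "pluggable framework" idea concrete, here is a minimal sketch in 
Java. It is deliberately NOT the API of ClusterClassifier, ClusteringPolicy or 
ClusterIterator: the names Policy, HARD_MAX, classify and iterate are 
hypothetical, and the 1-D points and scoring function are assumptions chosen 
only to keep the example short. It shows the general shape described above: 
every iteration scores all input points against the current models, a 
swappable policy turns the scores into weights, and the models are updated.

```java
import java.util.Arrays;

// Hypothetical sketch; these are NOT Mahout's actual classes or signatures.
final class PluggableIterationSketch {

  /** Pluggable rule that turns per-cluster scores into per-cluster weights. */
  interface Policy {
    double[] select(double[] scores);
  }

  /** k-means-style policy: all weight goes to the best-scoring cluster. */
  static final Policy HARD_MAX = scores -> {
    int best = 0;
    for (int i = 1; i < scores.length; i++) {
      if (scores[i] > scores[best]) {
        best = i;
      }
    }
    double[] weights = new double[scores.length];
    weights[best] = 1.0;
    return weights;
  };

  /** Scores a 1-D point against each center; a higher score means a closer center. */
  static double[] classify(double point, double[] centers) {
    double[] scores = new double[centers.length];
    for (int i = 0; i < centers.length; i++) {
      scores[i] = 1.0 / (1.0 + Math.abs(point - centers[i]));
    }
    return scores;
  }

  /** Runs a fixed number of passes, each over all points, and returns the centers. */
  static double[] iterate(double[] points, double[] initialCenters,
                          Policy policy, int iterations) {
    double[] centers = initialCenters.clone();
    for (int iter = 0; iter < iterations; iter++) {
      double[] weightedSum = new double[centers.length];
      double[] totalWeight = new double[centers.length];
      for (double p : points) {
        double[] w = policy.select(classify(p, centers));
        for (int i = 0; i < centers.length; i++) {
          weightedSum[i] += w[i] * p;
          totalWeight[i] += w[i];
        }
      }
      // Update each model from the points it was given weight for.
      for (int i = 0; i < centers.length; i++) {
        if (totalWeight[i] > 0) {
          centers[i] = weightedSum[i] / totalWeight[i];
        }
      }
    }
    return centers;
  }

  public static void main(String[] args) {
    double[] points = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
    double[] centers = iterate(points, new double[] {0.0, 5.0}, HARD_MAX, 5);
    System.out.println(Arrays.toString(centers));
  }
}
```

Swapping HARD_MAX for a policy that returns normalized scores would give a 
soft, fuzzy-style assignment with the same loop, which is the kind of reuse a 
pluggable policy is meant to allow.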

I'm going to close this issue now and will look forward to continuing our 
conversations on other issues. 
                
> Top Down Clustering
> -------------------
>
>                 Key: MAHOUT-843
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-843
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>            Assignee: Jeff Eastman
>              Labels: clustering, patch
>             Fix For: 0.6
>
>         Attachments: MAHOUT-843-patch, MAHOUT-843-patch-only-postprocessor, 
> MAHOUT-843-patch-only-postprocessor-final, 
> MAHOUT-843-patch-only-postprocessor-v1, 
> MAHOUT-843-patch-only-postprocessor-v2, 
> MAHOUT-843-patch-only-postprocessor-v3, 
> MAHOUT-843-patch-only-postprocessor-v4, 
> MAHOUT-843-patch-only-postprocessor-v5, MAHOUT-843-patch-v1, 
> Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find 
> comparatively bigger clusters. The second step is to cluster the bigger 
> chunks into meaningful clusters. This can improve performance while 
> clustering large amounts of data. It also removes the dependency on 
> providing input clusters/numbers to the clustering algorithm.
> "Big" is a relative term, as is the smaller "meaningful". So the size of 
> these "bigger" and "smaller/meaningful" clusters will be controlled by the 
> user.
> The user can also select which clustering algorithm to use at the top level 
> and which to use at the bottom level. Initially, this can be done for only 
> one or a few clustering algorithms; later, an option can be provided to use 
> all the algorithms (those which suit the case). 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
