[ 
https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144731#comment-13144731
 ] 

Paritosh Ranjan commented on MAHOUT-843:
----------------------------------------

Jeff, I analyzed both the CLI creation mechanism and the shell script 
mechanism. Both are achievable, and in my view there is not much difference 
between them.

The only advantage I can think of in exposing the top- and bottom-level config 
parameters in the CLI is that it remains immune to the internal implementation 
of top down clustering. This also makes it similar to the other clustering 
CLIs. If we use a shell script, we put logic in it (iterating over the top 
level clusters). If the internal implementation of the algorithm changes, the 
shell script will be affected.
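
To make the "iterate over the top level clusters" logic concrete, here is a 
rough sketch of what a driver (CLI or script) would have to do. It is not 
Mahout code: the function names, the radius parameters, and the greedy 
distance-threshold grouping are all assumptions chosen only to keep the 
example small and deterministic.

```python
from math import dist  # Euclidean distance, Python 3.8+


def threshold_cluster(points, radius):
    """Hypothetical stand-in for a real clustering algorithm.

    Greedy single pass: a point joins the first cluster whose seed
    (its first member) lies within `radius`; otherwise it starts a
    new cluster.
    """
    clusters = []  # each cluster is a list of points; cluster[0] is the seed
    for p in points:
        for cluster in clusters:
            if dist(p, cluster[0]) <= radius:
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return clusters


def top_down_cluster(points, top_radius, bottom_radius):
    """Step 1: find coarse clusters. Step 2: re-cluster each one finely."""
    hierarchy = []
    for big_cluster in threshold_cluster(points, top_radius):
        # This loop is the logic a shell script would otherwise carry.
        hierarchy.append(threshold_cluster(big_cluster, bottom_radius))
    return hierarchy


points = [(0, 0), (0, 1), (3, 0), (20, 20), (20, 21), (24, 20)]
hierarchy = top_down_cluster(points, top_radius=10, bottom_radius=2)
for i, sub_clusters in enumerate(hierarchy):
    print(f"top-level cluster {i}: {sub_clusters}")
```

If this loop lives inside the CLI, only the two levels' parameters are 
exposed to the user. If it lives in a shell script, the script has to change 
whenever the way top level clusters are produced or stored changes.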

The advantage of the shell script is that it makes CLI creation easier, 
provided the user writes the shell script. Otherwise, the shell script would 
also need arguments to define the top- and bottom-level clustering parameters. 
If the user writes the shell script, we make life a little tougher for him. 
(Question: who will provide the shell script, Mahout or the user?)

Another advantage of exposing the post processor as a CLI is that it enables 
Mahout to group points of similar clusters together after clustering, which is 
not available in Mahout yet.

Based on this analysis, which approach do you suggest?
                
> Top Down Clustering
> -------------------
>
>                 Key: MAHOUT-843
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-843
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>              Labels: clustering, patch
>             Fix For: 0.6
>
>         Attachments: MAHOUT-843-patch, Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find 
> comparatively bigger clusters. The second step is to cluster the bigger 
> chunks into meaningful clusters. This can improve performance when clustering 
> large amounts of data. It also removes the dependency on providing input 
> clusters/numbers to the clustering algorithm.
> "Big" is a relative term, as is the smaller "meaningful" one. So the size of 
> the "bigger" and the "smaller/meaningful" clusters will be controlled by the 
> user.
> The user can also select which clustering algorithm to use at the top level 
> and which to use at the bottom level. Initially, this can be done for only 
> one or a few clustering algorithms; later, an option can be provided to use 
> all the algorithms (that suit the case).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
