[ 
https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150883#comment-13150883
 ] 

Jeff Eastman commented on MAHOUT-843:
-------------------------------------

I've had some time to poke around in the code and single-step through the test 
execution. It seems to work as advertised but I had a difficult time reading 
the code and I'm concerned that the implementation has some performance 
challenges. Much of this involves personal style so take it with a grain of 
salt:

- The code is very fine grain, with many small methods. Usually I'm faced with 
the opposite - huge methods that need to be decomposed - but here the smallness 
works against readability. Consider distributeVectors() which calls 
putVectorsInRespectiveClusters() which calls findClusterAndAddVector() which 
calls addVectorToItsClusterFile() which calls writeVectorToCluster(). These 
small methods are only called in a single place and could be inlined to provide 
more context for how the point is finally written.
- The ClusterOutputPostProcessor has a number of private fields which are 
initialized and used within this method chain rather than passing needed values 
in as method arguments. If I'm stopped in the debugger as I am now, it is 
challenging for me to identify where and how a particular field was initialized 
because it is not visible in the stack frames of the call chain.
- WriteVectorToCluster() creates a SequenceFile.Writer for each point, then 
appends to it syncs it and closes it. The path for the output file is passed in 
two levels. Seems to me that will thrash the GC and I'd consider passing the 
writer down instead of the path so you only open/append/sync/close one for each 
cluster.
- Minor now, but the run method has one -i argument which is immediately 
assigned to clusterOutputToBeProcessed. And the inputWriter is actually writing 
output. Seems backwards. I'd suggest adding a -o argument so callers have full 
control over where the points come from and where they get written.
- Finally, I'd also suggest adding a -ci argument so that the clusters can be 
read in the beginning. This will facilitate setting numReducers later and would 
allow you to create all the output writers at the beginning rather than lazily 
as they are encountered. 

All this aside however, it seems to work and is a good step forward. Once the 
mapreduce version gets fleshed out you will likely want to refactor the whole 
thing anyway. 
                
> Top Down Clustering
> -------------------
>
>                 Key: MAHOUT-843
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-843
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>              Labels: clustering, patch
>             Fix For: 0.6
>
>         Attachments: MAHOUT-843-patch, MAHOUT-843-patch-only-postprocessor, 
> MAHOUT-843-patch-only-postprocessor-v1, 
> MAHOUT-843-patch-only-postprocessor-v2, MAHOUT-843-patch-v1, 
> Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find 
> comparative bigger clusters. The second step is to cluster the bigger chunks 
> into meaningful clusters. This can performance while clustering big amount of 
> data. And, it also removes the dependency of providing input clusters/numbers 
> to the clustering algorithm.
> The "big" is a relative term, as well as the smaller "meaningful" terms. So, 
> the control of this "bigger" and "smaller/meaningful" clusters will be 
> controlled by the user.
> Which clustering algorithm to be used in the top level and which to use in 
> the bottom level can also be selected by the user. Initially, it can be done 
> for only one/few clustering algorithms, and later, option can be provided to 
> use all the algorithms ( which suits the case ). 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to