[ https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150883#comment-13150883 ]
Jeff Eastman commented on MAHOUT-843: ------------------------------------- I've had some time to poke around in the code and single-step through the test execution. It seems to work as advertised but I had a difficult time reading the code and I'm concerned that the implementation has some performance challenges. Much of this involves personal style so take it with a grain of salt: - The code is very fine grain, with many small methods. Usually I'm faced with the opposite - huge methods that need to be decomposed - but here the smallness works against readability. Consider distributeVectors() which calls putVectorsInRespectiveClusters() which calls findClusterAndAddVector() which calls addVectorToItsClusterFile() which calls writeVectorToCluster(). These small methods are only called in a single place and could be inlined to provide more context for how the point is finally written. - The ClusterOutputPostProcessor has a number of private fields which are initialized and used within this method chain rather than passing needed values in as method arguments. If I'm stopped in the debugger as I am now, it is challenging for me to identify where and how a particular field was initialized because it is not visible in the stack frames of the call chain. - WriteVectorToCluster() creates a SequenceFile.Writer for each point, then appends to it syncs it and closes it. The path for the output file is passed in two levels. Seems to me that will thrash the GC and I'd consider passing the writer down instead of the path so you only open/append/sync/close one for each cluster. - Minor now, but the run method has one -i argument which is immediately assigned to clusterOutputToBeProcessed. And the inputWriter is actually writing output. Seems backwards. I'd suggest adding a -o argument so callers have full control over where the points come from and where they get written. - Finally, I'd also suggest adding a -ci argument so that the clusters can be read in the beginning. This will facilitate setting numReducers later and would allow you to create all the output writers at the beginning rather than lazily as they are encountered. All this aside however, it seems to work and is a good step forward. Once the mapreduce version gets fleshed out you will likely want to refactor the whole thing anyway. > Top Down Clustering > ------------------- > > Key: MAHOUT-843 > URL: https://issues.apache.org/jira/browse/MAHOUT-843 > Project: Mahout > Issue Type: New Feature > Components: Clustering > Affects Versions: 0.6 > Reporter: Paritosh Ranjan > Labels: clustering, patch > Fix For: 0.6 > > Attachments: MAHOUT-843-patch, MAHOUT-843-patch-only-postprocessor, > MAHOUT-843-patch-only-postprocessor-v1, > MAHOUT-843-patch-only-postprocessor-v2, MAHOUT-843-patch-v1, > Top-Down-Clustering-patch > > > Top Down Clustering works in multiple steps. The first step is to find > comparative bigger clusters. The second step is to cluster the bigger chunks > into meaningful clusters. This can performance while clustering big amount of > data. And, it also removes the dependency of providing input clusters/numbers > to the clustering algorithm. > The "big" is a relative term, as well as the smaller "meaningful" terms. So, > the control of this "bigger" and "smaller/meaningful" clusters will be > controlled by the user. > Which clustering algorithm to be used in the top level and which to use in > the bottom level can also be selected by the user. Initially, it can be done > for only one/few clustering algorithms, and later, option can be provided to > use all the algorithms ( which suits the case ). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira