Re: [jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Jeff Eastman Thu, 15 Mar 2012 13:07:05 -0700

+1 Paritosh, this is exactly what I envisioned. And I also like your idea of first converting them all to use ClusterWritable. Go for it!


On 3/15/12 10:42 AM, Paritosh Ranjan wrote:

I saw the code and my understanding of the new implementation is:
a) K-Means, Fuzzy K-Means and Dirichlet will use ClusterIterator and write IntWritable, ClusterWritable in the buildClusters phase ( instead of Kluster, SoftCluster and DirichletCluster )
b) Canopy and MeanShift will NOT use ClusterIterator but will emit IntWritable, ClusterWritable ( instead of Canopy and MeanShiftCanopy )
There are tools ( ClusterDumper and ClusterEvaluator ) which expect <Cluster> when they read from the output file after clustering ( ~buildCluster phase ).
KMeans is expecting Canopy and KCluster, but will get ClusterWritable.

So, everything needs to be in sync ( i.e. ClusterWritable )
I propose to wrap everything in ClusterWritable first, as everything is a Cluster ( e.g. DirichletCluster, SoftCluster, Kluster, Canopy and MeanShiftCanopy ). This will remove the inconsistency without much chaos. Once ClusterWritable is uniformly used, then refactor all algorithms.
I am also not against making ClusterDumper unavailable for a week or so since we have ClusterOutputPostProcessor now.
Is my understanding correct? If not, please help me understand it.
If yes, which way do you propose to refactor?
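
To make the "wrap everything in ClusterWritable first" proposal concrete, here is a minimal sketch of the wrapper idea. It is an illustration only: SimpleCluster, SimpleKluster and SimpleClusterWritable are hypothetical stand-ins written for this example, not Mahout's Cluster, Kluster or ClusterWritable classes, and the on-disk layout (class name followed by the cluster's own fields) is an assumption. The point it shows is that a single wrapper type can carry any concrete cluster, so every reader sees one value type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class ClusterWritableSketch {

  /** Hypothetical stand-in for a cluster type; NOT Mahout's Cluster interface. */
  public interface SimpleCluster {
    int getId();
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  /** Hypothetical concrete cluster, playing the role of Kluster, Canopy, etc. */
  public static class SimpleKluster implements SimpleCluster {
    private int id;
    private double[] center = new double[0];

    /** No-arg constructor so the wrapper can create an instance reflectively. */
    public SimpleKluster() { }

    public SimpleKluster(int id, double[] center) {
      this.id = id;
      this.center = center.clone();
    }

    @Override public int getId() { return id; }

    public double[] getCenter() { return center.clone(); }

    @Override public void write(DataOutput out) throws IOException {
      out.writeInt(id);
      out.writeInt(center.length);
      for (double c : center) {
        out.writeDouble(c);
      }
    }

    @Override public void readFields(DataInput in) throws IOException {
      id = in.readInt();
      center = new double[in.readInt()];
      for (int i = 0; i < center.length; i++) {
        center[i] = in.readDouble();
      }
    }
  }

  /** Hypothetical wrapper: records the concrete class name, then delegates. */
  public static class SimpleClusterWritable {
    private SimpleCluster value;

    public SimpleCluster getValue() { return value; }

    public void setValue(SimpleCluster value) { this.value = value; }

    public void write(DataOutput out) throws IOException {
      out.writeUTF(value.getClass().getName());
      value.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      String className = in.readUTF();
      try {
        value = (SimpleCluster) Class.forName(className)
            .getDeclaredConstructor().newInstance();
      } catch (ReflectiveOperationException e) {
        throw new IOException("Cannot instantiate " + className, e);
      }
      value.readFields(in);
    }
  }

  /** Serializes a cluster through the wrapper and reads it back. */
  public static SimpleCluster roundTrip(SimpleCluster cluster) throws IOException {
    SimpleClusterWritable writer = new SimpleClusterWritable();
    writer.setValue(cluster);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    writer.write(new DataOutputStream(bytes));

    SimpleClusterWritable reader = new SimpleClusterWritable();
    reader.readFields(
        new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    return reader.getValue();
  }

  public static void main(String[] args) throws IOException {
    SimpleCluster restored = roundTrip(new SimpleKluster(7, new double[] {1.0, 2.5}));
    System.out.println(restored.getClass().getSimpleName() + " id=" + restored.getId());
  }
}
```

Under these assumptions, a reader such as a dumper or evaluator only needs to know the wrapper type; the concrete cluster class is recovered from the stream, which is why converting the value type first can be done before any algorithm is refactored.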

On 15-03-2012 19:24, Jeff Eastman wrote:
Yes, that was my point below. It may, in fact, be impossible to implement and commit them independently since so much of Mahout clustering depends upon the Cluster sequenceFile. You may be able to get part way by moving the Canopy mods into the kmeans issue, but then the cluster dumper and evaluator will not work with kmeans.
Ideas?

On 3/14/12 10:15 PM, Paritosh Ranjan wrote:
Thanks Jeff. One question: are the "Use ClusterIterator" tasks dependent on the "Modify Canopy etc to use ClusterWritable" task?
I am assuming that all subtasks in MAHOUT-933 <https://issues.apache.org/jira/browse/MAHOUT-933> are independent of each other and the order to pick them does not matter. Am I correct?

On 15-03-2012 09:23, Jeff Eastman wrote:
Sure Paritosh, go ahead and take a crack at it. I am moving from CO to PA for the next few weeks and won't be able to do much coding during that period. I suspect you will also need to modify Canopy to emit ClusterWritable and also the RandomSeedGenerator.
Smooth sailing,
Jeff

On 3/14/12 8:28 PM, Paritosh Ranjan (Commented) (JIRA) wrote:
[https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229840#comment-13229840]
Paritosh Ranjan commented on MAHOUT-988:
----------------------------------------
Jeff, I would like to work on this issue (or MAHOUT-989, or MAHOUT-990). Can I? I might also need some help ( at least the first patch review ).
Convert K-means buildClusters to use new ClusterIterator
--------------------------------------------------------

                 Key: MAHOUT-988
                 URL: https://issues.apache.org/jira/browse/MAHOUT-988
             Project: Mahout
          Issue Type: Sub-task
          Components: Clustering
    Affects Versions: 0.6
            Reporter: Jeff Eastman
            Assignee: Jeff Eastman
             Fix For: 0.7
Refactor the current K-means implementation to use the ClusterIterator/Classifier implementation. This will replace the mapper, combiner, reducer, clusterer and many unit tests but will not modify the other driver APIs, thus retaining compatibility with existing CLI.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
