So I haven't looked super-carefully at the clustering refactoring work. Can someone give a little overview of what the plan is?

The NewLDA stuff is technically in "clustering". It takes in SeqFile<IW,VW> documents as the training corpus and spits out two things: a SeqFile<IW,VW> of a "model" (keyed on topicId, one vector per topic) and a SeqFile<IW,VW> of "classifications" (keyed on docId, one vector over the topic space giving the projection onto each topic dimension).

This is similar to how SVD clustering/decomposition works, but with L1-normed outputs instead of L2. But this seems very different from all of the structures in the rest of clustering.

  -jake

On Wed, Feb 22, 2012 at 7:56 AM, Jeff Eastman <[email protected]> wrote:

> Hi Saikat,
>
> I agree with Paritosh, that a great place to begin would be to write some
> unit tests. This will familiarize you with the code base and help us a lot
> with our 0.7 housekeeping release. The new clustering classification
> components are going to unify many - but not all - of the existing
> clustering algorithms to reduce their complexity by factoring out
> duplication and streamlining their integration into semi-supervised
> classification engines.
>
> Please feel free to post any questions you may have in reading through
> this code. This is a major refactoring effort and we will need all the help
> we can get. Thanks for the offer,
>
> Jeff
>
>
> On 2/21/12 10:46 PM, Saikat Kanjilal wrote:
>
>> Hi Paritosh, Yes creating the test case would be a great first start,
>> however are there other tasks you guys need help with before I can do
>> before the test creation, I will sync trunk and start reading through the
>> code in the meantime. Regards
>>
>> Date: Wed, 22 Feb 2012 10:57:51 +0530
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: Re: Helping out with the .7 release
>>>
>>> We are creating clustering as classification components which will help
>>> in moving clustering out. Once the component is ready, then the
>>> clustering algorithms would need refactoring.
>>> The clustering as classification component and the outlier removal
>>> component has been created.
>>>
>>> Most of it is committed, and rest is available as a patch. See
>>> https://issues.apache.org/jira/browse/MAHOUT-929
>>> If you will apply the latest patch available on Mahout-929 you can see
>>> all that is available now.
>>>
>>> If you want, you can help with the test case of
>>> ClusterClassificationMapper available in the patch.
>>>
>>> On 22-02-2012 10:27, Saikat Kanjilal wrote:
>>>
>>>> Hi Guys, I was interested in helping out with the clustering component
>>>> of mahout, I looked through the JIRA items below and was wondering if there
>>>> is a specific one that would be good to start with:
>>>>
>>>> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+MAHOUT+AND+resolution+%3D+Unresolved+AND+component+%3D+Clustering+ORDER+BY+priority+DESC&mode=hide
>>>>
>>>> I initially was thinking to work on Mahout-930 or Mahout-931 but could
>>>> work on others if needed.
>>>> Best Regards
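
P.S. To make the L1 vs. L2 remark at the top of this mail concrete, here is a rough sketch. It is not Mahout code: it uses plain double[] arrays in place of the vectors stored in the SeqFiles, and the class and method names (NormSketch, l1Normalize, l2Normalize) are made up for illustration. The idea, as I understand it, is that an L1-normed per-document vector has absolute entries summing to 1 across topics, while an L2-normed one has unit Euclidean length.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names, plain arrays instead of Mahout vectors.
public class NormSketch {

    // Scale v so the sum of absolute values is 1 (L1 norm).
    public static double[] l1Normalize(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += Math.abs(x);
        }
        return scale(v, sum);
    }

    // Scale v so the Euclidean length is 1 (L2 norm).
    public static double[] l2Normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += x * x;
        }
        return scale(v, Math.sqrt(sumSq));
    }

    // Divide every entry by norm; an all-zero vector is returned unchanged.
    private static double[] scale(double[] v, double norm) {
        double[] out = v.clone();
        if (norm == 0.0) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up raw weights of one document over four topics.
        double[] raw = {3.0, 1.0, 0.0, 4.0};
        System.out.println("L1: " + Arrays.toString(l1Normalize(raw)));
        System.out.println("L2: " + Arrays.toString(l2Normalize(raw)));
    }
}
```

With non-negative weights, the L1 version can be read as a distribution over topics, which is why it fits the "classifications" output better than the unit-length vectors you get from an SVD-style projection.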
