Hey Ted,
Could you elaborate on this approach? I don't grok how an "all reduce
implementation" can be done with a "map-only job", or how a mapper could
do "all iteration[s] internally".
I've just gotten the ClusterIterator working in MR mode and it does what
I thought we'd been talking about earlier: In each iteration, each
mapper loads all the prior clusters and then iterates through all its
input points, training each of the prior clusters in the process. Then,
in the cleanup() method, all the trained clusters are sent to the
reducers keyed by their model indexes. This eliminates the need for a
combiner and means each reducer only has to merge n mappers' worth of
trained clusters into a posterior trained cluster before it is output.
If numReducers == k then the current reduce-step overloads should disappear.
The secret to this implementation is to allow clusters to observe other
clusters in addition to observing vectors, thereby accumulating all of
those clusters' observation statistics before recomputing posterior
parameters.
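For concreteness, here is a minimal, self-contained Java sketch of that observe-and-merge idea. SimpleCluster and its fields are illustrative stand-ins, not the actual Mahout Cluster API:

```java
import java.util.Arrays;

// A cluster accumulates the count (s0) and running sum (s1) of what it
// observes, so observing another cluster is just adding its statistics.
class SimpleCluster {
  double s0;        // number of points observed (possibly via other clusters)
  double[] s1;      // running sum of observed points
  double[] center;  // posterior mean, valid after computePosterior()

  SimpleCluster(int dim) {
    s1 = new double[dim];
    center = new double[dim];
  }

  // observe a raw input vector: what each mapper does per point
  void observe(double[] v) {
    s0 += 1;
    for (int i = 0; i < v.length; i++) s1[i] += v[i];
  }

  // observe another cluster: what each reducer does with the
  // partial clusters sent from the mappers' cleanup() calls
  void observe(SimpleCluster other) {
    s0 += other.s0;
    for (int i = 0; i < s1.length; i++) s1[i] += other.s1[i];
  }

  // recompute posterior parameters from the merged statistics
  void computePosterior() {
    for (int i = 0; i < s1.length; i++) center[i] = s1[i] / s0;
  }
}

public class ObserveDemo {
  public static double[] demo() {
    // two "mappers" each train a partial cluster on their input split
    SimpleCluster m1 = new SimpleCluster(2);
    m1.observe(new double[] {1, 1});
    m1.observe(new double[] {3, 1});
    SimpleCluster m2 = new SimpleCluster(2);
    m2.observe(new double[] {5, 3});

    // the "reducer" merges the partial clusters, then recomputes parameters
    SimpleCluster posterior = new SimpleCluster(2);
    posterior.observe(m1);
    posterior.observe(m2);
    posterior.computePosterior();
    return posterior.center;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(demo())); // mean of all three points
  }
}
```

The point is that merging in the reducer is exactly the same operation as training in the mapper, just applied to whole clusters instead of single vectors.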
On 2/22/12 1:42 PM, Ted Dunning wrote:
I would also like to see if we can put an all reduce implementation into this
effort. The idea is that we can use a map only job that does all iteration
internally. I think that this could result in more than an order of magnitude
speed up for our clustering codes. It could also provide similar benefits for
the nascent parallel classifier training work.
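One possible reading of the map-only, all-iterations-internal idea, as a single-process sketch: each "mapper" keeps its input split in memory and runs every k-means iteration itself; per iteration, the mappers' partial cluster statistics are summed by an all-reduce, which here is just an in-process loop standing in for whatever transport a real implementation would use. All names here are hypothetical, not Mahout code:

```java
import java.util.Arrays;

public class AllReduceKMeansSketch {

  static double dist2(double[] a, double[] b) {
    double d = 0;
    for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; d += t * t; }
    return d;
  }

  // one mapper's local pass over its split: stats[c] = [count, sum_0, sum_1, ...]
  static double[][] localStats(double[][] points, double[][] centroids) {
    int k = centroids.length, dim = centroids[0].length;
    double[][] stats = new double[k][dim + 1];
    for (double[] p : points) {
      int best = 0;
      for (int c = 1; c < k; c++)
        if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
      stats[best][0] += 1;
      for (int i = 0; i < dim; i++) stats[best][i + 1] += p[i];
    }
    return stats;
  }

  // the all-reduce step: element-wise sum of every mapper's partial stats;
  // in a real job this would go over the network, not an in-process loop
  static double[][] allReduce(double[][][] partials) {
    int k = partials[0].length, w = partials[0][0].length;
    double[][] total = new double[k][w];
    for (double[][] part : partials)
      for (int c = 0; c < k; c++)
        for (int i = 0; i < w; i++) total[c][i] += part[c][i];
    return total;
  }

  // all iterations happen here, with no reduce phase and no intermediate I/O
  public static double[][] run(double[][][] splits, double[][] centroids, int iterations) {
    for (int it = 0; it < iterations; it++) {
      double[][][] partials = new double[splits.length][][];
      for (int m = 0; m < splits.length; m++)
        partials[m] = localStats(splits[m], centroids);
      double[][] total = allReduce(partials);
      for (int c = 0; c < centroids.length; c++)
        if (total[c][0] > 0)
          for (int i = 0; i < centroids[c].length; i++)
            centroids[c][i] = total[c][i + 1] / total[c][0];
    }
    return centroids;
  }

  public static void main(String[] args) {
    double[][][] splits = {
      { {0, 0}, {0, 1} },    // mapper 1's split
      { {10, 0}, {10, 1} }   // mapper 2's split
    };
    double[][] centroids = { {1, 0}, {9, 0} };
    System.out.println(Arrays.deepToString(run(splits, centroids, 5)));
  }
}
```

The claimed speedup would come from avoiding a full job launch, shuffle, and HDFS round-trip per iteration; the data is read once and only the small cluster statistics move between workers.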
This seems to be a cleanup of a long-standing wart in our code, but others
may reasonably feel differently.
Sent from my iPhone
On Feb 22, 2012, at 10:32 AM, Jeff Eastman<[email protected]> wrote:
This refactoring is focused on some of the iterative clustering algorithms
which, in each iteration, load a prior set of clusters ( e.g. clusters-0) and
process each input vector against them to produce a posterior set of clusters
(e.g. clusters-1) for the next iteration. This will result in k-Means, fuzzyK
and Dirichlet being collapsed into a ClusterIterator iterating over a
ClusterClassifier using a ClusteringPolicy. You can see these classes in
o.a.m.clustering. They are a work in progress, but the in-memory, sequential-from-sequence-files, and
k-means MR variants work in tests and can be demonstrated in the
DisplayXX examples which employ them.
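A rough sketch of the shape this takes (names and signatures are illustrative, not the actual o.a.m.clustering API): one generic iteration loop, with the per-algorithm behavior isolated in a ClusteringPolicy-like interface, so that k-Means and fuzzyK differ only in their policy:

```java
import java.util.Arrays;

// the policy decides how strongly each cluster is trained on a given point
interface Policy {
  double[] weights(double[] point, double[][] centers);
}

class HardPolicy implements Policy {  // k-means-like: winner takes all
  public double[] weights(double[] p, double[][] centers) {
    double[] w = new double[centers.length];
    int best = 0;
    for (int c = 1; c < centers.length; c++)
      if (dist2(p, centers[c]) < dist2(p, centers[best])) best = c;
    w[best] = 1;
    return w;
  }
  static double dist2(double[] a, double[] b) {
    double d = 0;
    for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; d += t * t; }
    return d;
  }
}

class SoftPolicy implements Policy {  // fuzzyK-like: inverse-distance weights
  public double[] weights(double[] p, double[][] centers) {
    double[] w = new double[centers.length];
    double sum = 0;
    for (int c = 0; c < centers.length; c++) {
      w[c] = 1.0 / (HardPolicy.dist2(p, centers[c]) + 1e-9);
      sum += w[c];
    }
    for (int c = 0; c < w.length; c++) w[c] /= sum;  // normalize to a pdf
    return w;
  }
}

public class ClusterIteratorSketch {
  // each iteration maps the prior centers to posterior centers
  public static double[][] iterate(double[][] points, double[][] centers,
                                   Policy policy, int iterations) {
    for (int it = 0; it < iterations; it++) {
      double[][] sums = new double[centers.length][centers[0].length];
      double[] counts = new double[centers.length];
      for (double[] p : points) {
        double[] w = policy.weights(p, centers);
        for (int c = 0; c < centers.length; c++) {
          counts[c] += w[c];
          for (int i = 0; i < p.length; i++) sums[c][i] += w[c] * p[i];
        }
      }
      for (int c = 0; c < centers.length; c++)
        if (counts[c] > 0)
          for (int i = 0; i < centers[0].length; i++)
            centers[c][i] = sums[c][i] / counts[c];
    }
    return centers;
  }

  public static void main(String[] args) {
    double[][] points = { {0, 0}, {0, 1}, {10, 0}, {10, 1} };
    double[][] centers = iterate(points,
        new double[][] { {1, 0}, {9, 0} }, new HardPolicy(), 5);
    System.out.println(Arrays.deepToString(centers));
  }
}
```

Swapping HardPolicy for SoftPolicy (or a Dirichlet-style policy) changes the algorithm without touching the iteration loop, which is the collapse being described.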
Paritosh has also been building a ClusterClassificationDriver
(o.a.m.clustering.classify) which we want to use to factor all of the redundant
cluster-data implementations (-cl option) out of the respective cluster
drivers. This will affect Canopy in addition to the above algorithms.
An imagined benefit of this refactoring comes from the fact that
ClusterClassifier extends AbstractVectorClassifier and implements
OnlineLearner. We think this means that a posterior set of trained Clusters can
be used as a component classifier in a semi-supervised classifier
implementation. I suppose we will need to demonstrate this before we go too
much further in the refactoring but Ted, at least, seems to approve of this
integration approach between supervised classification and clustering
(unsupervised classification). I don't think it has had a lot of other eyeballs
on it.
I don't think LDA fits into this subset of clustering algorithms, and neither
do Canopy and MeanShift. As you note, LDA does not produce Clusters, but I'd be
interested in your reactions to the above.
Jeff
On 2/22/12 9:55 AM, Jake Mannix wrote:
So I haven't looked super-carefully at the clustering refactoring work, can
someone give a little overview of what
the plan is?
The NewLDA stuff is technically in "clustering" and generally works by
taking in SeqFile<IW,VW> documents as the training corpus, and spits out
two things: SeqFile<IW,VW> of a "model" (keyed on topicId, one vector per
topic) and a SeqFile<IW,VW> of "classifications" (keyed on docId, one
vector over the topic space for projection onto each topic dimension).
This is similar to how SVD clustering/decomposition works, but with
L1-normed outputs instead of L2.
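Concretely, the L1-vs-L2 distinction looks like this (a small illustrative example, not Mahout code): LDA's output vectors are normalized so the components sum to 1 (a probability distribution over topics), while SVD-style outputs have unit Euclidean length.

```java
import java.util.Arrays;

public class NormDemo {
  // normalize v by its Lp norm (power = 1 for L1, 2 for L2)
  public static double[] normalize(double[] v, double power) {
    double norm = 0;
    for (double x : v) norm += Math.pow(Math.abs(x), power);
    norm = Math.pow(norm, 1.0 / power);
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
    return out;
  }

  public static void main(String[] args) {
    double[] v = {3, 4};
    System.out.println(Arrays.toString(normalize(v, 1))); // components sum to 1
    System.out.println(Arrays.toString(normalize(v, 2))); // unit length: [0.6, 0.8]
  }
}
```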
But this seems very different from all of the structures in the rest of
clustering.
-jake
On Wed, Feb 22, 2012 at 7:56 AM, Jeff Eastman<[email protected]>wrote:
Hi Saikat,
I agree with Paritosh that a great place to begin would be to write some
unit tests. This will familiarize you with the code base and help us a lot
with our 0.7 housekeeping release. The new clustering classification
components are going to unify many - but not all - of the existing
clustering algorithms to reduce their complexity by factoring out
duplication and streamlining their integration into semi-supervised
classification engines.
Please feel free to post any questions you may have in reading through
this code. This is a major refactoring effort and we will need all the help
we can get. Thanks for the offer,
Jeff
On 2/21/12 10:46 PM, Saikat Kanjilal wrote:
Hi Paritosh,
Yes, creating the test case would be a great first start; however, are there
other tasks you guys need help with before the test creation? I will sync
trunk and start reading through the code in the meantime.
Regards
Date: Wed, 22 Feb 2012 10:57:51 +0530
From: [email protected]
To: [email protected]
Subject: Re: Helping out with the .7 release
We are creating clustering as classification components, which will help
in moving clustering out. Once the component is ready, the
clustering algorithms will need refactoring.
The clustering as classification component and the outlier removal
component have been created.
Most of this is committed, and the rest is available as a patch. See
https://issues.apache.org/jira/browse/MAHOUT-929
If you apply the latest patch available on MAHOUT-929, you can see
all that is available now.
If you want, you can help with the test case for
ClusterClassificationMapper, available in the patch.
On 22-02-2012 10:27, Saikat Kanjilal wrote:
Hi Guys,
I was interested in helping out with the clustering component of Mahout.
I looked through the JIRA items below and was wondering if there
is a specific one that would be good to start with:
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+MAHOUT+AND+resolution+%3D+Unresolved+AND+component+%3D+Clustering+ORDER+BY+priority+DESC&mode=hide
I was initially thinking of working on MAHOUT-930 or MAHOUT-931 but could
work on others if needed.
Best Regards