Great, let me see what I can build this weekend as a separate universal clusterer using these ideas.
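For reference, here is roughly the per-iteration loop such a universal clusterer would imply, based on the steps quoted below. This is only a sketch: the Vector and ClusterClassifier interfaces are simplified stand-ins for the Mahout types discussed in the thread, and ClusteringPolicy is an invented name for the per-algorithm training rule.

// Sketch of one clustering iteration over a prior ClusterClassifier.
// These interfaces are simplified stand-ins, not the real Mahout API.

interface Vector { double get(int i); int size(); }

interface ClusterClassifier {
  Vector classify(Vector instance);        // pdf vector over the wrapped models
  void train(int actual, Vector instance); // observe the point with the chosen model
  void close();                            // compute posterior parameters for all models
}

// Invented name: the per-algorithm rule (kmeans, fuzzyk, Dirichlet) that decides
// which model(s) to train from the pdf vector returned by classify().
interface ClusteringPolicy {
  void train(ClusterClassifier classifier, Vector pdfs, Vector instance);
}

final class UniversalClusterer {
  private UniversalClusterer() {}

  // One iteration: classify every point, let the policy train the models,
  // then close the classifier so the posterior becomes the next prior.
  static void iterate(ClusterClassifier prior, Iterable<Vector> data, ClusteringPolicy policy) {
    for (Vector point : data) {
      Vector pdfs = prior.classify(point);
      policy.train(prior, pdfs, point);
    }
    prior.close(); // would then be serialized into clusters-n+1
  }
}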
-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, April 13, 2011 9:46 PM
To: [email protected]
Cc: Jeff Eastman
Subject: Re: FW: Converging Clustering and Classification

Yeah... this is what I had in mind when I said grand unified theory.

On Wed, Apr 13, 2011 at 9:24 PM, Jeff Eastman <[email protected]> wrote:

> If this isn't all a crock, it could potentially collapse kmeans, fuzzyk and Dirichlet into a single implementation too:
>
> - Begin with a prior ClusterClassifier containing the appropriate sort of Cluster, in clusters-n
> - For each input Vector, compute the pdf vector using CC.classify()
> -- For kmeans, train the most likely model from the pdf vector
> -- For Dirichlet, train the model selected by the multinomial of the pdf vector * mixture vector
> -- For fuzzyk, train each model by its normalized pdf (would need a new classify method for this)
> - Close the CC, computing all posterior model parameters
> - Serialize the CC into clusters-n+1
>
> Now that would really be cool
>
>
> On 4/13/11 9:00 PM, Jeff Eastman wrote:
>
>> Lol, not too surprising considering the source. Here's how I got there:
>>
>> - ClusterClassifier holds a "List<Cluster> models;" field as its only state, just like VectorModelClassifier does
>> - Started with ModelSerializerTest since you suggested being compatible with ModelSerializer
>> - This tests OnlineLogisticRegression, CrossFoldLearner and AdaptiveLogisticRegression
>> - The first two are also subclasses of AbstractVectorClassifier, just like ClusterClassifier
>> - The tests pass OLR and CFL learners to train(OnlineLearner), so it made sense for a CC to be an OL too
>> - The new CC.train(...) methods map to "models.get(actual).observe()" in Cluster.observe(V)
>> - CC.close() maps to cluster.computeParameters() for each model, which computes the posterior cluster parameters
>> - Now the CC is ready for another iteration or to classify, etc.
>>
>> So, the cluster iteration process starts with a prior List<Cluster> which is used to construct the ClusterClassifier. Then in each iteration each point is passed to CC.classify() and the maximum-probability element index in the returned Vector is used to train() the CC. Since all the DistanceMeasureClusters contain their appropriate DistanceMeasure, the one with the maximum pdf() is the closest. Just what kmeans already does, but done less efficiently (it uses just the minimum distance, but pdf() = e^-distance, so the closest cluster has the largest pdf()).
>>
>> Finally, instead of passing in a List<Cluster> in the KMeansClusterer I can just carry around a CC which wraps it. Instead of serializing a List<Cluster> at the end of each iteration I can just serialize the CC. At the beginning of the next iteration, I just deserialize it and go.
>>
>> It was so easy it surely must be wrong :)
>>
>>
>> On 4/13/11 7:54 PM, Ted Dunning wrote:
>>
>>> On Wed, Apr 13, 2011 at 6:24 PM, Jeff Eastman <[email protected]> wrote:
>>>
>>>> I've been able to prototype a ClusterClassifier which, like VectorModelClassifier, extends AbstractVectorClassifier but which also implements OnlineLearner and Writable.
>>>
>>> Implementing OnlineLearner is a surprise here.
>>> Have to think about it since the learning doesn't have a target variable.
>>>
>>>> ... If this could be completed it would seem to allow kmeans, fuzzyk, dirichlet and maybe even meanshift cluster classifiers to be used with SGD.
>>>
>>> Very cool.
>>>
>>>> ... The challenge would be to use AVC.classify() in the various clusterers or to extract initial centers for kmeans & fuzzyk. Dirichlet might be adaptable more directly since its models only have to produce the pi vector of pdfs.
>>>
>>> Yes. Dirichlet is the one where this makes sense.
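As an addendum, a minimal sketch of the ClusterClassifier prototype described in the thread above, showing the classify() -> pdf(), train() -> observe() and close() -> computeParameters() mapping. The Vector and Cluster interfaces here are simplified stand-ins for the Mahout types, and the AbstractVectorClassifier / OnlineLearner / Writable plumbing is omitted.

// Minimal sketch of the ClusterClassifier prototype; Vector and Cluster are
// simplified stand-ins for the Mahout types, and the AbstractVectorClassifier /
// OnlineLearner / Writable plumbing is omitted.
import java.util.List;

interface Vector { double get(int i); int size(); }

interface Cluster {
  double pdf(Vector x);        // e^-distance for a DistanceMeasureCluster
  void observe(Vector x);      // accumulate the point into the model's statistics
  void computeParameters();    // compute posterior parameters from the observations
}

class ClusterClassifier {
  private final List<Cluster> models;  // the only state, as in VectorModelClassifier

  ClusterClassifier(List<Cluster> models) {
    this.models = models;
  }

  // classify(): one pdf per model; callers take the max, sample, or weight by it.
  double[] classify(Vector instance) {
    double[] pdfs = new double[models.size()];
    for (int i = 0; i < models.size(); i++) {
      pdfs[i] = models.get(i).pdf(instance);
    }
    return pdfs;
  }

  // train(): maps straight onto models.get(actual).observe(instance)
  void train(int actual, Vector instance) {
    models.get(actual).observe(instance);
  }

  // close(): computeParameters() on every model; the classifier is then ready
  // for another iteration or for classification.
  void close() {
    for (Cluster model : models) {
      model.computeParameters();
    }
  }
}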

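Finally, a sketch of the three per-algorithm training rules discussed in the thread, written against a simplified classifier interface. The weighted train(actual, weight, instance) used for fuzzyk is hypothetical (it stands in for the new method the thread says would be needed), and the multinomial sampling for Dirichlet is spelled out inline.

// Sketches of the three training rules from the thread, against a simplified
// classifier interface. The weighted train(...) used for fuzzyk is hypothetical.
import java.util.Random;

interface Vector { double get(int i); int size(); }

interface TrainableClassifier {
  void train(int actual, Vector instance);                // train one model
  void train(int actual, double weight, Vector instance); // hypothetical weighted variant
}

final class TrainingPolicies {
  private TrainingPolicies() {}

  // kmeans: train only the most likely model. Since pdf() = e^-distance, the
  // largest pdf is the smallest distance, i.e. the nearest cluster.
  static void kmeans(TrainableClassifier cc, double[] pdfs, Vector point) {
    int best = 0;
    for (int i = 1; i < pdfs.length; i++) {
      if (pdfs[i] > pdfs[best]) best = i;
    }
    cc.train(best, point);
  }

  // fuzzyk: train every model, weighted by its normalized pdf.
  static void fuzzyk(TrainableClassifier cc, double[] pdfs, Vector point) {
    double total = 0;
    for (double p : pdfs) total += p;
    for (int i = 0; i < pdfs.length; i++) {
      cc.train(i, pdfs[i] / total, point);
    }
  }

  // Dirichlet: sample one model from the multinomial given by pdf * mixture.
  static void dirichlet(TrainableClassifier cc, double[] pdfs, double[] mixture, Vector point, Random rng) {
    double[] posterior = new double[pdfs.length];
    double total = 0;
    for (int i = 0; i < pdfs.length; i++) {
      posterior[i] = pdfs[i] * mixture[i];
      total += posterior[i];
    }
    double u = rng.nextDouble() * total;
    double cumulative = 0;
    int chosen = pdfs.length - 1; // fall back to the last model on rounding error
    for (int i = 0; i < posterior.length; i++) {
      cumulative += posterior[i];
      if (u <= cumulative) {
        chosen = i;
        break;
      }
    }
    cc.train(chosen, point);
  }
}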