-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Thursday, November 03, 2011 2:29 PM
To: [email protected]
Subject: Re: Dirchlet
On Thu, Nov 3, 2011 at 2:18 PM, Jeff Eastman <[email protected]> wrote:

> AbstractCluster already has the running sum of squares implemented and the
> kmeans and fuzzyk combiners count on being able to combine its partial
> parameters (see ClusterObservations which are passed to combiner and
> reducer). I have an implementation of Welford in OnlineGaussianAccumulator
> which I would love to substitute, but I don't know the math to combine
> them. If, as you say, it is "like addition", could you please be more
> specific (i.e. suggest a combine(other) method for that OGA?)
>

That is an interesting idea to actually put that method on the OGA. I have been thinking only in terms of models, but having it there as well wouldn't be bad at all.

[jeff] All the current cluster models maintain Gaussian statistics (AbstractCluster fields s0, s1, s2). If an OGA was used in AC instead then passing a model would pass its OGA and model.add(model) would just delegate to oga.add(model.getOga). Other models could do it differently, of course, but so far we don't have any.

OGA does the computation of mean and variance on a per coordinate basis. This is the axis aligned case that I mentioned.

> With respect to a Dirichlet combiner, the same mechanism of combining
> observations used in kmeans and fuzzyk should work, but perhaps those
> combiners should be passing clusters and combining cluster observations
> too, rather than just passing the running sums in ClusterObservations?
>

I think that a combiner based clustering should only be passing clusters. A non-combiner clustering should pass points. A resolution for that tension is not obvious to me.

[jeff] Combiner based clustering would work for all the existing cluster models.

> This is something I would really like to clean up for 1.0
>

Indeed.
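On the combine(other) question: with plain running sums (s0, s1, s2) the merge is literally addition of the three fields. For a Welford-style accumulator the merge is not quite addition, but it is still a closed-form pairwise update, usually attributed to Chan, Golub and LeVeque. Below is a minimal sketch of what such a method might look like on a per-coordinate (axis aligned) accumulator. The class RunningGaussian and all of its method names are hypothetical, made up for illustration; this is not the actual OnlineGaussianAccumulator API.

```java
/**
 * Hypothetical per-coordinate Welford accumulator with a merge step.
 * Illustration only; names and fields are assumptions, not Mahout's API.
 */
final class RunningGaussian {
  private long n;               // number of observations seen so far
  private final double[] mean;  // running mean, one entry per coordinate
  private final double[] m2;    // sum of squared deviations from the mean

  RunningGaussian(int dimensions) {
    mean = new double[dimensions];
    m2 = new double[dimensions];
  }

  /** Standard Welford update for one observation. */
  void observe(double[] x) {
    if (x.length != mean.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    n++;
    for (int i = 0; i < mean.length; i++) {
      double delta = x[i] - mean[i];
      mean[i] += delta / n;
      m2[i] += delta * (x[i] - mean[i]);
    }
  }

  /** Folds another accumulator's partial statistics into this one. */
  void combine(RunningGaussian other) {
    if (other.mean.length != mean.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    if (other.n == 0) {
      return; // nothing to merge
    }
    long total = n + other.n;
    for (int i = 0; i < mean.length; i++) {
      double delta = other.mean[i] - mean[i];
      // Correction term accounts for the two partial means differing.
      m2[i] += other.m2[i] + delta * delta * ((double) n * other.n / total);
      mean[i] += delta * other.n / total;
    }
    n = total;
  }

  long count() {
    return n;
  }

  double[] mean() {
    return mean.clone();
  }

  /** Population variance per coordinate (divides by n, not n - 1). */
  double[] variance() {
    double[] v = new double[m2.length];
    for (int i = 0; i < v.length; i++) {
      v[i] = m2[i] / n;
    }
    return v;
  }
}
```

A combiner would call observe(point) for each point it sees and emit the accumulator; the reducer would then call combine(partial) on each partial it receives. The merge is associative in exact arithmetic, so the order in which partials arrive should not matter beyond floating point rounding.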
