On Mon, Nov 5, 2012 at 6:03 PM, Ted Dunning <[email protected]> wrote:
> On Mon, Nov 5, 2012 at 4:44 AM, Dan Filimon <[email protected] > >wrote: > > > > > Ted told me that Mahout Centroids [1] are Weighted vectors that > > additionally perform a Welford-style update of a vector. > > > > I think that there may be an older Centroid definition that is different > from this. > > So, in the code, for an existing Centroid c, with weight w_c, updating it > > with a new Vector v whose weight is w_v, the result of an "update" is: > > > > (w_c * c[i] + w_v * v[i]) / (w_c + w_v), for all elements (i is the > index) > > > > Correct. > > > > Since weights actually mean the number of elements in a certain cluster, > > merging two clusters is exactly the operation described above. > > > > Why is this called a Welford update? > > > > See here > > http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm Ah, thank you! I've seen this in the Mahout clustering code. By the older kind of Centroid, I think you're referring to AbstractCluster from mahout/clustering/AbstractCluster. I saw an issue on JIRA about refactoring that computation to use the OnlineGaussianAccumulator class (I'd like to do that eventually :. > > Also, why is the original Vector's function named assign? > > > This is just one of several assign functions. The name assign comes from > the original name used in the Colt library and is intended to indicate that > it is a destructive operation rather than a copying operation like times(). > > It's really an implementation of the higher order function zipwith [2]. > > > > I don't know that zipwith is a more common name. Haskell has historically > had a very small community. > That's fair. Not to be a functional troll, but having a bit of background makes it clear what the function does without having to read the code. :P But I think documenting its behavior is the better approach.
