Hi Jeff,

I will try the NormalModelDistribution, but I am wondering how to obtain the MAHOUT-251 changes. Is this a tag in SVN, or something else? To get the source containing the changes, do I simply sync from trunk? I suppose I have to run mvn install after that, right?

Best regards,
Bogdan

On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <[email protected]> wrote:

> Bogdan,
>
> Recent resolution of MAHOUT-251 should allow you to experiment with
> Dirichlet clustering for text models with arbitrary dimensionality. I
> suggest starting with the NormalModelDistribution with a large sparse vector
> as its prototype. The other model distributions create sampled values for
> all the prior model dimensions, negating any value of using sparse vectors
> for their prototypes.
>
> It may in fact be necessary to introduce a new ModelDistribution and Model
> so that sparse model elements will not fill up with insignificant values.
> After the first iteration computes the new posterior model parameters from
> the observations, many of these values will likely be small so some
> heuristic would be needed to preserve model sparseness by removing them
> altogether. If all these values are retained, it is probably better to use a
> dense vector representation. A 50k-dimensional model will be a real compute
> hog if it is not kept sparse somehow. Maybe sampleFromPosterior() or
> sample() would be good places to embed this heuristic.
>
> I'll begin writing some tests to experiment with these models.

--
Best regards,
Bogdan
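
A minimal sketch of the sparseness-preserving heuristic mentioned in the quoted message: drop posterior values whose magnitude falls below a cutoff. It deliberately uses a plain java.util.Map rather than any Mahout vector class. The class name SparsePruneSketch, the method prune, and the threshold parameter are all hypothetical and not part of Mahout.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Mahout. */
final class SparsePruneSketch {
    private SparsePruneSketch() {}

    /**
     * Returns a copy of a sparse vector (dimension index -> value) that keeps
     * only the entries whose absolute value is at least the threshold.
     * The threshold is an assumed tuning parameter.
     */
    static Map<Integer, Double> prune(Map<Integer, Double> sparse, double threshold) {
        Map<Integer, Double> kept = new HashMap<>();
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            if (Math.abs(e.getValue()) >= threshold) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

Something like this could presumably be applied to the model parameters after each posterior update; where exactly it belongs (for example in sampleFromPosterior() or sample(), as suggested above) is left open.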
