Hijacking the sparse vectorizer from the SGD patch might help with this. Likewise, using an L1 model distribution would enforce sparseness by nature (I think). Sampling from the L1 prior might be a bit of a trip.
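For concreteness, here is a rough sketch of what sampling from an L1 (i.e. Laplace, double-exponential) prior could look like. This class is purely illustrative and not part of Mahout, but inverse-CDF sampling keeps it to a couple of lines:

import java.util.Random;

/**
 * Illustrative sampler for a Laplace (double-exponential) prior, the
 * distribution whose negative log-density is the L1 penalty. Most of its
 * mass sits near zero, so parameters drawn from it tend toward sparseness.
 * Not part of Mahout; the class name and scale parameter are made up.
 */
public class LaplacePriorSampler {
  private final Random rng = new Random();
  private final double scale; // b in p(x) = exp(-|x| / b) / (2 * b)

  public LaplacePriorSampler(double scale) {
    this.scale = scale;
  }

  // Inverse-CDF sampling: draw u ~ Uniform(-0.5, 0.5); then
  // x = -b * sgn(u) * ln(1 - 2|u|) is Laplace(0, b) distributed.
  public double sample() {
    double u = rng.nextDouble() - 0.5;
    return -scale * Math.signum(u) * Math.log(1.0 - 2.0 * Math.abs(u));
  }
}

Drawing each element of a model prototype this way would concentrate most coordinates near zero, which is the sparseness-by-nature effect I mean.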
On Mon, Jan 18, 2010 at 4:27 PM, Jeff Eastman <[email protected]> wrote:

> I think you will need to bound your model dimensionality to use Dirichlet.
> If you are using TF-IDF vectors to represent your documents, I would think
> these would all have the same maximum cardinality, which you could specify
> for the modelPrototype size. I just committed a new model distribution
> (SparseNormalModelDistribution) that includes a heuristic
> sampleFromPosterior() to remove small mean element values to preserve
> model sparseness. It's probably bogus but a place to begin.
>
> I have also written one new unit test that runs in memory over a small,
> 50-d sparse model and 100 50-d sparse vectors. It does not explode.
>
> Just do another update before you begin, to pick up those changes.
>
> Bogdan Vatkov wrote:
>
>> Well, dimensions - I am just using a slightly modified version of
>> LuceneDriver (added stopword removal and regex removal of incoming
>> terms), so I guess it is just a list of unidimensional vectors of
>> random length. I will try to run the new code tomorrow.
>>
>> On Mon, Jan 18, 2010 at 10:18 PM, Jeff Eastman <[email protected]> wrote:
>>
>>> Yes, they're all in trunk. Just do an svn update and mvn install to
>>> get them.
>>>
>>> BTW, what's the dimensionality of your data?
>>>
>>> Jeff
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> I will try the NormalModelDistribution, but I am wondering how to
>>>> obtain "MAHOUT-251" - is this a tag in the SVN, or what is it? How
>>>> can I get the source containing the changes - do I simply sync from
>>>> trunk? I suppose I have to run mvn install after that, right?
>>>>
>>>> Best regards,
>>>> Bogdan
>>>>
>>>> On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <[email protected]> wrote:
>>>>
>>>>> Bogdan,
>>>>>
>>>>> Recent resolution of MAHOUT-251 should allow you to experiment with
>>>>> Dirichlet clustering for text models with arbitrary dimensionality.
>>>>> I suggest starting with the NormalModelDistribution with a large
>>>>> sparse vector as its prototype. The other model distributions create
>>>>> sampled values for all the prior model dimensions, negating any
>>>>> value of using sparse vectors for their prototypes.
>>>>>
>>>>> It may in fact be necessary to introduce a new ModelDistribution and
>>>>> Model so that sparse model elements will not fill up with
>>>>> insignificant values. After the first iteration computes the new
>>>>> posterior model parameters from the observations, many of these
>>>>> values will likely be small, so some heuristic would be needed to
>>>>> preserve model sparseness by removing them altogether. If all these
>>>>> values are retained, it is probably better to use a dense vector
>>>>> representation. A 50k-dimensional model will be a real compute hog
>>>>> if it is not kept sparse somehow. Maybe sampleFromPosterior() or
>>>>> sample() would be good places to embed this heuristic.
>>>>>
>>>>> I'll begin writing some tests to experiment with these models.

--
Ted Dunning, CTO DeepDyve
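P.S. A minimal sketch of the small-value pruning Jeff describes for SparseNormalModelDistribution's sampleFromPosterior() might look like the following. A plain map stands in for Mahout's sparse Vector here, and the threshold is made up; treat it as an illustration of the heuristic, not the committed code:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/**
 * Illustrative pruning heuristic: after computing posterior means from the
 * observations, drop elements whose magnitude falls below a threshold so
 * the model stays sparse. A map stands in for a sparse vector; the
 * threshold value is arbitrary.
 */
public class SparsePruner {

  /** Removes entries whose absolute value is below minMagnitude. */
  public static void prune(Map<Integer, Double> sparseMeans, double minMagnitude) {
    Iterator<Map.Entry<Integer, Double>> it = sparseMeans.entrySet().iterator();
    while (it.hasNext()) {
      if (Math.abs(it.next().getValue()) < minMagnitude) {
        it.remove();
      }
    }
  }

  public static void main(String[] args) {
    Map<Integer, Double> means = new HashMap<Integer, Double>();
    means.put(3, 0.9);
    means.put(17, 1.0e-6); // insignificant posterior mean; will be pruned
    means.put(42, 0.4);
    prune(means, 1.0e-3);
    System.out.println(means); // entry 17 removed; 3 and 42 remain
  }
}

Whether the cutoff should be absolute, relative to the largest element, or tied to the observation counts is exactly the kind of heuristic question the unit tests should shake out.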
