Bogdan,
I coded this up and wrote a unit test which appears to identify the
correct number and shape of the test models. It does indeed preserve the
model sparseness, so I committed it.
The Model Distribution uses a uniform, empty prior of the proper
cardinality. The algorithm does not seem to mind this and converges
pretty quickly on a stable set of models.
The unit test uses the Lucene utilities to compute TF-IDF vectors, which
are fed into the DirichletClusterer.
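For anyone who hasn't looked at the vectorizer output, the values are just
the usual sparse TF-IDF weights. Here is a throwaway toy in plain Java (not
the committed test and not the Lucene utilities, just an illustration of the
kind of vectors that get fed in):

import java.util.*;

public class TfIdfToy {
  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("apache", "mahout", "clustering"),
        Arrays.asList("apache", "lucene", "search"),
        Arrays.asList("dirichlet", "clustering", "models"));

    // document frequency of each term across the toy corpus
    Map<String, Integer> df = new HashMap<>();
    for (List<String> doc : docs) {
      for (String term : new HashSet<>(doc)) {
        df.merge(term, 1, Integer::sum);
      }
    }

    // one sparse TF-IDF map per document: term -> tf * log(numDocs / df)
    int numDocs = docs.size();
    for (List<String> doc : docs) {
      Map<String, Double> tfidf = new HashMap<>();
      for (String term : doc) {
        double tf = Collections.frequency(doc, term);
        tfidf.put(term, tf * Math.log((double) numDocs / df.get(term)));
      }
      System.out.println(tfidf);
    }
  }
}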
It would be interesting to see if it performs at all well on your more
extensive data. Feel free to suggest improvements.
Jeff
Jeff Eastman wrote:
Hi Ted,
Ok, from this and from looking at your code, here is what I get:
L1Model has a single, sparse coefficient vector M[t] where each
coefficient is the probability of that term being present in the model.
As (TF-IDF?) data values X[t] are scanned, the pdf(X) for each model
would be exp(-ManhattanDistanceMeasure(M, X)). The list of pdfs times
the mixture probabilities is then sampled as a multinomial, which
selects a particular model from the list of available models. When the
selected model then observes(X[t]), M = M + X and a count of observed
values is incremented. When computeParameters() is called, presumably M
is normalized (regularized?) and then sampled somehow to become the
posterior model for the next iteration.
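If I restate that in code, I get something like the sketch below. It uses a
plain double[] instead of Mahout's sparse vectors, and the names are mine,
not necessarily what the patch calls things:

public class L1ModelSketch {
  private final double[] coefficients; // M[t], the per-term coefficients
  private int observationCount;        // points assigned this iteration

  public L1ModelSketch(int cardinality) {
    this.coefficients = new double[cardinality]; // starts empty (all zeros)
  }

  // pdf(X) = exp(-L1 (Manhattan) distance between M and X)
  public double pdf(double[] x) {
    double l1 = 0.0;
    for (int t = 0; t < coefficients.length; t++) {
      l1 += Math.abs(coefficients[t] - x[t]);
    }
    return Math.exp(-l1);
  }

  // observe(X): accumulate the data point and count it
  public void observe(double[] x) {
    for (int t = 0; t < x.length; t++) {
      coefficients[t] += x[t];
    }
    observationCount++;
  }

  // computeParameters(): normalize the accumulated sums by the count so the
  // coefficients become the model used in the next iteration
  public void computeParameters() {
    if (observationCount == 0) {
      return;
    }
    for (int t = 0; t < coefficients.length; t++) {
      coefficients[t] /= observationCount;
    }
    observationCount = 0;
  }
}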
L1ModelDistribution needs to compute a list of models from its prior
and posterior distributions. What is known about each prior model?
M[t] should have some non-zero coefficients but we don't know which
ones? Seems like we could pick a few at random. Even if they are all
identical with empty Ms, the multinomial will still force the data
values into different models and, after the iteration is over, the
models will all be different and will diverge from each other as they
(hopefully) converge upon a description of the corpus. That's a little
like what kMeans does with random initial clusters and how Dirichlet
works with NormalModelDistributions (all prior models are identical
with zero mean coefficients).
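Continuing the same sketch for the distribution side (again my names and a
guess at the behavior, not necessarily the patch's API), the prior really
can be nothing more than identical empty models:

import java.util.ArrayList;
import java.util.List;

public class L1ModelDistributionSketch {
  private final int cardinality;

  public L1ModelDistributionSketch(int cardinality) {
    this.cardinality = cardinality;
  }

  // Prior models: all identical, empty (zero-coefficient) models of the
  // proper cardinality. The multinomial sampling over pdf * mixture weight
  // is what pushes the data into different models on the first iteration.
  public List<L1ModelSketch> sampleFromPrior(int howMany) {
    List<L1ModelSketch> models = new ArrayList<>(howMany);
    for (int i = 0; i < howMany; i++) {
      models.add(new L1ModelSketch(cardinality));
    }
    return models;
  }

  // Posterior models: whatever each model accumulated this iteration,
  // normalized by computeParameters().
  public List<L1ModelSketch> samplePosteriorModels(List<L1ModelSketch> models) {
    for (L1ModelSketch model : models) {
      model.computeParameters();
    }
    return models;
  }
}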
This has a lot of question marks in it but I'm pressing send anyhow,
Jeff
Ted Dunning wrote:
On Tue, Jan 19, 2010 at 10:58 AM, Jeff Eastman
<[email protected]> wrote:
Looking in MAHOUT-228-3.patch, I don't see any sparse vectorizer.
Did you
have another patch in mind?
There should have been one. Let me check to figure out the name.
I'm trying to wrap my mind around "L-1 model distribution".
For the classifier learning, what we have is a prior distribution for
classifiers that has probability proportional to exp(-sum(abs(w_i))).
The log of this probability is -sum(abs(w_i)) = L_1(w), which gives the
name.
This log probability is what is used as a regularization term in the
optimization of the classifier.
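In code the penalty is nothing more than this (illustrative only, my names):

public final class L1Prior {
  private L1Prior() {}

  // log of exp(-sum(abs(w_i))) is just -sum(abs(w_i)) = -L_1(w)
  public static double logPrior(double[] w) {
    double sum = 0.0;
    for (double wi : w) {
      sum += Math.abs(wi);
    }
    return -sum;
  }

  // regularized objective = log-likelihood + lambda * log-prior
  public static double regularizedObjective(double logLikelihood,
                                            double lambda, double[] w) {
    return logLikelihood + lambda * logPrior(w);
  }
}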
It isn't obvious from this definition, but this prior/regularizer has the
effect of preferring sparse models (for classification). Where L_2 priors
prefer lots of small weights in ambiguous conditions because the penalty
on large coefficients is so large, L_1 priors prefer to focus the weight
on one or a few larger coefficients.
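A quick back-of-the-envelope comparison (illustrative Java, not Mahout
code) makes the point: splitting one unit of weight across two coefficients
leaves the L_1 penalty unchanged but halves the L_2 penalty, so L_2 prefers
spreading weight around while L_1 is just as happy concentrating it and
zeroing out the rest.

public class PenaltyComparison {
  public static void main(String[] args) {
    double[] concentrated = {1.0, 0.0};
    double[] spread = {0.5, 0.5};
    System.out.printf("L1: concentrated=%.2f spread=%.2f%n",
        l1(concentrated), l1(spread)); // 1.00 vs 1.00
    System.out.printf("L2: concentrated=%.2f spread=%.2f%n",
        l2(concentrated), l2(spread)); // 1.00 vs 0.50
  }

  static double l1(double[] w) {
    double sum = 0.0;
    for (double wi : w) {
      sum += Math.abs(wi);
    }
    return sum;
  }

  static double l2(double[] w) {
    double sum = 0.0;
    for (double wi : w) {
      sum += wi * wi;
    }
    return sum;
  }
}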
.... Would an L-1 model vector only have integer-valued elements?
In the sense that 0 is an integer, yes. :-)
But what it prefers is zero-valued coefficients.