Ted,
Looking in MAHOUT-228-3.patch, I don't see any sparse vectorizer. Did
you have another patch in mind?
I'm trying to wrap my mind around "L-1 model distribution". I recall the
earlier discussions of L-n norms on the list related to our distance
measures, but I cannot connect the dots. Would an L-1 model vector only
have integer-valued elements?
Jeff
Ted Dunning wrote:
Hijacking the sparse vectorizer from the SGD patch might help with this.
Likewise, using an L-1 model distribution would enforce sparseness by nature
(I think). Sampling from the L-1 prior might be a bit of a trip.
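For what it's worth, a single draw from a Laplace (L-1) prior is easy via the
inverse CDF; it's the posterior sampling that would be the trip. A rough Java
sketch, with a made-up class name and parameters, not anything in the current
code:

import java.util.Random;

// Inverse-CDF draw from a Laplace(mu, b) distribution, which is the prior
// that corresponds to an L-1 penalty on the model parameters.
public final class LaplacePriorSampler {
  private final Random random = new Random();

  public double sample(double mu, double b) {
    double u = random.nextDouble() - 0.5;  // uniform on [-0.5, 0.5)
    return mu - b * Math.signum(u) * Math.log(1.0 - 2.0 * Math.abs(u));
  }
}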
On Mon, Jan 18, 2010 at 4:27 PM, Jeff Eastman <[email protected]> wrote:
I think you will need to bound your model dimensionality to use Dirichlet.
If you are using TF-IDF vectors to represent your documents, I would think
these would all have the same maximum cardinality, which you could specify
for the modelPrototype size. I just committed a new model distribution
(SparseNormalModelDistribution) that includes a heuristic
sampleFromPosterior() to remove small mean element values and preserve model
sparseness. It's probably bogus, but it's a place to begin.
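In case it helps to see the heuristic in isolation, it amounts to something
like the sketch below - plain Java over a map-based sparse representation
rather than the actual Vector API, with an arbitrary placeholder threshold:

import java.util.Iterator;
import java.util.Map;

// Drop posterior mean entries whose magnitude falls below a threshold so
// the model stays sparse between iterations; the threshold needs tuning.
public final class SparsenessHeuristic {
  public static void pruneSmallValues(Map<Integer, Double> sparseMean, double threshold) {
    Iterator<Map.Entry<Integer, Double>> it = sparseMean.entrySet().iterator();
    while (it.hasNext()) {
      if (Math.abs(it.next().getValue()) < threshold) {
        it.remove();
      }
    }
  }
}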
I have also written one new unit test that runs in memory over a small
50-d sparse model and 100 50-d sparse vectors. It does not explode.
Just do another svn update before you begin, to pick up those changes.
Bogdan Vatkov wrote:
Well, dimensions - I am just using a slightly modified version of
LuceneDriver (I added stopword removal and regex removal of incoming
terms), so I guess it is just a list of one-dimensional vectors of varying
length.
I will try to run the new code tomorrow.
On Mon, Jan 18, 2010 at 10:18 PM, Jeff Eastman <[email protected]> wrote:
Yes, they're all in trunk. Just do an svn update and mvn install to get them.
BTW, what's the dimensionality of your data?
Jeff
Bogdan Vatkov wrote:
Hi Jeff,
I will try with the NormalModelDistribution, but I am wondering how to
obtain "MAHOUT-251" - is this a tag in SVN, or what is it? How can I get
the source containing the changes - do I simply sync from trunk? I suppose
I have to run mvn install after that, right?
Best regards,
Bogdan
On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <[email protected]> wrote:
Bogdan,
Recent resolution of MAHOUT-251 should allow you to experiment with
Dirichlet clustering for text models with arbitrary dimensionality. I
suggest starting with the NormalModelDistribution with a large sparse
vector as its prototype. The other model distributions create sampled
values for all the prior model dimensions, negating any value of using
sparse vectors for their prototypes.
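To illustrate why that negates the sparsity, a prior that draws a value for
every index yields a dense parameter vector no matter how the prototype was
stored. Schematic only, not the actual ModelDistribution code:

import java.util.Random;

// Schematic: sampling every dimension of the prior turns any "sparse"
// prototype into a fully dense mean vector of the same cardinality.
public final class DensePriorSketch {
  public static double[] sampleEveryDimension(int cardinality, Random random) {
    double[] mean = new double[cardinality];
    for (int i = 0; i < cardinality; i++) {
      mean[i] = random.nextGaussian();  // non-zero with probability 1
    }
    return mean;                        // 50k dims => 50k stored doubles
  }
}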
It may in fact be necessary to introduce a new ModelDistribution and Model
so that sparse model elements will not fill up with insignificant values.
After the first iteration computes the new posterior model parameters from
the observations, many of these values will likely be small, so some
heuristic would be needed to preserve model sparseness by removing them
altogether. If all these values are retained, it is probably better to use
a dense vector representation. A 50k-dimensional model will be a real
compute hog if it is not kept sparse somehow. Maybe sampleFromPosterior()
or sample() would be good places to embed this heuristic.
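For a rough sense of scale on the 50k point (the ~100 retained non-zeros per
model is a made-up figure for illustration):

// Back-of-envelope: a dense 50k-d model stores and touches every element
// on each update; a sparse one only the retained non-zeros.
public final class ModelSizeEstimate {
  public static void main(String[] args) {
    int dims = 50000;
    int nonZeros = 100;                       // hypothetical retained entries
    long denseBytes = dims * 8L;              // ~400 KB of doubles per model
    long sparseBytes = nonZeros * (4L + 8L);  // ~1.2 KB (int index + double value)
    System.out.println("dense ~" + denseBytes + " B, sparse ~" + sparseBytes + " B");
  }
}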
I'll begin writing some tests to experiment with these models.