Hi Jeff,

I will try the NormalModelDistribution, but I am wondering how to obtain
"MAHOUT-251". Is it a tag in SVN, or something else? How can I get the
source containing the changes: do I simply sync from trunk? I suppose I have
to run mvn install after that, right?

Best regards,
Bogdan

On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <[email protected]> wrote:

> Bogdan,
>
> Recent resolution of MAHOUT-251 should allow you to experiment with
> Dirichlet clustering for text models with arbitrary dimensionality. I
> suggest starting with the NormalModelDistribution with a large sparse vector
> as its prototype.  The other model distributions create sampled values for
> all the prior model dimensions, negating any value of using sparse vectors
> for their prototypes.
>
> It may in fact be necessary to introduce a new ModelDistribution and Model
> so that sparse model elements will not fill up with insignificant values.
> After the first iteration computes the new posterior model parameters from
> the observations, many of these values will likely be small so some
> heuristic would be needed to preserve model sparseness by removing them
> altogether. If all these values are retained, it is probably better to use a
> dense vector representation. A 50k-dimensional model will be a real compute
> hog if it is not kept sparse somehow. Maybe sampleFromPosterior() or
> sample() would be good places to embed this heuristic.
>
> I'll begin writing some tests to experiment with these models.
>
>
>
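
For concreteness, the sparseness heuristic described above (dropping
posterior values that are too small to matter) might look roughly like the
sketch below. It is only an illustration under stated assumptions:
SparsePruner, prune() and the epsilon threshold are hypothetical names, a
plain java.util.Map stands in for a sparse vector, and none of this is
Mahout API.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Mahout: a sparse vector as index -> value. */
final class SparsePruner {

  private SparsePruner() {}

  /**
   * Removes, in place, every entry whose absolute value is below epsilon.
   * Returns the number of entries removed.
   */
  static int prune(Map<Integer, Double> sparse, double epsilon) {
    int removed = 0;
    Iterator<Map.Entry<Integer, Double>> it = sparse.entrySet().iterator();
    while (it.hasNext()) {
      if (Math.abs(it.next().getValue()) < epsilon) {
        it.remove(); // safe removal while iterating
        removed++;
      }
    }
    return removed;
  }

  public static void main(String[] args) {
    // Assumed example data: a few non-zero dimensions of a 50k-dimensional model.
    Map<Integer, Double> mean = new TreeMap<>();
    mean.put(3, 0.82);
    mean.put(17, 1.0e-9);
    mean.put(120, -2.0e-7);
    mean.put(49_999, -0.4);

    int removed = prune(mean, 1.0e-6);
    System.out.println(removed + " removed, remaining indices: " + mean.keySet());
  }
}
```

A step like this could run once per iteration, after the posterior
parameters are computed. The threshold is an assumption and would need
tuning against real data.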


