As for the dimensions: I am just using a slightly modified version of
LuceneDriver (I added stopword removal and regex-based removal of incoming
terms), so I guess it is just a list of one-dimensional vectors of varying
length.
I will try to run the new code tomorrow.
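
For the sparseness heuristic suggested in the quoted message below, here is a
minimal sketch of what "removing small values" could look like. It is plain
Java and does not use any Mahout classes: SparsePruner, prune() and the
epsilon threshold are hypothetical names chosen for illustration, and the
sparse vector is simply modelled as a map from term index to weight. Where
such a step would hook into sampleFromPosterior() or sample() is left open.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper, NOT part of Mahout: drops near-zero entries from a
 * sparse vector represented as a map from dimension index to value.
 */
final class SparsePruner {

  private SparsePruner() {
    // static utility, no instances
  }

  /**
   * Returns a new map containing only the entries whose absolute value is
   * at least epsilon. The input map is not modified.
   */
  static Map<Integer, Double> prune(Map<Integer, Double> sparse, double epsilon) {
    Map<Integer, Double> kept = new HashMap<>();
    for (Map.Entry<Integer, Double> entry : sparse.entrySet()) {
      if (Math.abs(entry.getValue()) >= epsilon) {
        kept.put(entry.getKey(), entry.getValue());
      }
    }
    return kept;
  }
}
```

Calling SparsePruner.prune(model, 1e-6) on the posterior parameters after each
iteration would keep only the entries with magnitude of at least 1e-6. The
threshold value is an assumption and would need tuning against real data.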

On Mon, Jan 18, 2010 at 10:18 PM, Jeff Eastman
<[email protected]> wrote:

> Yes, they're all in trunk. Just do an svn update and mvn install to get
> them.
>
> BTW, what's the dimensionality of your data?
>
> Jeff
>
>
>
> Bogdan Vatkov wrote:
>
>> Hi Jeff,
>>
>> I will try the NormalModelDistribution, but I am wondering how to obtain
>> "MAHOUT-251". Is this a tag in SVN? How can I get the source containing
>> the changes: do I simply sync from trunk? I suppose I have to run
>> mvn install after that, right?
>>
>> Best regards,
>> Bogdan
>>
>> On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <[email protected]> wrote:
>>
>>
>>
>>> Bogdan,
>>>
>>> Recent resolution of MAHOUT-251 should allow you to experiment with
>>> Dirichlet clustering for text models with arbitrary dimensionality. I
>>> suggest starting with the NormalModelDistribution with a large sparse
>>> vector as its prototype. The other model distributions create sampled
>>> values for all the prior model dimensions, negating any value of using
>>> sparse vectors for their prototypes.
>>>
>>> It may in fact be necessary to introduce a new ModelDistribution and
>>> Model so that sparse model elements will not fill up with insignificant
>>> values. After the first iteration computes the new posterior model
>>> parameters from the observations, many of these values will likely be
>>> small, so some heuristic would be needed to preserve model sparseness by
>>> removing them altogether. If all these values are retained, it is
>>> probably better to use a dense vector representation. A 50k-dimensional
>>> model will be a real compute hog if it is not kept sparse somehow. Maybe
>>> sampleFromPosterior() or sample() would be good places to embed this
>>> heuristic.
>>>
>>> I'll begin writing some tests to experiment with these models.

-- 
Best regards,
Bogdan
