RE: Potential regression in ASFEmail KMeans clustering

Joshi, Shrinivas Wed, 18 Jul 2012 22:10:35 -0700

Hi,

It looks like Mahout 0.6 was eliminating more vectors during tfidf processing 
than the current trunk does. The changes in 
https://issues.apache.org/jira/browse/MAHOUT-973 cause the increased size of 
tfidf-vectors mentioned below, and hence the longer execution times for 
clustering iteration jobs compared to Mahout 0.6. The asf-email example 
script that comes with the distribution sets maxDFPercent to 90 and we have 
left it unchanged in our experiments.
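
To make the effect of maxDFPercent concrete, here is a minimal sketch of 
document-frequency pruning. It is an illustration only, not Mahout's 
implementation: the function name prune_by_max_df, the toy documents, and the 
exact comparison against the threshold are all assumptions made for the 
example. The point is simply that the stricter the pruning, the smaller the 
resulting vectors.

```python
from collections import Counter


def prune_by_max_df(docs, max_df_percent):
    """Drop terms that occur in more than max_df_percent of the documents.

    Hypothetical helper for illustration; not a Mahout API.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        doc_freq.update(set(doc))
    # Assumed rule: keep a term if its document count is within the limit.
    max_df = n_docs * max_df_percent / 100.0
    keep = {term for term, count in doc_freq.items() if count <= max_df}
    return [[term for term in doc if term in keep] for doc in docs]


# Toy corpus: "the" appears in every document, the other terms in one or two.
docs = [
    ["mahout", "kmeans", "the"],
    ["mahout", "tfidf", "the"],
    ["hadoop", "job", "the"],
    ["vector", "norm", "the"],
]

pruned = prune_by_max_df(docs, 90)      # "the" is in 100% of docs, so it goes
unpruned = prune_by_max_df(docs, 100)   # nothing exceeds 100%, so all terms stay
```

With a threshold of 90, only the term present in every document is removed; a 
version that prunes more aggressively would produce smaller vectors and 
therefore cheaper clustering iterations.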

Based on my level of expertise in Mahout, it looks to me like Mahout 0.6 was 
broken and that this is not a regression. It would be nice if someone who is 
more familiar with this part of the codebase could confirm my evaluation of 
this problem.

Thanks,
-Shrinivas

-----Original Message-----
From: Joshi, Shrinivas [mailto:[email protected]] 
Sent: Thursday, July 12, 2012 12:21 AM
To: [email protected]
Subject: RE: Potential regression in ASFEmail KMeans clustering

I spent some more time looking into this. I think the problem here might be 
more than just the value of norm that gets passed to the 
createTermFrequencyVectors methods. I will update this thread if I can 
root-cause this issue. I would appreciate it if someone has any 
pointers/suggestions about what might be the issue here.

-Shrinivas

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, July 11, 2012 7:02 PM
To: [email protected]
Subject: Re: Potential regression in ASFEmail KMeans clustering

Speaking without actually looking into this, I would say that a -1 norm doesn't 
make good sense.  If the default value of the norm exponent changed to -1, it 
would wreak various kinds of havoc and would be a Bad Thing(tm).

However, regardless of that, since I haven't looked at the code in question my 
comment has a kind of low chance of being on target.  It is just that what you 
said does sound very plausible.
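
For readers following the norm discussion, here is a minimal sketch of p-norm 
normalization. It is not Mahout code. The function normalize and the constant 
NO_NORMALIZATION are hypothetical names, and treating -1.0 as a "skip 
normalization" sentinel rather than as a real exponent is an assumption made 
for the example. It only shows why an exponent of 2 rescales a vector to unit 
length, while a value of -1 cannot be used as an actual norm exponent.

```python
import math

# Assumed sentinel: -1.0 means "leave the vector as it is".
NO_NORMALIZATION = -1.0


def normalize(vec, power):
    """Divide vec by its p-norm, or return a copy if normalization is off.

    Hypothetical helper for illustration; not a Mahout API.
    """
    if power == NO_NORMALIZATION:
        return list(vec)
    if power < 1:
        # Exponents below 1 do not define a norm.
        raise ValueError("norm exponent must be >= 1")
    norm = sum(abs(x) ** power for x in vec) ** (1.0 / power)
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


tf = [3.0, 4.0]
unit = normalize(tf, 2.0)              # scaled so the Euclidean length is 1
raw = normalize(tf, NO_NORMALIZATION)  # returned unchanged
```

Under these assumptions, the number of non-zero entries is the same with or 
without normalization; only the weights change.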

On Wed, Jul 11, 2012 at 3:14 PM, Joshi, Shrinivas
<[email protected]> wrote:

> ... Basically, with the current trunk and with the norm value of -1.0f 
> (which is what gets passed to 
> DictionaryVectorizer.createTermFrequencyVectors method in case 
> processIdf Boolean is true) I see no difference in the size of 
> tf-vectors and tf-idf vectors. ...
>
> If I pass norm value of 2.0f to
> DictionaryVectorizer.createTermFrequencyVectors method in the current 
> trunk then I do not see the regression.
>
