Hi,

It looks like Mahout 0.6 was eliminating more vectors during tfidf processing than the current trunk does. The changes in https://issues.apache.org/jira/browse/MAHOUT-973 cause the increased size of the tfidf-vectors mentioned below, and hence the longer execution times for the clustering iteration jobs compared to Mahout 0.6. The asf-email example script that comes with the distribution sets maxDFPercent to 90, and we left it unchanged in our experiments.
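To make the effect of maxDFPercent concrete, here is a minimal sketch of document-frequency pruning. It is not Mahout's implementation: the function name `prune_by_max_df`, the toy documents, and the choice to treat the threshold as inclusive (a term is kept when its document frequency is at most the given percentage of documents) are all assumptions made for illustration.

```python
from collections import Counter


def prune_by_max_df(docs, max_df_percent):
    """Drop terms that occur in more than max_df_percent of the documents.

    Hypothetical helper for illustration only; Mahout's own pruning
    may differ in details such as how the threshold is compared.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing the term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    max_df = n_docs * max_df_percent / 100.0
    keep = {term for term, count in df.items() if count <= max_df}
    return [[term for term in doc if term in keep] for doc in docs]


# Toy corpus: "the" appears in every document, so it exceeds a 90% threshold.
docs = [
    ["mahout", "kmeans", "the"],
    ["mahout", "tfidf", "the"],
    ["hadoop", "the"],
]
pruned = prune_by_max_df(docs, 90)
unpruned = prune_by_max_df(docs, 100)
```

In this sketch, pruning that is skipped or applied less aggressively leaves more terms per document, which is the kind of vector growth described above.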
Based on my level of expertise in Mahout, it looks to me like Mahout 0.6 was broken and that this is not a regression. It would be nice if someone who is more familiar with this part of the codebase could confirm my evaluation of this problem.

Thanks,
-Shrinivas

-----Original Message-----
From: Joshi, Shrinivas [mailto:[email protected]]
Sent: Thursday, July 12, 2012 12:21 AM
To: [email protected]
Subject: RE: Potential regression in ASFEmail KMeans clustering

I spent some more time looking into this. I think the problem here might be more than just the value of norm that gets passed to the createTermFrequencyVectors methods. I will update this thread once I can root-cause this issue. I would appreciate any pointers/suggestions about what might be the issue here.

-Shrinivas

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, July 11, 2012 7:02 PM
To: [email protected]
Subject: Re: Potential regression in ASFEmail KMeans clustering

Speaking without actually looking into this, I would say that a -1 norm doesn't make good sense. If the default value of the norm exponent changed to -1, it would wreak various kinds of havoc and would be a Bad Thing(tm).

However, regardless of that, since I haven't looked at the code in question, my comment has a fairly low chance of being on target. It is just that what you said does sound very plausible.

On Wed, Jul 11, 2012 at 3:14 PM, Joshi, Shrinivas <[email protected]> wrote:

> ... Basically, with the current trunk and with the norm value of -1.0f
> (which is what gets passed to
> DictionaryVectorizer.createTermFrequencyVectors method in case
> processIdf Boolean is true) I see no difference in the size of
> tf-vectors and tf-idf vectors. ...
>
> If I pass norm value of 2.0f to
> DictionaryVectorizer.createTermFrequencyVectors method in the current
> trunk then I do not see the regression.
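For readers following the norm discussion (-1.0f versus 2.0f), the sketch below shows what a norm exponent does to a sparse term vector. It is illustrative only and not Mahout code: the function `p_normalize` is hypothetical, and treating a non-positive exponent as "skip normalization" is an assumption made for this example, not something confirmed in the thread.

```python
def p_normalize(vec, p):
    """Scale a sparse vector (dict of term -> weight) to unit p-norm.

    Assumption for this sketch: a non-positive p means "do not normalize",
    so the vector is returned unchanged.
    """
    if p <= 0 or not vec:
        return dict(vec)
    norm = sum(abs(weight) ** p for weight in vec.values()) ** (1.0 / p)
    if norm == 0:
        return dict(vec)
    return {term: weight / norm for term, weight in vec.items()}


tf = {"mahout": 3.0, "kmeans": 4.0}
unit = p_normalize(tf, 2.0)   # Euclidean norm of (3, 4) is 5
raw = p_normalize(tf, -1.0)   # left as-is under the assumption above
```

Note that in this sketch normalization rescales the weights but never changes the number of non-zero entries. Whether the same holds inside Mahout's vectorization pipeline is not established here.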
