Hi Sean,

Thanks for the feedback. I did notice the changes in the loop bodies, but missed that they had been swapped with each other.
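For reference, here is a minimal Java sketch of the kind of swap in question. It is not the Mahout code: the class name BranchSwapSketch and both method names are hypothetical, and foo, doFoo and doBar are just the placeholder names from the quoted message. It only illustrates that negating the condition and swapping the two bodies together leaves the behavior unchanged.

```java
public class BranchSwapSketch {

    // Shape before the change: negated condition, "bar" body first.
    static String negatedForm(boolean foo) {
        if (!foo) {
            return "doBar";
        } else {
            return "doFoo";
        }
    }

    // Shape after the change: positive condition, bodies swapped.
    static String positiveForm(boolean foo) {
        if (foo) {
            return "doFoo";
        } else {
            return "doBar";
        }
    }

    public static void main(String[] args) {
        // Both forms pick the same branch for every input.
        for (boolean foo : new boolean[] {true, false}) {
            System.out.println(foo + " -> " + negatedForm(foo) + " / " + positiveForm(foo));
        }
    }
}
```

Since both forms agree for true and for false, a diff that shows the bodies changing places does not by itself indicate a functional change.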
Anyway, I think the regression is caused by a difference in how the norm value takes effect in 0.6 versus the current trunk. With the current trunk and a norm value of -1.0f (which is what gets passed to the DictionaryVectorizer.createTermFrequencyVectors method when the processIdf Boolean is true), I see no difference in size between the tf-vectors and the tf-idf vectors. We are using 12 reducers for the seq2sparse jobs, and each file in the tf-vectors folder is around 630MB. Files in the tfidf-vectors folder are more or less the same size.

However, with Mahout 0.6, I see tf-vectors at around 630MB, whereas the tfidf-vectors size goes down to about 100MB. If I pass a norm value of 2.0f to the DictionaryVectorizer.createTermFrequencyVectors method in the current trunk, I do not see the regression.

It appears that there has been some change in the trunk code in the way the norm value gets handled. Maybe 0.6 was handling the norm value incorrectly? I do not have sufficient background in ML/Mahout to conclude anything here.

Please let me know if you have any feedback.

Thanks,
-Shrinivas

-----Original Message-----
From: Sean Owen [mailto:[email protected]]
Sent: Wednesday, July 11, 2012 3:35 AM
To: [email protected]
Subject: Re: Potential regression in ASFEmail KMeans clustering

Oops, hit enter too early --

This changes

  if (!foo) {
    doBar();
  } else {
    doFoo();
  }

to

  if (foo) {
    doFoo();
  } else {
    doBar();
  }

(among other changes) for readability. I imagine there could be a problem here but this isn't the change that did it.

On Wed, Jul 11, 2012 at 9:33 AM, Sean Owen <[email protected]> wrote:
> I made the change on the line in question, but it can't be the problem
> since it did not change the functionality. To see that you have to
> look at the rest of the change. It is changing...
>
> if (! ) {
>   A;
> } else {
>   B
> }
>
> to
>
> if (foo) {
>   B
>
