Hi Sean,

Thanks for the feedback. I did notice the changes in the loop bodies, but I 
missed that they had been swapped with each other.

Anyway, I think the regression is caused by a difference in how the norm 
value takes effect in 0.6 versus the current trunk. With the current trunk 
and a norm value of -1.0f (which is what gets passed to the 
DictionaryVectorizer.createTermFrequencyVectors method when the processIdf 
Boolean is true), I see no difference in size between the tf-vectors and the 
tf-idf vectors. We are using 12 reducers for the seq2sparse jobs, and each 
file in the tf-vectors folder is around 630MB. Files in the tfidf-vectors 
folder are more or less the same size.

However, with Mahout 0.6, the tf-vectors files are around 630MB, whereas the 
tfidf-vectors files go down to about 100MB.

If I pass a norm value of 2.0f to the 
DictionaryVectorizer.createTermFrequencyVectors method in the current trunk, 
I do not see the regression.

It appears that the way the norm value is handled has changed in trunk. 
Maybe 0.6 was handling the norm value incorrectly? I do not have sufficient 
background in ML/Mahout to conclude anything here.

Please let me know if you have any feedback.

Thanks,
-Shrinivas

-----Original Message-----
From: Sean Owen [mailto:[email protected]] 
Sent: Wednesday, July 11, 2012 3:35 AM
To: [email protected]
Subject: Re: Potential regression in ASFEmail KMeans clustering

Oops, hit enter too early --

This changes

if (!foo) {
  doBar();
} else {
  doFoo();
}

to

if (foo) {
  doFoo();
} else {
  doBar();
}

(among other changes) for readability.

I imagine there could be a problem here but this isn't the change that did it.
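
A quick way to check the equivalence is the sketch below. The class name 
BranchSwapSketch is hypothetical, and the returned strings merely stand in 
for the doFoo() and doBar() calls above; it evaluates both forms for both 
values of foo.

```java
// Illustration only: strings stand in for the doFoo()/doBar() calls.
final class BranchSwapSketch {

    // Original form: negated condition, "bar" branch first.
    static String original(boolean foo) {
        if (!foo) {
            return "doBar";
        } else {
            return "doFoo";
        }
    }

    // Rewritten form: condition un-negated and the two bodies swapped.
    static String rewritten(boolean foo) {
        if (foo) {
            return "doFoo";
        } else {
            return "doBar";
        }
    }

    public static void main(String[] args) {
        for (boolean foo : new boolean[] {false, true}) {
            System.out.println(foo + ": " + original(foo) + " / " + rewritten(foo));
        }
    }
}
```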


On Wed, Jul 11, 2012 at 9:33 AM, Sean Owen <[email protected]> wrote:
> I made the change on the line in question, but it can't be the problem 
> since it did not change the functionality. To see that you have to 
> look at the rest of the change. It is changing...
>
> if (!foo) {
>   A;
> } else {
>   B
> }
>
> to
>
> if (foo) {
>   B
>
