It looks like this regression is caused by incorrect processing of IDFs. The 
relevant difference I noticed between current trunk and the Mahout 0.6 release 
is in this file:
core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java

I am not sure whether the changes in this class were intentional or 
accidental. The following patch seems to address the regression that we are 
seeing. With it, each iteration job now runs in less than 3 minutes on our test 
cluster. Without this change, we see them taking as long as 1 hour 20 minutes.

Let me know if I am missing something here. 

Index: core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
===================================================================
--- core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java (revision 1359759)
+++ core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java (working copy)
@@ -268,7 +268,7 @@
           ? DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER+"-toprune"
           : DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER;

-      if (processIdf) {
+      if (!processIdf) {
         DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
                                                         outputDir,
                                                         tfDirName,

Thanks,
-Shrinivas

-----Original Message-----
From: Joshi, Shrinivas [mailto:[email protected]] 
Sent: Friday, July 06, 2012 5:18 PM
To: [email protected]
Subject: Potential regression in ASFEmail KMeans clustering

Just wanted to find out if this is known/expected behavior with Mahout trunk. 
We are noticing that the KMeans iteration jobs that are part of the ASFEmail 
sample are taking longer to execute than with the Mahout 0.6 release. Using the 
Mahout 0.6 release on the test cluster that we have, we see these jobs/steps 
taking no more than 6-7 minutes. However, with the trunk code that I checked 
out a few days ago, they take anywhere between 25 and 50 minutes. Has anybody 
else seen something similar?

Thanks,
-Shrinivas
