It looks like this regression is caused by incorrect processing of IDFs. The
difference I noticed between current trunk and the Mahout 0.6 release in this
part of the code is in the file
core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java.
I am not sure whether the changes in this class were part of a valid change or
were accidental. The following patch seems to address the regression that we
are seeing. With it, the iteration jobs now run in less than 3 minutes on our
test cluster. Without it, we see them take as long as 1 hour 20 minutes.
Let me know if I am missing something here.
Index: core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
===================================================================
--- core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java	(revision 1359759)
+++ core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java	(working copy)
@@ -268,7 +268,7 @@
         ? DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER+"-toprune"
         : DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER;
-      if (processIdf) {
+      if (!processIdf) {
         DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
                                                         outputDir,
                                                         tfDirName,
Thanks,
-Shrinivas
-----Original Message-----
From: Joshi, Shrinivas [mailto:[email protected]]
Sent: Friday, July 06, 2012 5:18 PM
To: [email protected]
Subject: Potential regression in ASFEmail KMeans clustering
Just wanted to find out if this is known/expected behavior with Mahout trunk.
We are noticing that the KMeans iteration jobs that are part of the ASFEmail
sample take longer to execute than they do with the Mahout 0.6 release. Using
the Mahout 0.6 release on our test cluster, we see these jobs/steps take no
more than 6-7 minutes. However, with the trunk code that I checked out a few
days ago, they take anywhere from 25 to 50 minutes. Has anybody else seen
something similar?
Thanks,
-Shrinivas