logical error with term counting in 
org.apache.mahout.vectorizer.DictionaryVectorizer
-------------------------------------------------------------------------------------

                 Key: MAHOUT-808
                 URL: https://issues.apache.org/jira/browse/MAHOUT-808
             Project: Mahout
          Issue Type: Bug
    Affects Versions: 0.5
            Reporter: Phil
            Priority: Critical


When using Mahout LDA for topic modeling, creating vectors from a SequenceFile 
is an essential step (see 
https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text). 
However, when I set --minSupport somewhat higher, I found that the term counts 
were wrong. There is a logical error at 
org.apache.mahout.vectorizer.DictionaryVectorizer.java, line 335:
    job.setCombinerClass(TermCountReducer.class);

Now look at line 41 of org.apache.mahout.vectorizer.term.TermCountReducer.java:
    if (sum >= minSupport) {
      context.write(key, new LongWritable(sum));
    }

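Because TermCountReducer is also used as the combiner, the minSupport filter 
is applied to each partial sum. A term whose partial counts each fall below 
minSupport is dropped at the combiner even though its total count would pass 
the threshold. This is not the expected behavior.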
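
To illustrate the effect, here is a minimal sketch in plain Java with no Hadoop 
dependency. It is not Mahout code: the class CombinerFilterDemo, its methods, 
and the sample counts are hypothetical and only simulate one combiner per input 
split followed by a single reducer. It assumes the term "mahout" occurs 3 times 
in each of two splits and that minSupport is 5.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of combiner/reducer behavior; not part of Mahout.
public class CombinerFilterDemo {

  // Sums the counts per term, then drops terms whose sum is below minSupport.
  // Passing minSupport = 1 keeps every term, i.e. a sum-only step.
  public static Map<String, Long> sumAndFilter(List<Map<String, Long>> inputs,
                                               long minSupport) {
    Map<String, Long> sums = new HashMap<>();
    for (Map<String, Long> input : inputs) {
      for (Map.Entry<String, Long> e : input.entrySet()) {
        sums.merge(e.getKey(), e.getValue(), Long::sum);
      }
    }
    sums.values().removeIf(sum -> sum < minSupport);
    return sums;
  }

  // Runs one combiner per split, then a single reducer over the combiner output.
  public static Map<String, Long> run(List<Map<String, Long>> splits,
                                      long combinerMinSupport,
                                      long reducerMinSupport) {
    List<Map<String, Long>> combined = new ArrayList<>();
    for (Map<String, Long> split : splits) {
      combined.add(sumAndFilter(List.of(split), combinerMinSupport));
    }
    return sumAndFilter(combined, reducerMinSupport);
  }

  public static void main(String[] args) {
    // Assumed sample data: "mahout" appears 3 times in each of two splits.
    List<Map<String, Long>> splits =
        List.of(Map.of("mahout", 3L), Map.of("mahout", 3L));
    long minSupport = 5;

    // Filtering in the combiner: each partial sum (3) is below 5, so the
    // term never reaches the reducer. Prints: filtering combiner: {}
    System.out.println("filtering combiner: " + run(splits, minSupport, minSupport));

    // Sum-only combiner: the reducer sees the total (6) and keeps the term.
    // Prints: sum-only combiner:  {mahout=6}
    System.out.println("sum-only combiner:  " + run(splits, 1, minSupport));
  }
}
```

In this sketch the term survives only when the threshold is applied once, on 
the final sum. This suggests that the combiner should only add up counts and 
that the minSupport check belongs in the reducer alone.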


        
