term vectors not created in SparseVectorsFromSequenceFiles using tf weighting 
and maxDFSigma filtering
------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-957
                 URL: https://issues.apache.org/jira/browse/MAHOUT-957
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.6
            Reporter: John Conwell
             Fix For: 0.6


The SparseVectorsFromSequenceFiles throws an exception when you want term 
frequency vectors output, with the maxDFSigma filtering option.

Basically the if / else if section shown below, will skip calling 
DictionaryVectorizer.createTermFrequencyVectors when have that combination.  
The condition will create vectors when you want tf vectors without maxDFSigma 
filtering, or tfidf vectors with maxDFSigma filtering, but if you want tf 
vectors with maxDFSigma filtering, it totally skips over the call to 
createTermFrequencyVectors, and later on throws an exception because the vector 
input path doesn't exist.

For example, the following cmd line will reproduce this situation:
bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o 
/Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2 
--minDF 2 --maxDFSigma 3 -seq

//the suspect code at line ~267 in 
DictionaryVectorizer.createTermFrequencyVectors
if (!processIdf && !shouldPrune) {
        DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, 
outputDir, tfDirName, conf, minSupport, maxNGramSize,
          minLLRValue, norm, logNormalize, reduceTasks, chunkSize, 
sequentialAccessOutput, namedVectors);
} else if (processIdf) {
        DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, 
outputDir, tfDirName, conf, minSupport, maxNGramSize,
          minLLRValue, -1.0f, false, reduceTasks, chunkSize, 
sequentialAccessOutput, namedVectors);
}


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to