[ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195610#comment-13195610 ]
Hudson commented on MAHOUT-957:
-------------------------------

Integrated in Mahout-Quality #1325 (See [https://builds.apache.org/job/Mahout-Quality/1325/])
    MAHOUT-957: handle pruning of tf weights

gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1237072
Files :
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFilesTest.java

> term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-957
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-957
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: John Conwell
>            Assignee: Grant Ingersoll
>             Fix For: 0.6
>
>         Attachments: MAHOUT-957.patch
>
>
> SparseVectorsFromSequenceFiles throws an exception when you request term
> frequency (tf) vectors as output together with the maxDFSigma filtering option.
> The if / else if section shown below skips the call to
> DictionaryVectorizer.createTermFrequencyVectors when you have that combination.
> The condition creates vectors when you want tf vectors without maxDFSigma
> filtering, or tfidf vectors with maxDFSigma filtering. If you want tf
> vectors with maxDFSigma filtering, it skips the call to
> createTermFrequencyVectors entirely, and a later step throws an exception
> because the vector input path doesn't exist.
> For example, the following command line reproduces this situation:
> bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2 --minDF 2 --maxDFSigma 3 -seq
>
> // the suspect code at line ~267 in SparseVectorsFromSequenceFiles, around the
> // calls to DictionaryVectorizer.createTermFrequencyVectors
> if (!processIdf && !shouldPrune) {
>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>       outputDir, tfDirName, conf, minSupport, maxNGramSize,
>       minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
>       sequentialAccessOutput, namedVectors);
> } else if (processIdf) {
>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>       outputDir, tfDirName, conf, minSupport, maxNGramSize,
>       minLLRValue, -1.0f, false, reduceTasks, chunkSize,
>       sequentialAccessOutput, namedVectors);
> }
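
The branch logic described in the report can be checked in isolation. The sketch below is not Mahout code and not the attached patch: the class TfBranchSketch and its two methods are hypothetical names, and the strings stand in for the real createTermFrequencyVectors calls. The first method mirrors the if / else if quoted in the report. The second shows one possible correction, assuming that normalization can be deferred whenever a pruning or idf pass still follows.

```java
// Sketch only: models which flag combinations reach a
// createTermFrequencyVectors call. Hypothetical names, not Mahout source.
class TfBranchSketch {

    // Mirrors the condition quoted in the report.
    static String reported(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return "normalized";    // tf output, norm/logNormalize applied now
        } else if (processIdf) {
            return "unnormalized";  // tfidf pass follows, called with -1.0f, false
        }
        return "skipped";           // tf weighting with pruning: no call at all
    }

    // One possible correction (an assumption, not the committed fix):
    // only the "tf, no pruning" case is final; every other case still
    // needs the tf vectors as input to a later pass.
    static String assumedFix(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return "normalized";
        }
        return "unnormalized";
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        for (boolean processIdf : flags) {
            for (boolean shouldPrune : flags) {
                System.out.println("processIdf=" + processIdf
                    + " shouldPrune=" + shouldPrune
                    + " reported=" + reported(processIdf, shouldPrune)
                    + " assumedFix=" + assumedFix(processIdf, shouldPrune));
            }
        }
    }
}
```

Under these assumptions, the only combination where the two methods disagree is processIdf=false with shouldPrune=true, which corresponds to "-wt tf" combined with "--maxDFSigma" in the command line above.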