[ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194377#comment-13194377 ]

Grant Ingersoll commented on MAHOUT-957:
----------------------------------------

OK, I can reproduce the bug:
{quote}
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/tmp/foo/tf-vectors-toprune
 at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
 at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
 at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
 at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
 at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
 at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
 at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:366)
 at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
 at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:277)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
{quote}
                
> term vectors not created in SparseVectorsFromSequenceFiles using tf weighting 
> and maxDFSigma filtering
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-957
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-957
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: John Conwell
>            Assignee: Grant Ingersoll
>             Fix For: 0.6
>
>
> SparseVectorsFromSequenceFiles throws an exception when you request term
> frequency (tf) vectors together with the maxDFSigma filtering option.
> The if / else if section shown below skips the call to
> DictionaryVectorizer.createTermFrequencyVectors for that combination.
> The condition creates vectors when you want tf vectors without maxDFSigma
> filtering, or tfidf vectors (with or without maxDFSigma filtering), but if
> you want tf vectors with maxDFSigma filtering, neither branch matches, the
> call to createTermFrequencyVectors never happens, and a later step throws an
> exception because the vector input path doesn't exist.
> For example, the following command line reproduces the problem:
> bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2 --minDF 2 --maxDFSigma 3 -seq
> // the suspect code, at line ~267 of SparseVectorsFromSequenceFiles, around
> // the calls to DictionaryVectorizer.createTermFrequencyVectors
> if (!processIdf && !shouldPrune) {
>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf,
>       minSupport, maxNGramSize, minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
>       sequentialAccessOutput, namedVectors);
> } else if (processIdf) {
>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf,
>       minSupport, maxNGramSize, minLLRValue, -1.0f, false, reduceTasks, chunkSize,
>       sequentialAccessOutput, namedVectors);
> }
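
The gap in the quoted condition can be checked by enumerating the four flag combinations. The following is only a minimal sketch: BranchCheck and callsCreateTf are hypothetical names that are not part of Mahout, and the method copies nothing but the boolean structure of the if / else if above. It also assumes that "-wt tf" corresponds to processIdf == false and that "--maxDFSigma" sets shouldPrune == true, which is how the report describes them.

```java
// Hypothetical stand-alone check; not Mahout code.
class BranchCheck {

  // Returns true when the quoted if / else if would reach a call to
  // DictionaryVectorizer.createTermFrequencyVectors for the given flags.
  public static boolean callsCreateTf(boolean processIdf, boolean shouldPrune) {
    if (!processIdf && !shouldPrune) {
      return true;   // tf weighting, no maxDFSigma pruning
    } else if (processIdf) {
      return true;   // tfidf weighting, pruning on or off
    }
    return false;    // tf weighting with pruning: neither branch matches
  }

  public static void main(String[] args) {
    boolean[] flags = {false, true};
    // Walk all four (processIdf, shouldPrune) combinations.
    for (boolean processIdf : flags) {
      for (boolean shouldPrune : flags) {
        System.out.println("processIdf=" + processIdf
            + " shouldPrune=" + shouldPrune
            + " -> tf vectors created: " + callsCreateTf(processIdf, shouldPrune));
      }
    }
  }
}
```

Only the combination processIdf == false and shouldPrune == true falls through without creating tf vectors, which matches the missing tf-vectors-toprune input path in the stack trace. Any fix would have to make that case reach createTermFrequencyVectors as well; which normalization arguments it should pass there is not settled by the report.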
