[ https://issues.apache.org/jira/browse/MAHOUT-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll resolved MAHOUT-978.
------------------------------------
Resolution: Won't Fix
I'd say Won't Fix, as there is a workaround. Please re-open if you have a
specific patch.
> spectralkmeans utility fails when input filename begins with leading
> underscore
> -------------------------------------------------------------------------------
>
> Key: MAHOUT-978
> URL: https://issues.apache.org/jira/browse/MAHOUT-978
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: Tested on a real Linux-based cluster running Hadoop
> 0.20.2-cdh3u2 and the 0.6 release; also OSX pseudo cluster running Hadoop
> 0.20.203.0 running 16 Feb trunk build.
> Reporter: Dan Brickley
> Priority: Minor
> Attachments: jira-underscore-spectral-log.txt
>
>
> The command-line 'bin/mahout spectralkmeans' utility fails with
> NoSuchElementException after "Loading vector from:
> spectral/output/results2/calculations/diagonal/part-r-00000" when the input
> data in HDFS has a filename beginning with an underscore.
> This was partially reported in the comments on MAHOUT-524, but I believe it
> is now identified as a distinct issue (thanks to Shannon for help
> diagnosing). I have not investigated whether there is an equivalent problem
> for API-based use of this piece of Mahout.
> Steps to reproduce:
> 1. Put the affinity file into HDFS, following
> https://cwiki.apache.org/MAHOUT/spectral-clustering.html (note that node IDs
> count from zero, etc.). Name your file with a leading underscore. For example,
> try http://danbri.org/2012/spectral/dbpedia/_topic_skm.csv and store it in
> spectral/input/_topic_skm.csv
> (I'll leave that example input file in place unchanged for others to try. It
> is built from dbpedia data, encoding associations from Wikipedia pages to
> categories. Whether it is a good use of spectral clustering I'm not sure, but
> I'd at least hope the job would run to completion.)
> 2. Run 'mahout spectralkmeans -k 20 -d 4192499 -x 7 -i spectral/input/ -o
> spectral/output/results1'
> 3. Wait for it to fail just after printing "Loading vector from:
> spectral/output/results1/calculations/diagonal/part-r-00000", with
> java.util.NoSuchElementException at
> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152).
> 4. Rename the file in HDFS to eliminate the leading underscore. Re-run the
> command (give a different results dir or clean up from the first run, to
> avoid mixing the tests). This attempt should succeed and you'll see it
> proceed deeper into the job, i.e. something like
> 12/02/19 14:38:32 INFO common.VectorCache: Loading vector from: spectral/output/results2/calculations/diagonal/part-r-00000
> 12/02/19 14:38:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 12/02/19 14:38:43 INFO input.FileInputFormat: Total input paths to process : 1
> 12/02/19 14:38:44 INFO mapred.JobClient: Running job: job_201202191410_0005
> 12/02/19 14:38:45 INFO mapred.JobClient: map 0% reduce 0%
> 12/02/19 14:39:31 INFO mapred.JobClient: map 1% reduce 0%
> (5. You might get a memory-based failure some time later; that is a separate
> problem.)
> I'll attach a more detailed transcript. I've made no attempt to diagnose
> the internals yet, but I did run some other tests and can confirm that it
> does not seem to matter whether the command-line invocation names the file
> explicitly or by directory name only. A trailing slash does not seem to be
> an issue either.
> Finally, a related 'gotcha': make sure the results directory is not inside
> the input directory when testing.
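
The two pitfalls in the report (an input filename with a leading underscore,
and a results directory nested inside the input directory) can be checked
before submitting a job. The following is a minimal sketch, not part of
Mahout: the function names are hypothetical, and it works on path strings
only, without touching HDFS. It also flags names beginning with a dot, on
the assumption that the failure comes from Hadoop's FileInputFormat skipping
files whose names start with '_' or '.'; the report itself does not confirm
that as the root cause.

```python
import posixpath

# Assumption: Hadoop's FileInputFormat treats names starting with "_" or "."
# as hidden and skips them. The report does not confirm this as the cause.
HIDDEN_PREFIXES = ("_", ".")


def hidden_input_names(paths):
    """Return the paths whose final component starts with '_' or '.'."""
    flagged = []
    for path in paths:
        name = posixpath.basename(path.rstrip("/"))
        if name.startswith(HIDDEN_PREFIXES):
            flagged.append(path)
    return flagged


def output_inside_input(input_dir, output_dir):
    """Return True if output_dir is input_dir itself or lies beneath it."""
    inp = posixpath.normpath(input_dir)
    out = posixpath.normpath(output_dir)
    return out == inp or out.startswith(inp + "/")


if __name__ == "__main__":
    # Paths taken from the steps in the report.
    print(hidden_input_names(["spectral/input/_topic_skm.csv"]))
    print(output_inside_input("spectral/input/", "spectral/output/results1"))
```

If a file is flagged, renaming it in HDFS as in step 4 (for example with
'hadoop fs -mv') is the workaround described in the report.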