spectralkmeans utility fails when input filename begins with leading underscore
-------------------------------------------------------------------------------

                 Key: MAHOUT-978
                 URL: https://issues.apache.org/jira/browse/MAHOUT-978
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.6
         Environment: Tested on a real Linux-based cluster running Hadoop 
0.20.2-cdh3u2 with the Mahout 0.6 release; also on an OS X pseudo-distributed 
cluster running Hadoop 0.20.203.0 with a 16 Feb trunk build.


            Reporter: Dan Brickley
            Priority: Minor


The command-line 'bin/mahout spectralkmeans' utility fails with 
NoSuchElementException after "Loading vector from: 
spectral/output/results2/calculations/diagonal/part-r-00000" when the input 
file in HDFS has a name beginning with an underscore.

This was partially reported in the comments for MAHOUT-524, but I believe it is 
now identified as a distinct issue (thanks to Shannon for help diagnosing). I 
have not investigated whether there is an equivalent problem for API-based use 
of this piece of Mahout.

Steps to reproduce: 

1. Put an affinity file into HDFS, following 
https://cwiki.apache.org/MAHOUT/spectral-clustering.html - note that node IDs 
count from zero etc. Name your file with a leading underscore. For example, try 
http://danbri.org/2012/spectral/dbpedia/_topic_skm.csv and store it in 
spectral/input/_topic_skm.csv

(I'll leave that example input file in place unchanged for others to try. It is 
built from dbpedia data, encoding associations from Wikipedia pages to 
categories. Whether it is a good use of spectral clustering I'm not sure, but 
I'd at least hope the job would run to completion.)

2. Run 'mahout spectralkmeans -k 20 -d 4192499 -x 7 -i spectral/input/ -o 
spectral/output/results1'

3. Wait for it to fail just after printing "Loading vector from: 
spectral/output/results1/calculations/diagonal/part-r-00000", with 
java.util.NoSuchElementException at 
com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152).

4. Rename the file in HDFS to eliminate the leading underscore. Re-run the 
command (give a different results dir or clean up after the first run, to avoid 
mixing the tests). This attempt should succeed and you'll see it proceed deeper 
into the job, i.e. something like 

12/02/19 14:38:32 INFO common.VectorCache: Loading vector from: 
spectral/output/results2/calculations/diagonal/part-r-00000
12/02/19 14:38:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
12/02/19 14:38:43 INFO input.FileInputFormat: Total input paths to process : 1
12/02/19 14:38:44 INFO mapred.JobClient: Running job: job_201202191410_0005
12/02/19 14:38:45 INFO mapred.JobClient:  map 0% reduce 0%
12/02/19 14:39:31 INFO mapred.JobClient:  map 1% reduce 0%

(5. You might get a memory-based failure some time later; that is a separate 
problem.)

I'll attach a more detailed transcript. I've made no attempt to diagnose 
internals yet, but I did run some other tests and can confirm that it does not 
seem to matter whether the command-line invocation names the file explicitly or 
by directory name only. A trailing slash does not seem to be an issue either. 
Finally, a related 'gotcha': make sure the results directory is not inside the 
input directory when testing.
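
One plausible explanation, not confirmed in this report, is that Hadoop's 
FileInputFormat skips input files whose names begin with '_' or '.', so an 
underscored affinity file would contribute no records. The sketch below is 
plain Java with no Hadoop or Mahout dependency; the class and method names are 
hypothetical and only illustrate a pre-flight check for the two pitfalls noted 
above (an underscore-prefixed input name, and a results directory nested inside 
the input directory).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Mahout or Hadoop.
class InputNameCheck {

    // Assumption: mirrors the Hadoop convention of treating file names
    // that start with '_' or '.' as hidden when listing job input.
    public static boolean isHiddenByConvention(String fileName) {
        return fileName.startsWith("_") || fileName.startsWith(".");
    }

    // Suggests a name with any leading '_' or '.' characters removed,
    // as in the rename workaround from step 4. Returns an empty string
    // if the name consists only of those characters.
    public static String stripHiddenPrefix(String fileName) {
        int i = 0;
        while (i < fileName.length()
                && (fileName.charAt(i) == '_' || fileName.charAt(i) == '.')) {
            i++;
        }
        return fileName.substring(i);
    }

    // True if outputDir is the same as, or nested under, inputDir.
    // Paths are compared purely as strings of path elements; nothing
    // is looked up on a real file system.
    public static boolean outputInsideInput(String inputDir, String outputDir) {
        Path in = Paths.get(inputDir).normalize();
        Path out = Paths.get(outputDir).normalize();
        return out.startsWith(in);
    }

    public static void main(String[] args) {
        String name = "_topic_skm.csv";
        if (isHiddenByConvention(name)) {
            System.out.println("Input may be skipped; consider renaming to "
                    + stripHiddenPrefix(name));
        }
        if (outputInsideInput("spectral/input/", "spectral/output/results1")) {
            System.out.println("Results directory is inside the input directory");
        }
    }
}
```

Running a check like this before submitting the job would flag 
spectral/input/_topic_skm.csv, while the paths used in step 2 would pass the 
nesting check.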




        
