Robin, Do you remember if this test ran successfully to completion? If not, I'll submit a JIRA when I've a complete log of a failed run...
Dan

---------- Forwarded message ----------
From: Grant Ingersoll <[email protected]>
Date: 21 June 2012 21:33
Subject: Re: Spectral Kmeans wiki category data test - can you confirm if you ran it to completion?
To: Dan Brickley <[email protected]>
Cc: Shannon Quinn <[email protected]>

I'd ask on dev@, as Robin was actually the one who ran it.

On Jun 21, 2012, at 3:15 PM, Dan Brickley wrote:

Hi

With the patch https://issues.apache.org/jira/browse/MAHOUT-986 in 0.7, this doesn't die so quickly ... but I'm still not seeing it run to completion.

Using the template commandline you suggested,

    bin/mahout spectralkmeans -k 20 -d 4192499 -x 7 -i path/to/csv/file/ -o your/output/path/

I've seen it fail with -k 20, and -k 10

Unfortunately I was running this in a screen session without proper logging and I want to double-check everything before reporting so I'm re-running with -k 10 now and will file a bug if it fails, ... but meanwhile I wanted to check in with you to see if you'd had a successful run. I'm testing with the 0.7 distro.

The failure was an IndexException, here's the -k 20 version,

    mahout spectralkmeans -k 20 -d 4192499 -x 7 -i spectral/input/ -o spectral/output/

    12/06/19 19:33:11 INFO lanczos.LanczosSolver: 20 passes through the corpus so far...
    Exception in thread "main" org.apache.mahout.math.IndexException: Index 20 is outside allowable range of [0,20)
        at org.apache.mahout.math.AbstractMatrix.set(AbstractMatrix.java:479)
        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:132)
        at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.runJob(DistributedLanczosSolver.java:73)
        at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:148)
        at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:86)

It's barfing out here,

    // Next step: perform eigen-decomposition using LanczosSolver
    // since some of the eigen-output is spurious and will be eliminated
    // upon verification, we have to aim to overshoot and then discard
    // unnecessary vectors later
    int overshoot = (int) ((double) clusters * OVERSHOOT_MULTIPLIER);
    DistributedLanczosSolver solver = new DistributedLanczosSolver();
    LanczosState state = new LanczosState(L, overshoot, solver.getInitialVector(L));
    Path lanczosSeqFiles = new Path(outputCalc, "eigenvectors-" + (System.nanoTime() & 0xFF));
    solver.runJob(conf, state, overshoot, true, lanczosSeqFiles.toString());

With -k 10 I got

    12/06/20 20:51:15 INFO lanczos.LanczosSolver: 10 passes through the corpus so far...
    Exception in thread "main" org.apache.mahout.math.IndexException: Index 10 is outside allowable range of [0,10)
        at org.apache.mahout.math.AbstractMatrix.set(AbstractMatrix.java:479)

...although the logs also showed

    12/06/20 20:40:18 INFO lanczos.LanczosSolver: Finding 20 singular vectors of matrix with 4192499 rows, via Lanczos

which confused me until Shannon reminded me of the overshoot.

I'm happy to +cc the mailing lists but for starters thought I'd check to see if the test run had succeeded for you; if so, maybe I've some local problem.

Dan

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
