Re: Clustering from DB

Grant Ingersoll Fri, 10 Jul 2009 19:25:13 -0700

Hmm, that might be a mistake on my part when trying to resolve how Hadoop 0.20 now resolves globs. I somewhat blindly applied "/*" where needed, but I think it is likely worth revisiting here, where a specific file is needed?


-Grant


On Jul 10, 2009, at 3:08 PM, nfantone wrote:

This error is still bugging me. The exception:

WARNING: java.io.FileNotFoundException: File
output/clusters-0/part-00000/* does not exist.
java.io.FileNotFoundException: File output/clusters-0/part-00000/*
does not exist.

occurs first at:
org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
which corresponds to:

 private static boolean isConverged(String filePath, JobConf conf, FileSystem fs)
     throws IOException {
   Path outPart = new Path(filePath + "/*");
   SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart, conf);  // <-- THIS
   ...
 }

where isConverged() is called in this fashion:

return isConverged(clustersOut + "/part-00000", conf, fs);

by runIteration(), which is previously invoked by runJob() like:

    String clustersOut = output + "/clusters-" + iteration;
    converged = runIteration(input, clustersIn, clustersOut, measureClass,
         delta, numReduceTasks, iteration);

Consequently, assuming it's the first iteration and the output folder
has been named "output" by the user, the SequenceFile.Reader receives
"output/clusters-0/part-00000/*" as a path, which is non-existent. I
believe the path should end in "part-00000" and the + "/*" should be
removed... although someone, evidently, thought otherwise.
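
The concatenation steps can be traced in isolation. What follows is only a
minimal sketch, not Mahout code: the class name PathTrace and its three helper
methods are hypothetical, and only the string concatenations are taken from
the snippets quoted in this message.

```java
public class PathTrace {
    // Mirrors runJob(): String clustersOut = output + "/clusters-" + iteration;
    static String clustersOut(String output, int iteration) {
        return output + "/clusters-" + iteration;
    }

    // Mirrors runIteration(): isConverged(clustersOut + "/part-00000", conf, fs)
    static String filePath(String clustersOut) {
        return clustersOut + "/part-00000";
    }

    // Mirrors isConverged(): new Path(filePath + "/*")
    static String readerPath(String filePath) {
        return filePath + "/*";
    }

    public static void main(String[] args) {
        String out = clustersOut("output", 0);
        String file = filePath(out);
        System.out.println(file);             // the part file itself
        System.out.println(readerPath(file)); // the string handed to the reader
    }
}
```

The second printed string is the one reported in the exception, with "/*"
appended after the name of a regular file.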

Any feedback?

On Mon, Jul 6, 2009 at 5:39 PM, nfantone wrote:
I was using Canopy to create input clusters, but the error appeared
while running kMeans (if I run kMeans' job only with previously
created clusters from Canopy placed in output/canopies as initial
clusters, it still fails). I noticed no other problems. I was using
revision 790979 before updating. Strangely, there were no changes in
the job and driver classes from that revision. svn diff shows that the
only classes that changed in the org.apache.mahout.clustering.kmeans
package were KMeansInfo.java and RandomSeedGenerator.java.
On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman wrote:
Hum, no, it's looking for the output of the first iteration. Were there other errors? What was the last revision you were running? It does look like something got horked, as it should be looking for output/clusters-0/*. Can
you diff the job and driver class to see what changed?
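
The difference between a glob over a directory (output/clusters-0/*) and a
literal path with "/*" stuck onto a file name can be illustrated with the
plain Java standard library. This is an analogy on a local temporary
directory, not Hadoop's own glob handling; the class GlobVsLiteral and its
methods are hypothetical names chosen for the sketch.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GlobVsLiteral {
    // Expands a glob against a directory and returns the matching names, sorted.
    static List<String> expand(Path dir, String glob) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    // Treats the string as a literal path, the way a plain file reader would.
    static boolean existsLiterally(Path dir, String relative) {
        return new File(dir.toFile(), relative).exists();
    }

    // Builds a throwaway directory laid out like output/clusters-0/.
    static Path makeClustersDir() throws IOException {
        Path dir = Files.createTempDirectory("clusters-0");
        Files.createFile(dir.resolve("part-00000"));
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Path dir = makeClustersDir();
        System.out.println(expand(dir, "*"));                     // glob over the directory
        System.out.println(existsLiterally(dir, "part-00000"));   // the file itself
        System.out.println(existsLiterally(dir, "part-00000/*")); // "/*" appended to a file
    }
}
```

Expanding "*" against the directory finds part-00000, whereas the literal
string "part-00000/*" names nothing, because part-00000 is a file and has no
children.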

Jeff

nfantone wrote:
Fellows, today I updated to revision 791558 and while running kMeans I
got the following exception:

WARNING: java.io.FileNotFoundException: File
output/clusters-0/part-00000/* does not exist.
java.io.FileNotFoundException: File output/clusters-0/part-00000/*
does not exist.

The algorithm isn't interrupted, though. But this exception wasn't
thrown before the update and, to me, its message is not quite clear.
It seems as if it's looking for any file inside a "part-00000"
directory, which doesn't exist; and, as far as I know, "part-xxxxx" are
default names for output files.

I could show the entire stack trace, if needed. Any pointers?


On Thu, Jul 2, 2009 at 3:16 PM, nfantone wrote:
Thanks for the feedback, Jeff.
The logical format of input to KMeans is <Key, Vector> as it is in
sequence file format, but the Key is never used. To my knowledge, there
is no requirement to assign identifiers to the input points*. Users are
free to associate an arbitrary name field with each vector - also label
mappings may be assigned - but these are not manipulated by KMeans or
any of the other clustering applications. The name field is now used as
a vector identifier by the KMeansClusterMapper - if it is non-null - in
the output step only.

The keys may not be used internally, but externally they can prove to
be pretty useful. For me, keys are userIDs and each Vector represents
his/her historical behavior. Being able to collect the output
information as <UserID, ClusterID> is quite neat as it allows me to,
for instance, retrieve user information using data directly from an
HDFS file's field.
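
To make the <UserID, ClusterID> idea concrete, here is a minimal standalone
sketch that carries each input key through to the output next to the index of
the nearest centroid. It is not Mahout's KMeansClusterMapper: the class
KeyedAssignment, the use of squared Euclidean distance, and the sample user
IDs and centroids are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KeyedAssignment {
    // Squared Euclidean distance between two equal-length vectors.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c])
                    < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Keeps each input key and emits <key, clusterId> in input order.
    static Map<String, Integer> assign(Map<String, double[]> points,
                                       double[][] centroids) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : points.entrySet()) {
            result.put(e.getKey(), nearest(e.getValue(), centroids));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, double[]> points = new LinkedHashMap<>();
        points.put("user-1", new double[] {0.1, 0.2});   // hypothetical user IDs
        points.put("user-2", new double[] {9.8, 10.1});
        double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };
        System.out.println(assign(points, centroids));   // prints {user-1=0, user-2=1}
    }
}
```

Because the key travels with the vector, the result can be joined back to
user records without any separate bookkeeping of point order.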


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search
