Yes. After taking another look at it, I tend to agree with Jeff here. isConverged() should receive an absolute path to the directory containing all the clusters, which may have been split into several part files.
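To make that concrete, here is a rough, untested-against-Hadoop sketch of the loop I have in mind. Everything in it is a stand-in: it uses java.nio on a local directory instead of Hadoop's FileSystem, the one-line-per-cluster text format ("<clusterId> TAB <true|false>") is made up for the example (the real parts are SequenceFiles), and the class and helper names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ConvergenceCheckSketch {

  // Stand-in for reading one part file. ASSUMPTION: each line is
  // "<clusterId>\t<true|false>". The real job would read Cluster
  // records with a SequenceFile.Reader here instead.
  static boolean partConverged(Path part) {
    try {
      for (String line : Files.readAllLines(part)) {
        if (line.trim().isEmpty()) {
          continue; // skip blank lines
        }
        String[] fields = line.split("\t");
        if (fields.length < 2 || !Boolean.parseBoolean(fields[1].trim())) {
          return false; // malformed line or unconverged cluster
        }
      }
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Receives the clusters directory itself (e.g. "output/clusters-0"),
  // not a single part file, and visits every part-* file inside it.
  static boolean isConverged(Path clustersDir) {
    try (DirectoryStream<Path> parts =
             Files.newDirectoryStream(clustersDir, "part-*")) {
      boolean sawPart = false;
      for (Path part : parts) {
        sawPart = true;
        if (!partConverged(part)) {
          return false; // one unconverged part is enough to stop
        }
      }
      // Design choice for this sketch: no parts at all counts as "not converged".
      return sawPart;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("clusters-0");
    Files.write(dir.resolve("part-00000"), Arrays.asList("c0\ttrue", "c1\ttrue"));
    Files.write(dir.resolve("part-00001"), Arrays.asList("c2\tfalse"));
    System.out.println(isConverged(dir)); // prints false: part-00001 has not converged
  }
}
```

With this shape the caller passes clustersOut as is, and neither side appends "/part-00000" or "/*" to the path.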
I'll also look into that tomorrow, at work.

On Sun, Jul 12, 2009 at 7:51 PM, Jeff Eastman<[email protected]> wrote:
> In this case, the code should be reading all of the clusters into memory to
> see if they have all converged. These may be split into multiple part files
> if more than one reducer is specified. So /* is the correct file pattern and
> it is the calling site that should remove the /part-00000 reference. The code
> in isConverged should loop through all the parts, returning if they have all
> converged or not.
>
> I'll take a detailed look tomorrow.
>
>
> Grant Ingersoll wrote:
>>
>> Hmm, that might be a mistake on my part when trying to resolve how Hadoop
>> 0.20 now resolves globs. I somewhat blindly applied "/*" where needed, but
>> I think it is likely worth revisiting here where a specific file is needed?
>>
>> -Grant
>>
>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>
>>> This error is still bugging me. The exception:
>>>
>>> WARNING: java.io.FileNotFoundException: File
>>> output/clusters-0/part-00000/* does not exist.
>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>> does not exist.
>>>
>>> occurs first at:
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>
>>> which corresponds to:
>>>
>>>   private static boolean isConverged(String filePath, JobConf conf,
>>>       FileSystem fs)
>>>       throws IOException {
>>>     Path outPart = new Path(filePath + "/*");
>>>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>>         conf);  <-- THIS
>>>     ...
>>>   }
>>>
>>> where isConverged() is called in this fashion:
>>>
>>>   return isConverged(clustersOut + "/part-00000", conf, fs);
>>>
>>> by runIteration(), which is previously invoked by runJob() like:
>>>
>>>   String clustersOut = output + "/clusters-" + iteration;
>>>   converged = runIteration(input, clustersIn, clustersOut,
>>>       measureClass,
>>>       delta, numReduceTasks, iteration);
>>>
>>> Consequently, assuming it's the first iteration and the output folder
>>> has been named "output" by the user, the SequenceFile.Reader receives
>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>> believe the path should end in "part-00000" and the + "/*" should be
>>> removed... although someone, evidently, thought otherwise.
>>>
>>> Any feedback?
>>>
>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<[email protected]> wrote:
>>>>
>>>> I was using Canopy to create input clusters, but the error appeared
>>>> while running kMeans (if I run kMeans' job only with previously
>>>> created clusters from Canopy placed in output/canopies as initial
>>>> clusters, it still fails). I noticed no other problems. I was using
>>>> revision 790979 before updating. Strangely, there were no changes in
>>>> the job and driver classes from that revision. svn diff shows that the
>>>> only classes that changed in the org.apache.mahout.clustering.kmeans
>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>
>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<[email protected]>
>>>> wrote:
>>>>>
>>>>> Hum, no, it's looking for the output of the first iteration. Were there
>>>>> other errors? What was the last revision you were running? It does look
>>>>> like something got horked, as it should be looking for
>>>>> output/clusters-0/*. Can you diff the job and driver class to see what
>>>>> changed?
>>>>>
>>>>> Jeff
>>>>>
>>>>> nfantone wrote:
>>>>>>
>>>>>> Fellows, today I updated to revision 791558 and while running kMeans I
>>>>>> got the following exception:
>>>>>>
>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>> does not exist.
>>>>>>
>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>> It seems as if it's looking for any file inside a "part-00000" directory,
>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>>>>> names for output files.
>>>>>>
>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>
>>>>>>>>
>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>>> sequence file format, but the Key is never used. To my knowledge,
>>>>>>>> there is no requirement to assign identifiers to the input points*.
>>>>>>>> Users are free to associate an arbitrary name field with each
>>>>>>>> vector - also label mappings may be assigned - but these are not
>>>>>>>> manipulated by KMeans or any of the other clustering applications.
>>>>>>>> The name field is now used as a vector identifier by the
>>>>>>>> KMeansClusterMapper - if it is non-null - in the output step only.
>>>>>>>>
>>>>>>>
>>>>>>> The key may not be used internally, but externally it can prove to
>>>>>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>>>>> for instance, retrieve user information using data directly from a
>>>>>>> HDFS file's field.
>>>>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
