Great work. It works like a charm now. Thank you very much.
On Mon, Jul 13, 2009 at 1:41 PM, Jeff Eastman<[email protected]> wrote:
> r793620 fixes the KMeansDriver.isConverged() method to iterate over all
> cluster part files. Unit test now runs without error and the synthetic
> control job completes too.
>
> Jeff Eastman wrote:
>>
>> In this case, the code should be reading all of the clusters into memory
>> to see if they have all converged. These may be split into multiple part
>> files if more than one reducer is specified. So /* is the correct file
>> pattern and it is the calling site that should remove the /part-00000
>> reference. The code in isConverged should loop through all the parts,
>> returning whether they have all converged or not.
>>
>> I'll take a detailed look tomorrow.
>>
>> Grant Ingersoll wrote:
>>>
>>> Hmm, that might be a mistake on my part when trying to resolve how Hadoop
>>> 0.20 now resolves globs. I somewhat blindly applied "/*" where needed, but
>>> I think it is likely worth revisiting here where a specific file is needed?
>>>
>>> -Grant
>>>
>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>>
>>>> This error is still bugging me. The exception:
>>>>
>>>> WARNING: java.io.FileNotFoundException: File
>>>> output/clusters-0/part-00000/* does not exist.
>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>> does not exist.
>>>>
>>>> occurs first at:
>>>>
>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>>
>>>> which corresponds to:
>>>>
>>>> private static boolean isConverged(String filePath, JobConf conf,
>>>>     FileSystem fs) throws IOException {
>>>>   Path outPart = new Path(filePath + "/*");
>>>>   SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>>>       conf); <-- THIS
>>>>   ...
>>>> }
>>>>
>>>> where isConverged() is called in this fashion:
>>>>
>>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>>
>>>> by runIteration(), which is previously invoked by runJob() like:
>>>>
>>>> String clustersOut = output + "/clusters-" + iteration;
>>>> converged = runIteration(input, clustersIn, clustersOut, measureClass,
>>>>     delta, numReduceTasks, iteration);
>>>>
>>>> Consequently, assuming it's the first iteration and the output folder
>>>> has been named "output" by the user, the SequenceFile.Reader receives
>>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>>> believe the path should end in "part-00000" and the + "/*" should be
>>>> removed... although someone, evidently, thought otherwise.
>>>>
>>>> Any feedback?
>>>>
>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<[email protected]> wrote:
>>>>>
>>>>> I was using Canopy to create input clusters, but the error appeared
>>>>> while running kMeans (if I run kMeans' job only with previously
>>>>> created clusters from Canopy placed in output/canopies as initial
>>>>> clusters, it still fails). I noticed no other problems. I was using
>>>>> revision 790979 before updating. Strangely, there were no changes in
>>>>> the job and driver classes from that revision. svn diff shows that the
>>>>> only classes that changed in the org.apache.mahout.clustering.kmeans
>>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>>
>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff
>>>>> Eastman<[email protected]> wrote:
>>>>>>
>>>>>> Hum, no, it's looking for the output of the first iteration. Were
>>>>>> there other errors? What was the last revision you were running? It
>>>>>> does look like something got horked, as it should be looking for
>>>>>> output/clusters-0/*. Can you diff the job and driver class to see
>>>>>> what changed?
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> nfantone wrote:
>>>>>>>
>>>>>>> Fellows, today I updated to revision 791558 and while running kMeans
>>>>>>> I got the following exception:
>>>>>>>
>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>> does not exist.
>>>>>>>
>>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>>>> directory, which doesn't exist; and, as far as I know, "part-xxxxx"
>>>>>>> are default names for output files.
>>>>>>>
>>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>>
>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<[email protected]> wrote:
>>>>>>>>
>>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>>
>>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>>>> sequence file format, but the Key is never used. To my knowledge,
>>>>>>>>> there is no requirement to assign identifiers to the input
>>>>>>>>> points*. Users are free to associate an arbitrary name field with
>>>>>>>>> each vector - also label mappings may be assigned - but these are
>>>>>>>>> not manipulated by KMeans or any of the other clustering
>>>>>>>>> applications. The name field is now used as a vector identifier
>>>>>>>>> by the KMeansClusterMapper - if it is non-null - in the output
>>>>>>>>> step only.
>>>>>>>>
>>>>>>>> Keys may not be used internally, but externally they can prove to
>>>>>>>> be pretty useful. For me, keys are userIDs and each Vector
>>>>>>>> represents his/her historical behavior. Being able to collect the
>>>>>>>> output information as <UserID, ClusterID> is quite neat as it
>>>>>>>> allows me to, for instance, retrieve user information using data
>>>>>>>> directly from an HDFS file's field.
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>> Solr/Lucene:
>>> http://www.lucidimagination.com/search
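
For anyone reading this thread later: the approach Jeff describes (loop over every part file in the clusters directory and report convergence only if all of them have converged) can be sketched roughly as below. This is not the code from r793620. It is a stand-in that uses the local filesystem via java.nio instead of Hadoop's FileSystem and SequenceFile.Reader, and it assumes a made-up text format in which each line of a part file is "clusterId<TAB>true|false". The class name ConvergenceCheck, the file format, and the choice to return false when no part files exist are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ConvergenceCheck {

    /**
     * Returns true only if every cluster line in every part-* file under
     * clustersDir is flagged as converged. Hypothetical line format:
     * "clusterId<TAB>true|false". Returns false if no part files are found.
     */
    static boolean isConverged(Path clustersDir) throws IOException {
        boolean sawAnyPart = false;
        // The glob matches part-00000, part-00001, ... (one file per reducer).
        try (DirectoryStream<Path> parts =
                 Files.newDirectoryStream(clustersDir, "part-*")) {
            for (Path part : parts) {
                sawAnyPart = true;
                for (String line : Files.readAllLines(part)) {
                    if (line.isEmpty()) {
                        continue; // skip blank lines
                    }
                    String[] fields = line.split("\t");
                    // A malformed line or a "false" flag means not converged.
                    if (fields.length < 2 || !Boolean.parseBoolean(fields[1])) {
                        return false;
                    }
                }
            }
        }
        return sawAnyPart;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("clusters-0");
        Files.write(dir.resolve("part-00000"), Arrays.asList("c0\ttrue", "c1\ttrue"));
        Files.write(dir.resolve("part-00001"), Arrays.asList("c2\tfalse"));
        System.out.println(isConverged(dir)); // prints false
    }
}
```

The point of the sketch is that the caller passes the clusters directory itself (e.g. output/clusters-0) rather than a single part file, so the check still works when more than one reducer writes output. In Hadoop itself the equivalent directory listing would presumably go through something like FileSystem.globStatus, with one SequenceFile.Reader opened per matched path.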
