Re: Clustering from DB

nfantone Mon, 13 Jul 2009 10:40:12 -0700

Great work. It works like a charm now. Thank you very much.


On Mon, Jul 13, 2009 at 1:41 PM, Jeff Eastman<[email protected]> wrote:
> r793620 fixes the KMeansDriver.isConverged() method to iterate over all
> cluster part files. Unit test now runs without error and the synthetic
> control job completes too.
>
>
> Jeff Eastman wrote:
>>
>> In this case, the code should be reading all of the clusters into memory
>> to see if they have all converged. These may be split into multiple part
>> files if more than one reducer is specified. So /* is the correct file
>> pattern and it is the calling site that should remove the /part-00000
>> reference. The code in isConverged should loop through all the parts,
>> returning whether they have all converged.
>>
>> I'll take a detailed look tomorrow.
>>
>>
>> Grant Ingersoll wrote:
>>>
>>> Hmm, that might be a mistake on my part when trying to adapt to how Hadoop
>>> 0.20 now resolves globs.  I somewhat blindly applied "/*" where needed, but
>>> I think it is likely worth revisiting here, where a specific file is needed?
>>>
>>> -Grant
>>>
>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>>
>>>> This error is still bugging me. The exception:
>>>>
>>>> WARNING: java.io.FileNotFoundException: File
>>>> output/clusters-0/part-00000/* does not exist.
>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>> does not exist.
>>>>
>>>> occurs first at:
>>>>
>>>>
>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>>
>>>> which corresponds to:
>>>>
>>>> private static boolean isConverged(String filePath, JobConf conf,
>>>>     FileSystem fs) throws IOException {
>>>>   Path outPart = new Path(filePath + "/*");
>>>>   SequenceFile.Reader reader =
>>>>       new SequenceFile.Reader(fs, outPart, conf);  // <-- THIS
>>>>   ...
>>>> }
>>>>
>>>> where isConverged() is called in this fashion:
>>>>
>>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>>
>>>> by runIteration(), which is previously invoked by runJob() like:
>>>>
>>>>   String clustersOut = output + "/clusters-" + iteration;
>>>>   converged = runIteration(input, clustersIn, clustersOut, measureClass,
>>>>       delta, numReduceTasks, iteration);
>>>>
>>>> Consequently, assuming it's the first iteration and the output folder
>>>> has been named "output" by the user, the SequenceFile.Reader receives
>>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>>> believe the path should end in "part-00000" and the  + "/*" should be
>>>> removed... although someone, evidently, thought otherwise.
>>>>
>>>> Any feedback?
>>>>
>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<[email protected]> wrote:
>>>>>
>>>>> I was using Canopy to create input clusters, but the error appeared
>>>>> while running kMeans (if I run kMeans' job only with previously
>>>>> created clusters from Canopy placed in output/canopies as initial
>>>>> clusters, it still fails). I noticed no other problems. I was using
>>>>> revision 790979 before updating. Strangely, there were no changes in
>>>>> the job and driver classes since that revision. svn diff shows that the
>>>>> only classes that changed in the org.apache.mahout.clustering.kmeans
>>>>> package were KMeansInfo.java and RandomSeedGenerator.java.
>>>>>
>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<[email protected]> wrote:
>>>>>>
>>>>>> Hum, no, it's looking for the output of the first iteration. Were there
>>>>>> other errors? What was the last revision you were running? It does look
>>>>>> like something got horked, as it should be looking for
>>>>>> output/clusters-0/*. Can you diff the job and driver class to see what
>>>>>> changed?
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> nfantone wrote:
>>>>>>>
>>>>>>> Fellows, today I updated to revision 791558 and while running kMeans I
>>>>>>> got the following exception:
>>>>>>>
>>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>>> does not exist.
>>>>>>>
>>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>>>> directory, which doesn't exist; and, as far as I know, "part-xxxxx"
>>>>>>> are default names for output files.
>>>>>>>
>>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<[email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>>>> sequence file format, but the Key is never used. To my knowledge,
>>>>>>>>> there is no requirement to assign identifiers to the input points*.
>>>>>>>>> Users are free to associate an arbitrary name field with each vector
>>>>>>>>> - also label mappings may be assigned - but these are not manipulated
>>>>>>>>> by KMeans or any of the other clustering applications. The name field
>>>>>>>>> is now used as a vector identifier by the KMeansClusterMapper - if it
>>>>>>>>> is non-null - in the output step only.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Keys may not be used internally, but externally they can prove to
>>>>>>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>>>>>> for instance, retrieve user information using data directly from an
>>>>>>>> HDFS file's field.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>> Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>>>
>>
>>
>>
>
>
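
For readers following the thread: the fix Jeff describes above (treat the
cluster output as a directory of part files and check every one of them) can
be illustrated with a small standalone sketch. This is not the code from
r793620 and it does not use the Hadoop API; it uses java.nio.file on a local
directory so that it runs without a cluster. The class name
PartFileGlobSketch, its helper methods, and the one-line-per-cluster text
format ("<clusterId> <true|false>") are assumptions made up for illustration;
the real job stores clusters in SequenceFiles.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: a local-filesystem stand-in for the Hadoop logic.
final class PartFileGlobSketch {

  private PartFileGlobSketch() {
  }

  /** Returns every "part-*" file directly inside clustersDir, sorted by name. */
  static List<Path> listPartFiles(Path clustersDir) throws IOException {
    List<Path> parts = new ArrayList<>();
    // The glob is applied to the directory, never to a single part file.
    try (DirectoryStream<Path> stream =
             Files.newDirectoryStream(clustersDir, "part-*")) {
      for (Path part : stream) {
        parts.add(part);
      }
    }
    Collections.sort(parts);
    return parts;
  }

  /**
   * Returns true only if every cluster in every part file is converged.
   * Assumed (hypothetical) line format: "<clusterId> <true|false>".
   */
  static boolean allConverged(Path clustersDir) throws IOException {
    List<Path> parts = listPartFiles(clustersDir);
    if (parts.isEmpty()) {
      return false; // no output yet, so nothing can be called converged
    }
    for (Path part : parts) {
      for (String line : Files.readAllLines(part)) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
          continue; // ignore blank lines
        }
        String[] fields = trimmed.split("\\s+");
        if (fields.length < 2 || !Boolean.parseBoolean(fields[1])) {
          return false; // one unconverged cluster is enough to keep iterating
        }
      }
    }
    return true;
  }
}
```

Calling allConverged with the clusters directory (the local equivalent of
output/clusters-0) inspects part-00000, part-00001 and so on. Passing the path
of a single part file instead fails with an IOException (typically
NotDirectoryException), which is the local analogue of the
FileNotFoundException reported in the thread: a glob is being appended to
something that is a file, not a directory.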
