Re: Clustering from DB

nfantone Sun, 12 Jul 2009 17:56:27 -0700

Yes. After taking another look at it, I tend to agree with Jeff
here. isConverged() should be receiving an absolute path to a
directory containing all the clusters, which could have been split
into several parts.


I'll also look into that tomorrow, at work.
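
To make the idea concrete, here is a minimal sketch of the loop Jeff
describes below: go through every part file in the clusters directory
and report convergence only if all clusters in all parts have converged.
It is a plain-Java stand-in so that it runs without Hadoop, not the
actual KMeansDriver code. The class name ConvergenceCheck, the method
allConverged() and the one-record-per-line "<clusterId><TAB><true|false>"
format are assumptions made up for illustration; the real fix would
presumably go through the Hadoop FileSystem and SequenceFile.Reader
instead of java.nio.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Plain-Java stand-in for the convergence check; NOT the Mahout/Hadoop code. */
final class ConvergenceCheck {

  private ConvergenceCheck() {}

  /**
   * Returns true only if every cluster in every part-* file under clustersDir
   * is flagged as converged.
   *
   * Hypothetical record format (an assumption, not Mahout's): one cluster per
   * line, written as "<clusterId>\t<true|false>".
   */
  static boolean allConverged(Path clustersDir) throws IOException {
    boolean sawPart = false;
    // The caller passes the directory (e.g. output/clusters-0), not a single
    // part file; the glob over its contents is applied here, once.
    try (DirectoryStream<Path> parts = Files.newDirectoryStream(clustersDir, "part-*")) {
      for (Path part : parts) {
        sawPart = true;
        List<String> lines = Files.readAllLines(part);
        for (String line : lines) {
          if (line.isEmpty()) {
            continue; // tolerate blank lines
          }
          String[] fields = line.split("\t");
          // A malformed or non-converged record means we are not done yet.
          if (fields.length < 2 || !Boolean.parseBoolean(fields[1])) {
            return false;
          }
        }
      }
    }
    // With no part files at all there is nothing to call converged.
    return sawPart;
  }
}
```

With this shape the call site would pass only the clusters directory
(clustersOut) and leave "/part-00000" out, so the glob is applied to the
directory and never to a file.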

On Sun, Jul 12, 2009 at 7:51 PM, Jeff Eastman<[email protected]> wrote:
> In this case, the code should be reading all of the clusters into memory to
> see if they have all converged. These may be split into multiple part files
> if more than one reducer is specified. So /* is the correct file pattern and
> it is the calling site that should remove the /part-0000 reference. The code
> in isConverged should loop through all the parts, returning if they have all
> converged or not.
>
> I'll take a detailed look tomorrow.
>
>
> Grant Ingersoll wrote:
>>
>> Hmm, that might be a mistake on my part when trying to resolve how Hadoop
>> 0.20 now resolves globs.  I somewhat blindly applied "/*" where needed, but
>> I think it is likely worth revisiting here, where a specific file is needed?
>>
>> -Grant
>>
>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>
>>> This error is still bugging me. The exception:
>>>
>>> WARNING: java.io.FileNotFoundException: File
>>> output/clusters-0/part-00000/* does not exist.
>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>> does not exist.
>>>
>>> occurs first at:
>>>
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>
>>> which corresponds to:
>>>
>>>  private static boolean isConverged(String filePath, JobConf conf, FileSystem fs)
>>>      throws IOException {
>>>    Path outPart = new Path(filePath + "/*");
>>>    SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart, conf);  // <-- THIS
>>>    ...
>>>  }
>>>
>>> where isConverged() is called in this fashion:
>>>
>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>
>>> by runIteration(), which is previously invoked by runJob() like:
>>>
>>>    String clustersOut = output + "/clusters-" + iteration;
>>>    converged = runIteration(input, clustersIn, clustersOut, measureClass,
>>>        delta, numReduceTasks, iteration);
>>>
>>> Consequently, assuming it's the first iteration and the output folder
>>> has been named "output" by the user, the SequenceFile.Reader receives
>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>> believe the path should end in "part-00000" and the  + "/*" should be
>>> removed... although someone, evidently, thought otherwise.
>>>
>>> Any feedback?
>>>
>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<[email protected]> wrote:
>>>>
>>>> I was using Canopy to create input clusters, but the error appeared
>>>> while running kMeans (if I run kMeans' job only with previously
>>>> created clusters from Canopy placed in output/canopies as initial
>>>> clusters, it still fails). I noticed no other problems. I was using
>>>> revision 790979 before updating.  Strangely, there were no changes in
>>>> the job and drivers class from that revision. svn diff shows that the
>>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>
>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<[email protected]>
>>>> wrote:
>>>>>
>>>>> Hum, no, it's looking for the output of the first iteration. Were there
>>>>> other errors? What was the last revision you were running? It does look
>>>>> like something got horked, as it should be looking for
>>>>> output/clusters-0/*. Can you diff the job and driver class to see what
>>>>> changed?
>>>>>
>>>>> Jeff
>>>>>
>>>>> nfantone wrote:
>>>>>>
>>>>>> Fellows, today I updated to revision 791558 and while running kMeans I
>>>>>> got the following exception:
>>>>>>
>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>> does not exist.
>>>>>>
>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>> It seems as if it's looking for any file inside a "part-00000" directory,
>>>>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>>>>> names for output files.
>>>>>>
>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>>> sequence file format, but the Key is never used. To my knowledge,
>>>>>>>> there is no requirement to assign identifiers to the input points*.
>>>>>>>> Users are free to associate an arbitrary name field with each vector
>>>>>>>> - also label mappings may be assigned - but these are not manipulated
>>>>>>>> by KMeans or any of the other clustering applications. The name field
>>>>>>>> is now used as a vector identifier by the KMeansClusterMapper - if it
>>>>>>>> is non-null - in the output step only.
>>>>>>>>
>>>>>>>
>>>>>>> The keys may not be used internally, but externally they can prove to
>>>>>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>>>>>> his/her historical behavior. Being able to collect the output
>>>>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>>>>> for instance, retrieve user information using data directly from a
>>>>>>> HDFS file's field.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>
>
>
