Re: Clustering from DB

zaki rahaman Wed, 15 Jul 2009 14:25:53 -0700

I hope I'm understanding your setup correctly, but by running on one machine
you're not fully exploiting the capabilities of Hadoop's Map/Reduce. Gains
in computation time will only be seen by increasing the number of cores or
nodes. If you need access to more computing power, you might want to
consider using Amazon's EC2. They have preconfigured AMIs for Hadoop, but
you'd have to install and configure Mahout yourself, a process I'm not yet
familiar with, as I'm still trying to do it myself.
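
On the map-count question quoted below, a rough sketch of the arithmetic may
help. As far as I understand it, conf.getNumMapTasks() only returns the
configured hint (mapred.map.tasks), not the number of maps Hadoop actually
spawns, and the "blocks" in the Javadoc are HDFS storage blocks, not
individual vectors. The Java below is my own approximation of how the old-API
FileInputFormat sizes splits for a single uncompressed file. The class name,
the method names, the 64 MB block size and the 1.1 slop factor are assumptions
for illustration, not code taken from Hadoop or Mahout.

```java
// Sketch only: approximates the old-API FileInputFormat split arithmetic
// for ONE uncompressed, splittable file. All names here are hypothetical.
final class SplitEstimate {
    // Assumed slop factor: the last split may be up to 10% larger than splitSize.
    private static final double SPLIT_SLOP = 1.1;

    // splitSize = max(minSplit, min(totalBytes / hint, blockSize))
    static long splitSize(long totalBytes, int mapTaskHint,
                          long minSplitBytes, long blockBytes) {
        long goalSize = totalBytes / Math.max(mapTaskHint, 1);
        return Math.max(minSplitBytes, Math.min(goalSize, blockBytes));
    }

    // Estimated number of splits (and therefore map tasks) for the file.
    static long estimateSplits(long fileBytes, int mapTaskHint,
                               long minSplitBytes, long blockBytes) {
        if (fileBytes <= 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long size = splitSize(fileBytes, mapTaskHint, minSplitBytes, blockBytes);
        long remaining = fileBytes;
        long splits = 0;
        // Carve off full-size splits while clearly more than one split remains.
        while ((double) remaining / size > SPLIT_SLOP) {
            remaining -= size;
            splits++;
        }
        // Whatever is left becomes the final (possibly slightly larger) split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long gb = 1024L * mb;
        // Assumed values: hint of 2, minimum split of 1 byte, 64 MB blocks.
        System.out.println("60 GB file: " + estimateSplits(60 * gb, 2, 1, 64 * mb));
        System.out.println(" 3 GB file: " + estimateSplits(3 * gb, 2, 1, 64 * mb));
    }
}
```

Under those assumptions a 60 GB file comes out at 960 splits and a 3 GB file
at 48, even with a hint of 2. On a single machine those map tasks still
compete for the same cores, so more splits alone will not shorten the run.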


On Wed, Jul 15, 2009 at 4:24 PM, nfantone <[email protected]> wrote:

> Well, I grew tired of watching the whole thing run and stopped it. I
> then started another test, this time using a smaller dataset of 3 GB,
> and it is still taking way too long.
> See inline comments.
>
> > You are only specifying a single reducer. Try increasing that as below.
>
> I did. I set it to my K value (200).
>
> > No, number of nodes is the number of nodes (computers) in your cluster. You
> > did not say how many nodes you are running on.
>
> I'm running and compiling the application on one simple desktop
> computer at work, and that isn't likely to change after the
> development process is finished.
>
> > Yes, Hadoop allocates this automatically. How many map tasks are being
> > spawned?
>
> Being uncertain of where (and when) Hadoop computes the adequate
> number of map tasks, what I did was inspect the following 'numMaps'
> variable while debugging:
>
>   if (name.startsWith("part") && !name.endsWith(".crc")) {
>     SequenceFile.Reader reader =
>         new SequenceFile.Reader(fs, part.getPath(), conf);
>     int numMaps = conf.getNumMapTasks();
>     ...
>   }
>
> which is at the beginning of the isConverged() method. Its current
> value is, in every iteration, 2. I suspect this isn't right at all,
> either because this is not the proper place to ask for the number of
> maps or because it's not being set the way it should.
> From the Hadoop Javadoc:
>
> "The number of maps is usually driven by the total size of the inputs
> i.e. total number of blocks of the input files."
>
> In my input file each block represents a vector that corresponds to
> some computed user behavior. The number of users to be clustered (i.e.
> the number of blocks) is a parameter expected by the application.
> Perhaps I should somehow change the block size of my HDFS file? Or
> tweak something in the Configuration/FileSystem instance I'm using to
> write it?
>
> >> For now, the clustering is STILL running in the background, ha.
> >>
> >> On Wed, Jul 15, 2009 at 12:30 PM, Jeff
> >> Eastman<[email protected]> wrote:
> >>
> >>>
> >>> Glad to hear KMeans is working reliably now. Your performance problems
> >>> will require some additional tuning. Here are some suggestions:
> >>> - You did not mention how many mappers are running in your job. With
> >>> 60 GB in a single input file, I would think Hadoop would allocate
> >>> multiple mapper tasks automatically, since there are thousands of
> >>> potential splits. If this is not happening (is the file compressed?),
> >>> then breaking it into multiple parts in a preprocessing step would
> >>> allow you to get more concurrency in the map phase.
> >>> - Same with the reducers; how many are you running and what is your K?
> >>> The default number of reducers is 2, but you can increase this up to
> >>> the number of clusters to increase parallelism. Unlike Canopy and Mean
> >>> Shift, KMeans can use multiple reducers up to that limit.
> >>> - Finally, what is the size of your cluster? Adding machines would be
> >>> another way to increase concurrency, since map and reduce tasks are
> >>> spread across the entire cluster.
> >>>
> >>> 60 GB is a small dataset for Hadoop. I don't think it should be taking
> >>> that long.
> >>> Jeff
> >>>
> >>> nfantone wrote:
> >>>
> >>>>
> >>>> After updating to the latest revision, everything seems to be working
> >>>> just fine. However, the task I set up, user clustering by KMeans, is
> >>>> taking forever to complete: I started the job yesterday morning and
> >>>> it's still running today (an elapsed time of nearly 18 hours and
> >>>> counting...). Of course, the main reason behind it is the huge size
> >>>> of the data set I'm trying to process (a ~60 GB HDFS file), but I'm
> >>>> looking for ways to improve the performance. Would splitting the
> >>>> input file into smaller parts make any difference? Is it even possible
> >>>> to set up the Driver to use more than one input (right now, I'm
> >>>> specifying a full path to a single file, including its filename)? What
> >>>> about setting a higher number of reducers? Are there any drawbacks to
> >>>> that? Or running multiple KMeans jobs in several threads?
> >>>>
> >>>> Or perhaps I'm just doing something wrong and it should not be taking
> >>>> this long. Surely, I'm not the first one to encounter this running
> >>>> time issue with large datasets. Ideas, anyone?
> >>>>
> >>>>
> >>>> On Mon, Jul 13, 2009 at 2:39 PM, nfantone<[email protected]> wrote:
> >>>>
> >>>>
> >>>>>
> >>>>> Great work. It works like a charm now. Thank you very much.
> >>>>>
> >>>>> On Mon, Jul 13, 2009 at 1:41 PM, Jeff
> >>>>> Eastman<[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> r793620 fixes the KMeansDriver.isConverged() method to iterate over
> >>>>>> all cluster part files. Unit test now runs without error and the
> >>>>>> synthetic control job completes too.
> >>>>>>
> >>>>>>
> >>>>>> Jeff Eastman wrote:
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> In this case, the code should be reading all of the clusters into
> >>>>>>> memory to see if they have all converged. These may be split into
> >>>>>>> multiple part files if more than one reducer is specified. So /* is
> >>>>>>> the correct file pattern and it is the calling site that should
> >>>>>>> remove the /part-00000 reference. The code in isConverged should loop
> >>>>>>> through all the parts, returning whether they have all converged.
> >>>>>>>
> >>>>>>> I'll take a detailed look tomorrow.
> >>>>>>>
> >>>>>>>
> >>>>>>> Grant Ingersoll wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Hmm, that might be a mistake on my part when trying to resolve how
> >>>>>>>> Hadoop 0.20 now resolves globs. I somewhat blindly applied "/*" where
> >>>>>>>> needed, but I think it is likely worth revisiting here, where a
> >>>>>>>> specific file is needed?
> >>>>>>>>
> >>>>>>>> -Grant
> >>>>>>>>
> >>>>>>>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> This error is still bugging me. The exception:
> >>>>>>>>>
> >>>>>>>>> WARNING: java.io.FileNotFoundException: File
> >>>>>>>>> output/clusters-0/part-00000/* does not exist.
> >>>>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
> >>>>>>>>> does not exist.
> >>>>>>>>>
> >>>>>>>>> occurs first at:
> >>>>>>>>>
> >>>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
> >>>>>>>>>
> >>>>>>>>> which corresponds to:
> >>>>>>>>>
> >>>>>>>>>  private static boolean isConverged(String filePath, JobConf conf,
> >>>>>>>>>      FileSystem fs) throws IOException {
> >>>>>>>>>    Path outPart = new Path(filePath + "/*");
> >>>>>>>>>    SequenceFile.Reader reader =
> >>>>>>>>>        new SequenceFile.Reader(fs, outPart, conf);  // <-- THIS
> >>>>>>>>>    ...
> >>>>>>>>>  }
> >>>>>>>>>
> >>>>>>>>> where isConverged() is called in this fashion:
> >>>>>>>>>
> >>>>>>>>> return isConverged(clustersOut + "/part-00000", conf, fs);
> >>>>>>>>>
> >>>>>>>>> by runIteration(), which is previously invoked by runJob() like:
> >>>>>>>>>
> >>>>>>>>>  String clustersOut = output + "/clusters-" + iteration;
> >>>>>>>>>  converged = runIteration(input, clustersIn, clustersOut, measureClass,
> >>>>>>>>>      delta, numReduceTasks, iteration);
> >>>>>>>>>
> >>>>>>>>> Consequently, assuming it's the first iteration and the output
> >>>>>>>>> folder has been named "output" by the user, the SequenceFile.Reader
> >>>>>>>>> receives "output/clusters-0/part-00000/*" as a path, which does not
> >>>>>>>>> exist. I believe the path should end in "part-00000" and the + "/*"
> >>>>>>>>> should be removed... although someone, evidently, thought otherwise.
> >>>>>>>>>
> >>>>>>>>> Any feedback?
> >>>>>>>>>
> >>>>>>>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I was using Canopy to create input clusters, but the error
> >>>>>>>>>> appeared while running kMeans (if I run kMeans' job only with
> >>>>>>>>>> previously created clusters from Canopy placed in output/canopies
> >>>>>>>>>> as initial clusters, it still fails). I noticed no other problems.
> >>>>>>>>>> I was using revision 790979 before updating. Strangely, there were
> >>>>>>>>>> no changes in the job and driver classes from that revision. svn
> >>>>>>>>>> diff shows that the only classes that changed in the
> >>>>>>>>>> org.apache.mahout.clustering.kmeans package were KMeansInfo.java
> >>>>>>>>>> and RandomSeedGenerator.java.
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff
> >>>>>>>>>> Eastman<[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hum, no, it's looking for the output of the first iteration. Were
> >>>>>>>>>>> there other errors? What was the last revision you were running?
> >>>>>>>>>>> It does look like something got horked, as it should be looking
> >>>>>>>>>>> for output/clusters-0/*. Can you diff the job and driver class to
> >>>>>>>>>>> see what changed?
> >>>>>>>>>>>
> >>>>>>>>>>> Jeff
> >>>>>>>>>>>
> >>>>>>>>>>> nfantone wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Fellows, today I updated to revision 791558 and while running
> >>>>>>>>>>>> kMeans I got the following exception:
> >>>>>>>>>>>>
> >>>>>>>>>>>> WARNING: java.io.FileNotFoundException: File
> >>>>>>>>>>>> output/clusters-0/part-00000/* does not exist.
> >>>>>>>>>>>> java.io.FileNotFoundException: File
> >>>>>>>>>>>> output/clusters-0/part-00000/* does not exist.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The algorithm isn't interrupted, though. But this exception
> >>>>>>>>>>>> wasn't thrown before the update and, to me, its message is not
> >>>>>>>>>>>> quite clear. It seems as if it's looking for any file inside a
> >>>>>>>>>>>> "part-00000" directory, which doesn't exist; and, as far as I
> >>>>>>>>>>>> know, "part-xxxxx" are default names for output files.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I could show the entire stack trace, if needed. Any pointers?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the feedback, Jeff.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it
> >>>>>>>>>>>>>> is in sequence file format, but the Key is never used. To my
> >>>>>>>>>>>>>> knowledge, there is no requirement to assign identifiers to
> >>>>>>>>>>>>>> the input points*. Users are free to associate an arbitrary
> >>>>>>>>>>>>>> name field with each vector - also label mappings may be
> >>>>>>>>>>>>>> assigned - but these are not manipulated by KMeans or any of
> >>>>>>>>>>>>>> the other clustering applications. The name field is now used
> >>>>>>>>>>>>>> as a vector identifier by the KMeansClusterMapper - if it is
> >>>>>>>>>>>>>> non-null - in the output step only.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The keys may not be used internally, but externally they can
> >>>>>>>>>>>>> prove to be pretty useful. For me, keys are userIDs and each
> >>>>>>>>>>>>> Vector represents his/her historical behavior. Being able to
> >>>>>>>>>>>>> collect the output information as <UserID, ClusterID> is quite
> >>>>>>>>>>>>> neat, as it allows me to, for instance, retrieve user
> >>>>>>>>>>>>> information using data directly from an HDFS file's field.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>>> --------------------------
> >>>>>>>> Grant Ingersoll
> >>>>>>>> http://www.lucidimagination.com/
> >>>>>>>>
> >>>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> >>>>>>>> using
> >>>>>>>> Solr/Lucene:
> >>>>>>>> http://www.lucidimagination.com/search
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
>



-- 
Zaki Rahaman
