Re: Clustering from DB

Jeff Eastman Wed, 15 Jul 2009 08:30:58 -0700

Glad to hear KMeans is working reliably now. Your performance problems
will require some additional tuning. Here are some suggestions:

- You did not mention how many mappers are running in your job. With
  60gb in a single input file, I would think Hadoop would allocate
  multiple mapper tasks automatically, since there are thousands of
  potential splits. If this is not happening (is the file compressed?),
  then breaking it into multiple parts in a preprocessing step would
  allow you to get more concurrency in the map phase.
- Same with the reducers; how many are you running and what is your K?
  The default number of reducers is 2, but you can increase this up to
  the number of clusters to increase parallelism. Unlike Canopy and
  MeanShift, KMeans can use multiple reducers up to that limit.
- Finally, what is the size of your cluster? Adding machines would be
  another way to increase concurrency, since map and reduce tasks are
  spread across the entire cluster.

60 gb is a small dataset for Hadoop. I don't think it should be taking
that long.

Jeff
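
The two concurrency knobs mentioned above (map splits and reducer count) can be estimated with a little arithmetic. The following is only a rough sketch: the 64 MB split size is an assumption and should be replaced with the cluster's actual block size, and `ConcurrencyEstimate`, `estimateSplits`, and `cappedReducers` are hypothetical helper names, not part of Mahout or Hadoop.

```java
final class ConcurrencyEstimate {

    /** Ceiling division: how many splits an uncompressed, splittable file yields. */
    static long estimateSplits(long fileBytes, long splitBytes) {
        if (fileBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    /** Following the advice above, cap the requested reducer count at k (minimum 1). */
    static int cappedReducers(int requested, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        return Math.max(1, Math.min(requested, k));
    }

    public static void main(String[] args) {
        long fileBytes = 60L * 1024 * 1024 * 1024; // the ~60 GB input file
        long splitBytes = 64L * 1024 * 1024;       // ASSUMPTION: 64 MB splits
        System.out.println("map splits: " + estimateSplits(fileBytes, splitBytes));
        System.out.println("reducers:   " + cappedReducers(32, 10));
    }
}
```

If the job reports far fewer map tasks than this kind of estimate suggests, that would point toward a non-splittable (for example, compressed) input file.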


nfantone wrote:

After updating to the latest revision, everything seems to be working
just fine. However, the task I set up to do, user clustering by
KMeans, is taking forever to complete: I started the job yesterday
morning and it's still running today (an elapsed time of nearly 18
hours and counting...). Of course, the main reason behind it is the huge
size of the data set I'm trying to process (a ~60Gb HDFS file), but
I'm looking for ways to improve the performance. Would splitting the
input file into smaller parts make any difference? Is it even possible
to set up the Driver to use more than one input (right now, I'm
specifying a full path to a single file, including its filename)? What
about setting a higher number of reducers? Are there any drawbacks to
that? What about running multiple KMeans jobs in several threads?

Or perhaps I'm just doing something wrong and it should not be taking
this long. Surely, I'm not the first one to encounter this running
time issue with large datasets. Ideas, anyone?


On Mon, Jul 13, 2009 at 2:39 PM, nfantone<[email protected]> wrote:

Great work. It works like a charm now. Thank you very much.

On Mon, Jul 13, 2009 at 1:41 PM, Jeff Eastman<[email protected]> wrote:

r793620 fixes the KMeansDriver.isConverged() method to iterate over all
cluster part files. Unit test now runs without error and the synthetic
control job completes too.


Jeff Eastman wrote:

In this case, the code should be reading all of the clusters into memory
to see if they have all converged. These may be split into multiple part
files if more than one reducer is specified. So /* is the correct file
pattern and it is the calling site that should remove the /part-0000
reference. The code in isConverged should loop through all the parts,
returning if they have all converged or not.

I'll take a detailed look tomorrow.
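
The behaviour described above (loop over every part file and report convergence only if all clusters have converged) can be sketched as follows. This is not the r793620 change: it is a stand-in that uses plain `java.nio` text files instead of Hadoop's `FileSystem` and `SequenceFile.Reader`, and the line format `<clusterId>\t<true|false>` and the class name `ConvergenceCheck` are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class ConvergenceCheck {

    /**
     * Returns true only if every cluster in every part-* file is marked
     * converged. HYPOTHETICAL line format: "<clusterId>\t<true|false>".
     * A directory with no cluster records is treated as not converged.
     */
    static boolean isConverged(Path clustersDir) throws IOException {
        boolean sawCluster = false;
        // The glob plays the role of the "/*" pattern over the clusters directory.
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(clustersDir, "part-*")) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part)) {
                    if (line.trim().isEmpty()) {
                        continue; // skip blank lines
                    }
                    sawCluster = true;
                    String[] fields = line.split("\t");
                    if (fields.length < 2 || !Boolean.parseBoolean(fields[1].trim())) {
                        return false; // one unconverged cluster is enough
                    }
                }
            }
        }
        return sawCluster;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("clusters-0");
        Files.write(dir.resolve("part-00000"), Arrays.asList("c0\ttrue", "c1\ttrue"));
        Files.write(dir.resolve("part-00001"), Arrays.asList("c2\tfalse"));
        System.out.println(isConverged(dir)); // prints false: c2 has not converged
    }
}
```

The point of the sketch is that the caller passes the clusters directory, not a single `part-00000` file, so the result stays correct when more than one reducer writes output.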


Grant Ingersoll wrote:

Hmm, that might be a mistake on my part when trying to resolve how Hadoop
0.20 now resolves globs.  I somewhat blindly applied "/*" where needed, but
I think it is likely worth revisiting here where a specific file is needed?

-Grant

On Jul 10, 2009, at 3:08 PM, nfantone wrote:

This error is still bugging me. The exception:

WARNING: java.io.FileNotFoundException: File
output/clusters-0/part-00000/* does not exist.
java.io.FileNotFoundException: File output/clusters-0/part-00000/*
does not exist.

occurs first at:


org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)

which corresponds to:

 private static boolean isConverged(String filePath, JobConf conf,
     FileSystem fs) throws IOException {
   Path outPart = new Path(filePath + "/*");
   SequenceFile.Reader reader =
       new SequenceFile.Reader(fs, outPart, conf);  // <-- THIS
   ...
 }

where isConverged() is called in this fashion:

return isConverged(clustersOut + "/part-00000", conf, fs);

by runIteration(), which is previously invoked by runJob() like:

    String clustersOut = output + "/clusters-" + iteration;
    converged = runIteration(input, clustersIn, clustersOut, measureClass,
        delta, numReduceTasks, iteration);

Consequently, assuming it's the first iteration and the output folder
has been named "output" by the user, the SequenceFile.Reader receives
"output/clusters-0/part-00000/*" as a path, which is non-existent. I
believe the path should end in "part-00000" and the  + "/*" should be
removed... although someone, evidently, thought otherwise.

Any feedback?
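
The string concatenation traced above can be reproduced in isolation, without Hadoop. This is only an illustrative sketch: `GlobPathDemo` and its method names are hypothetical, and the "intended" pattern is simply the `output/clusters-0/*` form mentioned elsewhere in this thread.

```java
final class GlobPathDemo {

    /** Mirrors: String clustersOut = output + "/clusters-" + iteration; */
    static String clustersOut(String output, int iteration) {
        return output + "/clusters-" + iteration;
    }

    /** Path as built in the failing revision: both caller and callee append a suffix. */
    static String failingPath(String output, int iteration) {
        String filePath = clustersOut(output, iteration) + "/part-00000"; // call site
        return filePath + "/*";                                           // inside isConverged()
    }

    /** Glob over the clusters directory, so that every part file is matched. */
    static String intendedPattern(String output, int iteration) {
        return clustersOut(output, iteration) + "/*";
    }

    public static void main(String[] args) {
        System.out.println(failingPath("output", 0));     // output/clusters-0/part-00000/*
        System.out.println(intendedPattern("output", 0)); // output/clusters-0/*
    }
}
```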

On Mon, Jul 6, 2009 at 5:39 PM, nfantone<[email protected]> wrote:

I was using Canopy to create input clusters, but the error appeared
while running kMeans (if I run kMeans' job only with previously
created clusters from Canopy placed in output/canopies as initial
clusters, it still fails). I noticed no other problems. I was using
revision 790979 before updating.  Strangely, there were no changes in
the job and driver classes from that revision. svn diff shows that the
only classes that changed in the org.apache.mahout.clustering.kmeans
package were KMeansInfo.java and RandomSeedGenerator.java.

On Mon, Jul 6, 2009 at 3:55 PM, Jeff
Eastman<[email protected]> wrote:

Hum, no, it's looking for the output of the first iteration. Were there
other errors? What was the last revision you were running? It does look
like something got horked, as it should be looking for
output/clusters-0/*. Can you diff the job and driver class to see what
changed?

Jeff

nfantone wrote:

Fellows, today I updated to revision 791558 and while running kMeans I
got the following exception:

WARNING: java.io.FileNotFoundException: File
output/clusters-0/part-00000/* does not exist.
java.io.FileNotFoundException: File output/clusters-0/part-00000/*
does not exist.

The algorithm isn't interrupted, though. But this exception wasn't
thrown before the update and, to me, its message is not quite clear.
It seems as if it's looking for any file inside a "part-00000"
directory, which doesn't exist; and, as far as I know, "part-xxxxx" are
the default names for output files.

I could show the entire stack trace, if needed. Any pointers?


On Thu, Jul 2, 2009 at 3:16 PM, nfantone<[email protected]> wrote:

Thanks for the feedback, Jeff.

The logical format of input to KMeans is <Key, Vector> as it is in
sequence file format, but the Key is never used. To my knowledge, there
is no requirement to assign identifiers to the input points*. Users are
free to associate an arbitrary name field with each vector - also label
mappings may be assigned - but these are not manipulated by KMeans or
any of the other clustering applications. The name field is now used as
a vector identifier by the KMeansClusterMapper - if it is non-null - in
the output step only.

Keys may not be used internally, but externally they can prove to
be pretty useful. For me, keys are userIDs and each Vector represents
a user's historical behavior. Being able to collect the output
information as <UserID, ClusterID> is quite neat as it allows me to,
for instance, retrieve user information using data directly from an
HDFS file's field.
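
To make the <UserID, ClusterID> idea concrete, here is a minimal sketch of the output step: each keyed vector is assigned to its nearest centroid and the key is carried through unchanged. It is not Mahout's KMeansClusterMapper. The class and method names are hypothetical, cluster IDs are simply array indexes, and squared Euclidean distance is an assumption standing in for whatever distance measure the job is configured with.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class UserClusterAssignment {

    /** Squared Euclidean distance (ASSUMPTION: stands in for the configured measure). */
    static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the point; ties go to the lower index. */
    static int nearestCluster(double[] point, double[][] centroids) {
        if (centroids.length == 0) {
            throw new IllegalArgumentException("no centroids");
        }
        int best = 0;
        double bestDist = squaredDistance(point, centroids[0]);
        for (int c = 1; c < centroids.length; c++) {
            double dist = squaredDistance(point, centroids[c]);
            if (dist < bestDist) {
                best = c;
                bestDist = dist;
            }
        }
        return best;
    }

    /** Maps each user ID (the otherwise unused key) to its cluster index. */
    static Map<String, Integer> assign(Map<String, double[]> users, double[][] centroids) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : users.entrySet()) {
            result.put(e.getKey(), nearestCluster(e.getValue(), centroids));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, double[]> users = new LinkedHashMap<>();
        users.put("user-1", new double[] {0.0, 0.0});
        users.put("user-2", new double[] {10.0, 10.0});
        double[][] centroids = {{1.0, 1.0}, {9.0, 9.0}};
        System.out.println(assign(users, centroids)); // {user-1=0, user-2=1}
    }
}
```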

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search
