Hmm, some profiling shows the pain is in the distance calculation in
emitPointToNearestCluster. It seems we only use the optimized
distance calculation for testing convergence; shouldn't we also use
it for calculating the distances to the clusters?
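For instance (just a sketch; the loop and the 'clusters'/'measure'/
'point' names are my own, only the distance(double, Vector, Vector)
overload is the one quoted further down the thread):

Cluster nearestCluster = null;
double nearestDistance = Double.MAX_VALUE;
for (Cluster cluster : clusters) {
  Vector center = cluster.getCenter();
  // ||center||^2 could be computed once per cluster per iteration
  // and cached, rather than recomputed for every input point.
  double centerLengthSquared = center.dot(center);
  double d = measure.distance(centerLengthSquared, center, point);
  if (d < nearestDistance) {
    nearestCluster = cluster;
    nearestDistance = d;
  }
}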
On Jul 27, 2009, at 10:19 AM, Grant Ingersoll wrote:
I can confirm it is taking a while. I spun up the dataset provided
and am on the first iteration; the mapper is at 50% and it has been
running for over an hour.
Not a good sign. I will try profiling.
On Jul 27, 2009, at 10:07 AM, Jeff Eastman wrote:
It's been over a year since I ran any tests of KMeans on larger
data sets and there has been a lot of refactoring done in the
interim. I was also using only dense vectors. It is entirely
possible it is now doing something really poorly. I'm surprised
that it is taking such a long time to munch such a small dataset
but it sounds like you can reproduce it on a single machine so
profiling should suggest the root cause. I'm going to be away from
the computer for the next two weeks - a real vacation - so
unfortunately I won't be able to contribute to this effort.
Jeff
Grant Ingersoll wrote:
On Jul 27, 2009, at 12:00 AM, nfantone wrote:
Thanks, Grant. I just updated and noticed the change.
As a side note: do you think someone could run some real tests on
kMeans in particular, other than the ones already in the project? I
bet there are other naive (or not so naive) problems like that one.
After much coding, reading, and experimenting with clustering in
Mahout over the last few weeks, I am inclined to say something may
not fully work in kMeans as of now. Or perhaps it just needs some
refactoring/performance tweaks. Jeff has claimed to run the job over
gigabytes of data, using a rather small cluster, in minutes. Has
anyone tried to accomplish this recently (since the Hadoop upgrade
to 0.20)? Just use ClusteringUtils to write a file with a (arguably
not so) significant number of random Vectors (say, 800,000+) and let
that be the input of a KMeansMRJob (testKMeansMRJob() could very
well serve this purpose with little change). You'll end up with a
file of about ~85MB to ~100MB, which easily fits into memory on any
modern computer. Now, run the whole thing (I've tried both locally
and on a three-node cluster setup, which, frankly, seemed like a bit
too much computing power for such a small number of items in the
dataset). It'll take forever to complete.
I hope to hit this soon. I've got some Amazon credits I need to
use and hope to put them towards this.
As with any project in open source, we need people to kick the
tires, give feedback (thank you!) and also poke around the code to
make it better.
Have you tried your data with some other clustering code, perhaps
Weka or something like that?
These simple methods could be used to generate any given number of
random SparseVectors for testing's sake, if anyone is interested:

private static Random rnd = new Random();

private static final int CARDINALITY = 1200;
private static final int MAX_NON_ZEROS = 200;
private static final int MAX_VECTORS = 850000;

private static Vector getRandomVector() {
  Integer id = rnd.nextInt(Integer.MAX_VALUE);
  Vector v = new SparseVector(id.toString(), CARDINALITY);
  // Draw a non-zero count in [1, MAX_NON_ZEROS).
  int nonZeros = 0;
  while ((nonZeros = rnd.nextInt(MAX_NON_ZEROS)) == 0);
  // Duplicate indices may make the actual non-zero count smaller.
  for (int i = 0; i < nonZeros; i++) {
    v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
  }
  return v;
}

private static List<Vector> getVectors() {
  List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
  for (int i = 0; i < MAX_VECTORS; i++) {
    vectors.add(getRandomVector());
  }
  return vectors;
}
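In case anyone wants to feed those vectors to a KMeansMRJob,
something along these lines should do it (a sketch only: the output
path is made up, and I'm assuming SparseVector is Writable and keeps
the name passed to its constructor):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("testdata/random-vectors"); // made-up location
SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf,
    path, Text.class, SparseVector.class);
try {
  for (Vector v : getVectors()) {
    // Key each vector by the random name set in getRandomVector().
    seqWriter.append(new Text(v.getName()), (SparseVector) v);
  }
} finally {
  seqWriter.close();
}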
I'm not sure why testing with random vectors would be all that
useful, other than showing that it runs. I wouldn't expect anything
useful to come out of it, though.
On Sun, Jul 26, 2009 at 10:30 PM, Grant Ingersoll <[email protected]> wrote:
Fixed in MAHOUT-152
On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
That does indeed look like a problem. I'll fix.
On Jul 26, 2009, at 2:37 PM, nfantone wrote:
While (still) experiencing performance issues and inspecting the
kMeans code, I found this lying around in
SquaredEuclideanDistanceMeasure.java:

public double distance(double centroidLengthSquare, Vector centroid,
    Vector v) {
  if (centroid.size() != centroid.size()) {
    throw new CardinalityException();
  }
  ...
}

I bet someone meant to compare the centroid and v sizes and didn't
notice.
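Presumably the intended check compares the cardinalities of the two
arguments, something like:

if (centroid.size() != v.size()) {
  throw new CardinalityException();
}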
On Fri, Jul 24, 2009 at 12:38 PM, nfantone <[email protected]> wrote:
Well, as it turned out, it didn't have anything to do with my
performance issue, but I found out that writing a Cluster (with a
single vector as its center) to a file and then reading it back
requires the center to be added as a point; otherwise, you won't be
able to retrieve it as you should. Therefore, one should do:

// Writing
String id = "someID";
Vector v = new SparseVector();
Cluster c = new Cluster(v);
c.addPoint(v);
seqWriter.append(new Text(id), c);

// Reading
Writable key = (Writable) seqReader.getKeyClass().newInstance();
Cluster value = (Cluster) seqReader.getValueClass().newInstance();
while (seqReader.next(key, value)) {
  ...
  Vector centroid = value.getCenter();
  ...
}

This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
this shouldn't happen. Then again, it's not that relevant, I guess.
Sorry for bringing different subjects to the same thread.
On Fri, Jul 24, 2009 at 9:14 AM, nfantone <[email protected]> wrote:
I've been using RandomSeedGenerator to generate initial clusters for
kMeans, and while checking its code I stumbled upon this:

while (reader.next(key, value)) {
  Cluster newCluster = new Cluster(value);
  newCluster.addPoint(value);
  ....
}

I can see it adds the vector to the newly created cluster, even
though the constructor already sets it as the center. Wasn't this
corrected in a past revision? I thought this was not necessary
anymore. I'll look into it a little bit more and see if it has
something to do with the poor performance I'm seeing with my dataset.
On Thu, Jul 23, 2009 at 3:45 PM, nfantone <[email protected]> wrote:
Perhaps a larger convergence value might help (-d, I believe).

I'll try that.

There was no significant change after modifying the convergence
value. At least, none was observed during the first three
iterations, which lasted more or less the same amount of time as
before.
Is there any chance your data is publicly shareable? Come to think
of it, with the vector representations, as long as you don't publish
the key (which terms map to which index), I would think most all
data is publicly shareable.

I'm sorry, I don't quite understand what you're asking. Publicly
shareable? As in user permissions to access/read/write the data?

As in posting a copy of the SequenceFile somewhere for download,
assuming you can. Then others could presumably try it out.

My bad. Of course it is:
http://cringer.3kh.net/web/user-dataset.data.tar.bz2
That's the ~62MB SequenceFile sample I've been using, in <Text,
SparseVector> logical format.
That does seem like an awfully long time for 62 MB on a 6-node
cluster. How many iterations are running?

I'm running the whole thing with a 20-iteration cap. Every
iteration, except the first one (which, oddly, lasted just two
minutes), took around three hours to complete:

Hadoop job_200907221734_0001
Finished in: 1mins, 42sec

Hadoop job_200907221734_0004
Finished in: 2hrs, 34mins, 3sec

Hadoop job_200907221734_0005
Finished in: 2hrs, 59mins, 34sec
How did you generate your initial clusters?

I generated the initial clusters via the RandomSeedGenerator,
setting a 'k' value of 200. This is what I did to initiate the
process for the first time:

./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
./bin/hadoop jar ~/mahout-core-0.2.jar \
    org.apache.mahout.clustering.kmeans.KMeansDriver \
    -i input/user.data -c init -o output -r 32 -d 0.01 -k 200
Where are the iteration jobs spending most of their time (map vs.
reduce)?

I'm tempted to say map here, but their times are actually rather
comparable. Reduce attempts are taking an hour and a half to finish
(on average), and so are map attempts. Here are some representative
examples from the web UI:

reduce
attempt_200907221734_0002_r_000006_0
22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)

map
attempt_200907221734_0002_m_000000_0
22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)

Perhaps there's something wrong with the way I create the
SequenceFile? I could share the Java code as well, if required.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search