On Jun 28, 2009, at 5:55 PM, Grant Ingersoll wrote:
On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:
I get all of this. My point is that when you rehydrate the Cluster,
it doesn't properly report the centroid, as described in my earlier
email, because numPoints == 0 and pointTotal is a vector of the same
cardinality as the passed-in center vector, but initialized to 0.
In other words, the simple act of serializing a Cluster to HDFS and
then reconstituting it should not alter the result one gets, yet I
believe that is exactly what happens if one dumps out the clusters
that have been calculated after the whole process is done.
[1] is what I had to do to work around it for the Random approach, but
I think it isn't the right approach.
I think the problem lies in computeCentroid:
  private Vector computeCentroid() {
    if (numPoints == 0)
      return pointTotal;
    else if (centroid == null) {
      // lazy compute new centroid
      centroid = pointTotal.divide(numPoints);
      Vector stds = pointSquaredTotal.times(numPoints)
          .minus(pointTotal.times(pointTotal))
          .assign(new SquareRootFunction())
          .divide(numPoints);
      std = stds.zSum() / 2;
    }
    return centroid;
  }
I don't understand why, if numPoints == 0, the next line isn't just
"return center;". Why wouldn't the center and the centroid be the same
if there are no points? In the rehydration case (or in the case of
just calling new Cluster(center)), pointTotal is just a vector of the
same cardinality as center, but with all values set to zero.
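To make the point concrete, here is a minimal sketch of the behavior I'm describing. It does not use Mahout's actual classes: SimpleCluster, centroidAsQuoted and centroidProposed are hypothetical names, plain double[] arrays stand in for Vector, and the lazy caching and std calculation are left out. It only illustrates the numPoints == 0 branch and why adding the center as a point hides the problem.

```java
import java.util.Arrays;

public class CentroidSketch {

    /** Hypothetical stand-in for Cluster; not Mahout's real class. */
    static final class SimpleCluster {
        final double[] center;
        final double[] pointTotal; // same cardinality as center, starts at all zeros
        int numPoints = 0;

        SimpleCluster(double[] center) {
            this.center = center.clone();
            this.pointTotal = new double[center.length];
        }

        void addPoint(double[] point) {
            for (int i = 0; i < point.length; i++) {
                pointTotal[i] += point[i];
            }
            numPoints++;
        }

        /** Mirrors the quoted logic: with no points, the zeroed pointTotal is returned. */
        double[] centroidAsQuoted() {
            if (numPoints == 0) {
                return pointTotal.clone();
            }
            return mean();
        }

        /** The suggested change (an assumption, not committed code): fall back to center. */
        double[] centroidProposed() {
            if (numPoints == 0) {
                return center.clone();
            }
            return mean();
        }

        private double[] mean() {
            double[] result = new double[pointTotal.length];
            for (int i = 0; i < result.length; i++) {
                result[i] = pointTotal[i] / numPoints;
            }
            return result;
        }
    }

    public static void main(String[] args) {
        double[] center = {1.5, 2.5};

        // Equivalent to new Cluster(center), or to a rehydrated Cluster with no points.
        SimpleCluster fresh = new SimpleCluster(center);
        System.out.println(Arrays.toString(fresh.centroidAsQuoted())); // [0.0, 0.0]
        System.out.println(Arrays.toString(fresh.centroidProposed())); // [1.5, 2.5]

        // The workaround in [1]: add the center itself as a point.
        SimpleCluster seeded = new SimpleCluster(center);
        seeded.addPoint(center);
        System.out.println(Arrays.toString(seeded.centroidAsQuoted())); // [1.5, 2.5]
    }
}
```

Note that the workaround gives the right centroid only by leaving numPoints at 1 for a cluster that has not really been assigned any points, which is part of why I don't think it is the right fix.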
[1]:
Author: gsingers
Date: Sat Jun 27 02:57:18 2009
New Revision: 788919
URL: http://svn.apache.org/viewvc?rev=788919&view=rev
Log:
add the center as a point
Modified:
    lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java

Modified: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
URL:
http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?rev=788919&r1=788918&r2=788919&view=diff
==============================================================================
--- lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java (original)
+++ lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java Sat Jun 27 02:57:18 2009
@@ -54,7 +54,9 @@
         if (log.isInfoEnabled()) {
           log.info("Selected: " + value.asFormatString());
         }
-        writer.append(new Text(key.toString()), new Cluster(value));
+        Cluster val = new Cluster(value);
+        val.addPoint(value);
+        writer.append(new Text(key.toString()), val);
         count++;
       }
     }