OK, please commit. Thx!
On Jun 29, 2009, at 4:47 PM, Jeff Eastman wrote:
Changing the centroid of an empty cluster to return its center fixes
a bug in the convergence calculation and causes convergence to
happen earlier. Because an empty cluster used to return a zero
centroid vector instead of its center, the convergence test marked
empty clusters as not converged. The fix therefore changes the
outcome of the clustering. I changed expectedNumPoints[2] to
{4,4,1} and the test passes.
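To illustrate why the centroid of an empty cluster affects convergence, here is a minimal sketch. It assumes a convergence test that compares the distance between a cluster's center and its centroid against a threshold; the class name, method names, and the Euclidean measure are all hypothetical and are not taken from the Mahout code.

```java
class ConvergenceSketch {
    // Euclidean distance between two equal-length vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Assumed convergence test: the cluster has converged when its
    // centroid lies within delta of its center.
    static boolean converged(double[] center, double[] centroid, double delta) {
        return distance(center, centroid) <= delta;
    }

    public static void main(String[] args) {
        double[] center = {3.0, 4.0};
        double[] zero = {0.0, 0.0};
        // Empty cluster reporting a zero centroid: distance is 5.0.
        System.out.println(converged(center, zero, 0.5));   // false
        // Empty cluster reporting its center: distance is 0.0.
        System.out.println(converged(center, center, 0.5)); // true
    }
}
```

Under these assumptions, an empty cluster whose center is not near the origin can never pass the test while it reports a zero centroid, which would match the later convergence seen before the change.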
Grant Ingersoll wrote:
FYI, if I make this change the only test that fails is
TestKmeansClustering#testReferenceImplementation.
See MAHOUT-141
On Jun 29, 2009, at 12:07 PM, Jeff Eastman wrote:
I have no problem with returning center as the centroid for a
cluster with no points. From Ted's earlier discussion, the center
is the prior expectation of the centroid and returning a zero
vector is just a bug that has not made itself apparent until now.
I also agree that serializing and then deserializing a cluster (or
any object for that matter) should not alter its state.
Grant Ingersoll wrote:
On Jun 28, 2009, at 5:55 PM, Grant Ingersoll wrote:
On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:
I get all of this. My point is that when you rehydrate the
Cluster, it doesn't properly report the centroid (per my email),
all because numPoints == 0 and pointTotal is a vector of the same
cardinality as the passed-in center vector, but initialized to 0.
In other words, the simple act of serializing a Cluster to HDFS
and then reconstituting it should not alter the result one gets,
which I believe is what happens if one dumps out the clusters
that have been calculated after the whole process is done.
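The round-trip problem can be sketched as follows. This is a simplified stand-in, not the Mahout Cluster class: it assumes that only a single vector (the computed centroid, stored as the new center) survives serialization, and it uses the pre-fix rule that an empty cluster returns pointTotal. The class and the roundTrip method are hypothetical.

```java
import java.util.Arrays;

class RehydrationSketch {
    final double[] center;
    final double[] pointTotal; // running sum of the added points
    int numPoints;

    // A fresh cluster starts with a zero pointTotal of the same
    // cardinality as center.
    RehydrationSketch(double[] center) {
        this.center = center.clone();
        this.pointTotal = new double[center.length];
    }

    void addPoint(double[] p) {
        for (int i = 0; i < p.length; i++) {
            pointTotal[i] += p[i];
        }
        numPoints++;
    }

    // Pre-fix rule: an empty cluster returns pointTotal (all zeros).
    double[] centroid() {
        if (numPoints == 0) {
            return pointTotal.clone();
        }
        double[] c = new double[pointTotal.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = pointTotal[i] / numPoints;
        }
        return c;
    }

    // Assumption: the written form carries only the centroid, which
    // becomes the center of the reconstituted cluster.
    RehydrationSketch roundTrip() {
        return new RehydrationSketch(centroid());
    }

    public static void main(String[] args) {
        RehydrationSketch c = new RehydrationSketch(new double[]{0.0, 0.0});
        c.addPoint(new double[]{2.0, 2.0});
        c.addPoint(new double[]{4.0, 4.0});
        System.out.println(Arrays.toString(c.centroid()));             // [3.0, 3.0]
        System.out.println(Arrays.toString(c.roundTrip().centroid())); // [0.0, 0.0]
    }
}
```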
[1] is what I had to do to work around it for the Random
approach, but I think it isn't the right approach.
I think the problem lies in computeCentroid:
  private Vector computeCentroid() {
    if (numPoints == 0)
      return pointTotal;
    else if (centroid == null) {
      // lazy compute new centroid
      centroid = pointTotal.divide(numPoints);
      Vector stds = pointSquaredTotal.times(numPoints)
          .minus(pointTotal.times(pointTotal))
          .assign(new SquareRootFunction())
          .divide(numPoints);
      std = stds.zSum() / 2;
    }
    return centroid;
  }
I don't understand why, if numPoints == 0, the next line isn't
just: return center; Why wouldn't the center and the centroid be
the same if there are no points? In the rehydration case (or when
just calling new Cluster(center)), pointTotal is just a vector of
the same cardinality as center, but with all values zero.
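The proposed change can be sketched as below, using plain double arrays rather than Mahout's Vector type. The class and method signature are hypothetical, and the std calculation from the original method is left out to keep the example short.

```java
class CentroidFixSketch {
    // Returns the center when the cluster holds no points, otherwise
    // the mean of the accumulated points.
    static double[] computeCentroid(double[] center, double[] pointTotal, int numPoints) {
        if (numPoints == 0) {
            return center.clone(); // proposed: center, not the zero pointTotal
        }
        double[] centroid = new double[pointTotal.length];
        for (int i = 0; i < centroid.length; i++) {
            centroid[i] = pointTotal[i] / numPoints;
        }
        return centroid;
    }

    public static void main(String[] args) {
        double[] center = {1.0, 2.0};
        // Empty cluster: the centroid equals the center.
        double[] empty = computeCentroid(center, new double[]{0.0, 0.0}, 0);
        System.out.println(empty[0] + ", " + empty[1]); // 1.0, 2.0
        // Two points summing to (6, 8): the centroid is their mean.
        double[] mean = computeCentroid(center, new double[]{6.0, 8.0}, 2);
        System.out.println(mean[0] + ", " + mean[1]);   // 3.0, 4.0
    }
}
```

With this rule, a cluster created from a center alone reports that center as its centroid, so no extra addPoint call would be needed when seeding.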
[1]:
Author: gsingers
Date: Sat Jun 27 02:57:18 2009
New Revision: 788919
URL: http://svn.apache.org/viewvc?rev=788919&view=rev
Log:
add the center as a point
Modified:
    lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
Modified: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
URL: http://svn.apache.org/viewvc/lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?rev=788919&r1=788918&r2=788919&view=diff
==============================================================================
--- lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java (original)
+++ lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java Sat Jun 27 02:57:18 2009
@@ -54,7 +54,9 @@
         if (log.isInfoEnabled()) {
           log.info("Selected: " + value.asFormatString());
         }
-        writer.append(new Text(key.toString()), new Cluster(value));
+        Cluster val = new Cluster(value);
+        val.addPoint(value);
+        writer.append(new Text(key.toString()), val);
         count++;
       }
     }
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search