I get all of this. My point is that when you rehydrate the Cluster, it doesn't properly report the centroid (per my email), because numPoints == 0 and pointTotal is a vector like the passed-in center vector, but initialized to 0.


On Jun 27, 2009, at 11:42 AM, Jeff Eastman wrote:

I think this comment is on the right track. During an iteration, each cluster is created with a center and no points. Then, as each point is compared against the cluster centers, it is added to the closest cluster. If the initial center is considered to be a point, then it will bias the new centroid calculation towards its center, incorrectly, as shown below.

One could argue that the centroid of a degenerate cluster with no points ought to be its center and not a zero vector, but clusters with points should have centroids that do not include it.
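
The fallback argued for above can be sketched roughly as follows. This is only an illustration under stated assumptions: SketchCluster is a hypothetical stand-in, not Mahout's Cluster class, and it borrows the names numPoints, pointTotal and addPoint from this thread while using plain double[] arrays instead of Vector. The center is never counted as a point; it is returned only when no points have been added.

```java
import java.util.Arrays;

// Hypothetical stand-in for illustration only; this is NOT Mahout's Cluster class.
class SketchCluster {
    private final double[] center;     // center the cluster was created with
    private final double[] pointTotal; // running sum of added points, initialized to 0
    private int numPoints = 0;         // number of points added so far

    SketchCluster(double[] center) {
        this.center = center.clone();
        this.pointTotal = new double[center.length];
    }

    // Adds a point to the running total; the center itself is never added.
    void addPoint(double[] point) {
        for (int i = 0; i < pointTotal.length; i++) {
            pointTotal[i] += point[i];
        }
        numPoints++;
    }

    // Returns the center for a degenerate (empty) cluster,
    // otherwise the mean of the added points only.
    double[] computeCentroid() {
        if (numPoints == 0) {
            return center.clone();
        }
        double[] centroid = new double[pointTotal.length];
        for (int i = 0; i < centroid.length; i++) {
            centroid[i] = pointTotal[i] / numPoints;
        }
        return centroid;
    }

    public static void main(String[] args) {
        SketchCluster c = new SketchCluster(new double[] {1, 1});
        System.out.println(Arrays.toString(c.computeCentroid())); // [1.0, 1.0]
        c.addPoint(new double[] {3, 3});
        System.out.println(Arrays.toString(c.computeCentroid())); // [3.0, 3.0]
    }
}
```

With this shape, a cluster rebuilt with only a center would report that center rather than a zero vector, and the center would not bias the centroid once points arrive.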

nfantone wrote:
On Sat, Jun 27, 2009 at 8:10 AM, Grant Ingersoll<[email protected]> wrote:

On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:


The semantics of constructing a Cluster are odd to me. Do I always have to immediately add a point to the Cluster in order for it to be "real", despite the fact that I added a Center? Isn't adding a Center effectively
giving the Cluster one point?



Perhaps I misunderstood you, but I think that assigning a new point
(by calling addPoint(Vector)) to a Cluster does not mean you are
"adding a center". A center is specified at the beginning of the
algorithm, and every iteration, after including a set of new points,
recalculates that center by determining a new mean, which is now the
centroid of that particular Cluster. So, clearly, the center itself is
a proper point in the Cluster and you don't need to add it after it is
selected as such in order for it to be "real".


And if you add the center, why isn't it the centroid until other points are
added?



Again, the centroid is the result of a recalculation of a mean and
may or may not be a real point. By having just one point in a Cluster
- that is to say, its center - there's no "recalculation" to be done.
Conceptually, you could say the centroid lies, in fact, in the center
- though, it's not relevant to the algorithm.

A final example. Let's say you create a Cluster C with point (1,1) as
its center. Then, you add (3,3) to it.

Cluster C: (1,1);(3,3) - original center: (1,1) - centroid: (2,2)

Now, you create another Cluster C' with the same center, but decide to
add the point again. Then, (3,3) is added.

Cluster C': (1,1);(1,1);(3,3) - original center: (1,1) - centroid (5/3, 5/3).

Ok, that was an unnecessary example. Got it. But it shows that C and C'
are not the same cluster, because point repetition
contributes to the overall mean.
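
For anyone who wants to check the arithmetic for C and C', here is a small sketch. It assumes the centroid is the plain arithmetic mean of every listed point, center included, as in the example above; CentroidExample and mean are hypothetical names for illustration, not part of Mahout.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Mahout.
class CentroidExample {
    // Component-wise arithmetic mean of a non-empty set of equal-length points.
    static double[] mean(double[][] points) {
        double[] result = new double[points[0].length];
        for (double[] p : points) {
            for (int i = 0; i < result.length; i++) {
                result[i] += p[i];
            }
        }
        for (int i = 0; i < result.length; i++) {
            result[i] /= points.length;
        }
        return result;
    }

    public static void main(String[] args) {
        // Cluster C: (1,1);(3,3)
        double[] c = mean(new double[][] {{1, 1}, {3, 3}});
        // Cluster C': (1,1);(1,1);(3,3)
        double[] cPrime = mean(new double[][] {{1, 1}, {1, 1}, {3, 3}});
        System.out.println(Arrays.toString(c));      // [2.0, 2.0]
        System.out.println(Arrays.toString(cPrime)); // roughly (5/3, 5/3)
    }
}
```

The repeated (1,1) pulls the mean of C' from (2,2) toward the original center, which is the bias discussed earlier in the thread.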





--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
