Re: kMeans Help

Grant Ingersoll Sun, 28 Jun 2009 14:55:38 -0700


On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:

I get all of this. My point is that when you rehydrate the Cluster, it doesn't properly report the centroid, per my email, all because numPoints == 0 and pointTotal is a vector that is the same as the passed-in center vector, but initialized to 0.

In other words, the simple act of serializing a Cluster to HDFS and then reconstituting it should not alter the result one gets, which I believe is what happens if one dumps out the clusters that have been calculated after the whole process is done.
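To make the round-trip concern concrete, here is a minimal sketch in Python rather than Mahout's Java. ToyCluster, centroid_naive, centroid and round_trip are hypothetical names for illustration only, and the assumption that only the center survives serialization is inferred from the description above, not taken from Mahout's source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a k-means cluster; not Mahout's Cluster class."""
    center: List[float]
    num_points: int = 0
    point_total: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A fresh cluster has a center but no points, so the running sum is zero.
        if not self.point_total:
            self.point_total = [0.0] * len(self.center)

    def add_point(self, point: List[float]) -> None:
        self.num_points += 1
        self.point_total = [t + p for t, p in zip(self.point_total, point)]

    def centroid_naive(self) -> List[float]:
        # Assumed symptom: with no points, the zeroed running sum is reported.
        if self.num_points == 0:
            return list(self.point_total)
        return [t / self.num_points for t in self.point_total]

    def centroid(self) -> List[float]:
        # Fallback: a cluster with no points reports its center instead.
        if self.num_points == 0:
            return list(self.center)
        return [t / self.num_points for t in self.point_total]


def round_trip(cluster: ToyCluster) -> ToyCluster:
    """Assumption: only the computed centroid is persisted, as the new center."""
    return ToyCluster(center=cluster.centroid())


original = ToyCluster(center=[1.0, 1.0])
original.add_point([3.0, 3.0])
original.add_point([5.0, 5.0])
restored = round_trip(original)

print(original.centroid())        # [4.0, 4.0]
print(restored.centroid_naive())  # [0.0, 0.0], the round trip changed the answer
print(restored.centroid())        # [4.0, 4.0], stable with the fallback
```

Under these assumptions, falling back to the center when numPoints == 0 is one way to keep the reported centroid the same before and after the round trip; persisting numPoints and pointTotal alongside the center would be another.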

On Jun 27, 2009, at 11:42 AM, Jeff Eastman wrote:
I think this comment is on the right track. During an iteration, each cluster is created with a center and no points. Then, as each point is compared against the cluster centers, it is added to the closest cluster. If the initial center is considered to be a point, then it will bias the new centroid calculation towards its center, incorrectly, as shown below.
One could argue that the centroid of a degenerate cluster with no points ought to be its center and not a zero vector, but clusters with points should have centroids that do not include it.
nfantone wrote:
On Sat, Jun 27, 2009 at 8:10 AM, Grant Ingersoll <[email protected]> wrote:
On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:
The semantics of constructing a Cluster are odd to me. Do I always have to immediately add a point to the Cluster in order for it to be "real", despite the fact that I added a Center? Isn't adding a Center effectively giving the Cluster one point?
Perhaps I misunderstood you, but I think that assigning a new point to a Cluster (by calling addPoint(Vector)) does not mean you are "adding a center". A center is specified at the beginning of the algorithm, and every iteration, after including a set of new points, recalculates that center by determining a new mean, which is now the centroid of that particular Cluster. So, clearly, the center itself is a proper point in the Cluster and you don't need to add it after it has been selected as the center in order for it to be "real".
And if you add the center, why isn't it the centroid until other points are added?
Again, the centroid is the result of a recalculation of a mean and may or may not be a real point. By having just one point in a Cluster - that is to say, its center - there's no "recalculation" to be done. Conceptually, you could say the centroid lies, in fact, in the center - though it's not relevant to the algorithm.
A final example. Let's say you create a Cluster C with point (1,1) as its center. Then, you add (3,3) to it.

Cluster C: (1,1);(3,3) - original center: (1,1) - centroid: (2,2)
Now, you create another Cluster C' with the same center, but decide to add the point again. Then, (3,3) is added.

Cluster C': (1,1);(1,1);(3,3) - original center: (1,1) - centroid: (5/3, 5/3)
Ok, that was an unnecessary example. Got it. But it shows that C and C' are not the same cluster, based on the fact that point repetition contributes to the overall mean.
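The arithmetic in this example can be checked with a short Python sketch. The centroid helper is a hypothetical function written for illustration, not part of Mahout; it uses exact fractions so that 5/3 is not rounded. The third case, where only the assigned point (3,3) is averaged, is added to show what the centroid would be if the center were not counted as a point.

```python
from fractions import Fraction


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    return tuple(Fraction(sum(p[d] for p in points), n) for d in range(dims))


# Cluster C: the center (1,1) counted as a point, plus (3,3).
c = centroid([(1, 1), (3, 3)])

# Cluster C': the center (1,1) counted twice, plus (3,3).
c_prime = centroid([(1, 1), (1, 1), (3, 3)])

# Only the assigned point, with the center left out of the mean.
assigned_only = centroid([(3, 3)])

print(c)              # (Fraction(2, 1), Fraction(2, 1))
print(c_prime)        # (Fraction(5, 3), Fraction(5, 3))
print(assigned_only)  # (Fraction(3, 1), Fraction(3, 1))
```

Each extra copy of (1,1) pulls the mean further toward the original center, which is the bias described earlier in the thread.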
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search

