On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:
I get all of this. My point is that when you rehydrate the Cluster,
it doesn't properly report the centroid, per my email, because
numPoints == 0 and pointTotal is a vector with the same dimensions as
the passed-in center vector, but initialized to 0.
In other words, the simple act of serializing a Cluster to HDFS and
then reconstituting it should not alter the result one gets, which I
believe is what happens if one dumps out the clusters that have been
calculated after the whole process is done.
On Jun 27, 2009, at 11:42 AM, Jeff Eastman wrote:
I think this comment is on the right track. During an iteration,
each cluster is created with a center and no points. Then, as each
point is compared against the cluster centers, it is added to the
closest cluster. If the initial center is considered to be a point,
then it will bias the new centroid calculation towards its center,
incorrectly, as shown below.
One could argue that the centroid of a degenerate cluster with no
points ought to be its center and not a zero vector, but clusters
with points should have centroids that do not include it.
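
A minimal sketch of the behaviour described above, assuming the centroid is
computed as pointTotal / numPoints. SketchCluster and computeCentroid are
hypothetical names used only for illustration, not Mahout's actual Cluster
code; the numPoints and pointTotal fields simply mirror the names mentioned in
this thread. The center is never counted as a point, and an empty cluster
falls back to its center rather than reporting a zero vector.

```java
import java.util.Arrays;

// Hypothetical stand-in for illustration; not Mahout's actual Cluster class.
class SketchCluster {
    private final double[] center;     // center chosen at the start of the iteration
    private final double[] pointTotal; // running sum of the points added so far
    private int numPoints;             // number of points added so far

    SketchCluster(double[] center) {
        this.center = center.clone();
        this.pointTotal = new double[center.length]; // starts as a zero vector
        this.numPoints = 0;
    }

    // Accumulate a point; the center itself is never counted as a point.
    void addPoint(double[] point) {
        for (int i = 0; i < pointTotal.length; i++) {
            pointTotal[i] += point[i];
        }
        numPoints++;
    }

    // Mean of the added points, or the center when no points were added.
    double[] computeCentroid() {
        if (numPoints == 0) {
            return center.clone(); // degenerate cluster: fall back to the center
        }
        double[] centroid = new double[pointTotal.length];
        for (int i = 0; i < centroid.length; i++) {
            centroid[i] = pointTotal[i] / numPoints;
        }
        return centroid;
    }

    public static void main(String[] args) {
        SketchCluster c = new SketchCluster(new double[] {1.0, 1.0});
        System.out.println(Arrays.toString(c.computeCentroid())); // [1.0, 1.0]
        c.addPoint(new double[] {3.0, 3.0});
        c.addPoint(new double[] {5.0, 5.0});
        System.out.println(Arrays.toString(c.computeCentroid())); // [4.0, 4.0]
    }
}
```

With a fallback like this, a cluster that is written out and read back with
numPoints == 0 would still report its center instead of a zero vector, though
whether that is the right fix for Mahout is exactly what this thread is
debating.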
nfantone wrote:
On Sat, Jun 27, 2009 at 8:10 AM, Grant Ingersoll<[email protected]> wrote:
On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:
The semantics of constructing a Cluster are odd to me. Do I always have
to immediately add a point to the Cluster in order for it to be "real",
despite the fact that I added a Center? Isn't adding a Center
effectively giving the Cluster one point?
Perhaps I misunderstood you, but I think that assigning a new point to
a Cluster (by calling addPoint(Vector)) does not mean you are "adding
a center". A center is specified at the beginning of the algorithm, and
every iteration, after including a set of new points, recalculates that
center by determining a new mean, which is now the centroid of that
particular Cluster. So, clearly, the center itself is a proper point in
the Cluster and you don't need to add it after it has been selected as
the center in order for it to be "real".
And if you add the center, why isn't it the centroid until other points
are added?
Again, the centroid is the result of recalculating a mean and may or
may not be a real point. With just one point in a Cluster - that is to
say, its center - there's no "recalculation" to be done. Conceptually,
you could say the centroid in fact lies at the center, though that's
not relevant to the algorithm.
A final example. Let's say you create a Cluster C with point (1,1) as
its center. Then, you add (3,3) to it.
Cluster C: (1,1);(3,3) - original center: (1,1) - centroid: (2,2)
Now, you create another Cluster C' with the same center, but decide to
add the center point (1,1) again as a regular point. Then, (3,3) is
added.
Cluster C': (1,1);(1,1);(3,3) - original center: (1,1) - centroid: (5/3, 5/3)
Ok, that was an unnecessary example. Got it. But it shows that C and C'
are not the same cluster, because repeated points contribute to the
overall mean.
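
The arithmetic in the example above can be checked with a short sketch. It
assumes the centroid is the plain component-wise mean of every point listed
for the cluster; CentroidExample and mean are hypothetical names for
illustration, not part of Mahout.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; the names are not from Mahout.
class CentroidExample {
    // Component-wise arithmetic mean of a non-empty set of equal-length points.
    static double[] mean(double[][] points) {
        double[] total = new double[points[0].length];
        for (double[] p : points) {
            for (int i = 0; i < total.length; i++) {
                total[i] += p[i];
            }
        }
        for (int i = 0; i < total.length; i++) {
            total[i] /= points.length;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] c = {{1, 1}, {3, 3}};              // Cluster C
        double[][] cPrime = {{1, 1}, {1, 1}, {3, 3}}; // Cluster C'
        System.out.println(Arrays.toString(mean(c)));      // [2.0, 2.0]
        System.out.println(Arrays.toString(mean(cPrime))); // roughly (1.667, 1.667)
    }
}
```

Counting (1,1) twice pulls the mean from (2,2) toward (1,1), which is the
bias toward the original center that the example is meant to show.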
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search