Thanks for the tip. I had been generating the representative points
sequentially but was still using the MR versions of the clustering
algorithms. I'll change that now.
:)
I just tried this, and there seems to be a difference in behaviour
between the sequential and MR versions of Canopy. With MR:
* The Mapper is called for each point and calls
canopyClusterer.addPointToCanopies(point.get(), canopies). In my
case this creates 128 canopies.
* The Reducer is called with the canopy centroid points and calls
canopyClusterer.addPointToCanopies(point, canopies) for each of
these centroids, and I end up with 11 canopies.
So we end up with canopies of canopy centroids.
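
To make the two passes concrete, here is a minimal self-contained
sketch. It is not Mahout's code: CanopySketch, Canopy, euclidean,
canopyCentroids and points1d are hypothetical names, the T1/T2 values
and the 1-D points are made up, and the add-point logic is a simplified
reading of addPointToCanopies (a canopy observes every point within T1,
and a new canopy is started unless the point is within T2 of an
existing canopy center).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CanopySketch {

    /** Hypothetical stand-in for a canopy: a fixed center plus a running sum of observed points. */
    static final class Canopy {
        final double[] center; // the point that founded this canopy
        final double[] sum;    // running sum of the founding point and all observed points
        int count = 1;

        Canopy(double[] point) {
            this.center = point.clone();
            this.sum = point.clone();
        }

        void observe(double[] point) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += point[i];
            }
            count++;
        }

        double[] centroid() {
            double[] c = new double[sum.length];
            for (int i = 0; i < c.length; i++) {
                c[i] = sum[i] / count;
            }
            return c;
        }
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    /** Simplified add-point step (assumed behaviour, see the note above). */
    static void addPointToCanopies(double[] point, List<Canopy> canopies, double t1, double t2) {
        boolean stronglyBound = false;
        for (Canopy canopy : canopies) {
            double dist = euclidean(canopy.center, point);
            if (dist < t1) {
                canopy.observe(point);   // loosely bound: contributes to the centroid
            }
            if (dist < t2) {
                stronglyBound = true;    // tightly bound: must not found a new canopy
            }
        }
        if (!stronglyBound) {
            canopies.add(new Canopy(point));
        }
    }

    /** One pass: build canopies over the input and return their centroids. */
    static List<double[]> canopyCentroids(List<double[]> points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) {
            addPointToCanopies(p, canopies, t1, t2);
        }
        List<double[]> out = new ArrayList<>();
        for (Canopy c : canopies) {
            out.add(c.centroid());
        }
        return out;
    }

    /** Helper that wraps scalars as 1-D points. */
    static List<double[]> points1d(double... xs) {
        List<double[]> out = new ArrayList<>();
        for (double x : xs) {
            out.add(new double[] {x});
        }
        return out;
    }

    public static void main(String[] args) {
        double t1 = 3.0, t2 = 1.5; // made-up thresholds
        // "Mapper" pass: each input split builds its own canopies.
        List<double[]> mapperOut = new ArrayList<>();
        for (List<double[]> split : Arrays.asList(points1d(0, 2, 1, 1), points1d(10, 12, 11, 11))) {
            mapperOut.addAll(canopyCentroids(split, t1, t2));
        }
        // "Reducer" pass: the same step, applied to the mapper centroids.
        List<double[]> reducerOut = canopyCentroids(mapperOut, t1, t2);
        System.out.println("canopies after first pass:  " + mapperOut.size());
        System.out.println("canopies after second pass: " + reducerOut.size());
    }
}
```

In this toy data, canopies are founded at points more than T2 apart,
but their centroids drift to within T2 of each other once more points
are observed, so the second pass merges them. A run that stops after
the first pass keeps the larger set.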
However, the sequential version doesn't appear to have an equivalent of
the Reducer step, so it keeps the original number of canopies. Should it
also compute the "canopies of canopies"? At the moment, the MR version
is working much better for me with the second canopy generation step,
so I'll stick with it for now. I guess the behaviour should be
consistent between the sequential and MR versions? I should probably
start a separate thread for this...
I guess I don't quite understand your question. Can you please elaborate?
Sorry, what I wanted to ask was: is it okay to use
ClusterEvaluator.intraClusterDensity()? Or should only
ClusterEvaluator.interClusterDensity() be used?
I have to leave for the evening, but if you need me to check anything
further here re: canopy I can take a look tomorrow.