Thanks for the tip. I had been generating the representative points sequentially but was still using the MR versions of the clustering algorithms; I'll change that now.
:)

I just tried this, and there seems to be a difference in behaviour between the sequential and MR versions of Canopy. With MR:

   * The Mapper is called for each point and calls
     canopyClusterer.addPointToCanopies(point.get(), canopies); in my
     case 128 canopies are created.
   * The Reducer is called with the canopy centroid points and calls
     canopyClusterer.addPointToCanopies(point, canopies); for each of
     these centroids, and I end up with 11 canopies.

So the MR version ends up with canopies of canopy centroids.
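
To make the two passes concrete, here is a minimal self-contained sketch of the idea. It is not Mahout's code: the class CanopySketch, the helper canopyCentroids, and the T1/T2 values are assumptions made up for illustration, and addPointToCanopies here is a simplified re-implementation that only borrows the name used above. The "map" pass builds canopies per input split and emits their centroids; the "reduce" pass runs the same routine again over those centroids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Mahout implementation.
final class CanopySketch {

    // A canopy keeps its founding point as the center and a running sum for the centroid.
    static final class Canopy {
        final double[] center;
        final double[] sum;
        int count = 1;

        Canopy(double[] founder) {
            center = founder.clone();
            sum = founder.clone();
        }

        void observe(double[] p) {
            for (int i = 0; i < sum.length; i++) sum[i] += p[i];
            count++;
        }

        double[] centroid() {
            double[] c = new double[sum.length];
            for (int i = 0; i < sum.length; i++) c[i] = sum[i] / count;
            return c;
        }
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // Within T1 of a canopy center: the point contributes to that canopy's centroid.
    // Within T2 of any canopy center: the point may not found a new canopy.
    static void addPointToCanopies(double[] p, List<Canopy> canopies, double t1, double t2) {
        boolean stronglyBound = false;
        for (Canopy c : canopies) {
            double d = distance(c.center, p);
            if (d < t1) c.observe(p);
            if (d < t2) stronglyBound = true;
        }
        if (!stronglyBound) canopies.add(new Canopy(p));
    }

    // One canopy pass over the given points, returning the canopy centroids.
    static List<double[]> canopyCentroids(List<double[]> points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) addPointToCanopies(p, canopies, t1, t2);
        List<double[]> out = new ArrayList<>();
        for (Canopy c : canopies) out.add(c.centroid());
        return out;
    }

    public static void main(String[] args) {
        double t1 = 3.0, t2 = 2.0; // assumed thresholds, T1 > T2
        // Two input splits, standing in for two mappers.
        List<double[]> splitA = Arrays.asList(
                new double[]{0, 0}, new double[]{1, 0}, new double[]{10, 0});
        List<double[]> splitB = Arrays.asList(
                new double[]{0.5, 0}, new double[]{10.5, 0});

        // "Map" pass: canopies per split, centroids emitted.
        List<double[]> mapped = new ArrayList<>(canopyCentroids(splitA, t1, t2));
        mapped.addAll(canopyCentroids(splitB, t1, t2));

        // "Reduce" pass: the same routine applied to the emitted centroids.
        List<double[]> reduced = canopyCentroids(mapped, t1, t2);

        System.out.println(mapped.size() + " canopies after the map pass, "
                + reduced.size() + " after the reduce pass");
    }
}
```

With this toy data the map pass yields 4 canopies and the reduce pass merges them into 2, the same kind of shrinkage as the 128 to 11 seen above. A sequential run that stops after a single pass would never perform that merge.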

However, the sequential version doesn't appear to have an equivalent of the Reducer step, so it ends up with the original number of canopies. Should it also compute the "canopies of canopies"? At the moment the MR version, with its second canopy generation step, is working much better for me, so I'll stick with it for now. I guess the sequential and MR versions should behave consistently? I should probably start a separate thread for this...




I guess I don't quite understand your question. Can you please elaborate?


Sorry, what I wanted to ask was: is it okay to use ClusterEvaluator.intraClusterDensity()? Or should only ClusterEvaluator.interClusterDensity() be used?

I have to leave for the evening, but if you need me to check anything further here re: canopy I can take a look tomorrow.
