Derek,
The Canopy implementation was probably one of the first Mahout commits.
Its reference implementation performs a single pass over the data and,
in your case, produces 128 canopies. It is the correct, published Canopy
algorithm. To make it scalable, the MR version runs this pass in each
mapper, and then runs it again in the reducer to combine the canopies
produced by the mappers. This approach was taken from a Google presentation,
iirc, and it seems to produce good results. At least it has withstood
the test of time.
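For anyone following along, here's a minimal sketch of that single pass. The T1/T2 thresholds, the class name, and the points are illustrative only; this is not Mahout's actual CanopyClusterer, which additionally tracks canopy membership and running centroids:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of single-pass canopy seeding (not Mahout's code).
public class CanopySketch {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // For each point: it would be added to every canopy within T1 (member
    // bookkeeping omitted here); if it is not within T2 of any existing
    // canopy, it seeds a new canopy.
    static List<double[]> assign(List<double[]> points, double t1, double t2) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (double[] c : centers) {
                double d = dist(p, c);
                // d < t1 would add p to canopy c's member list (omitted)
                if (d < t2) stronglyBound = true;
            }
            if (!stronglyBound) centers.add(p); // p seeds a new canopy
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>();
        pts.add(new double[]{0, 0});
        pts.add(new double[]{0.1, 0});  // within T2 of the first: no new canopy
        pts.add(new double[]{10, 10});  // far from both thresholds: new canopy
        System.out.println(assign(pts, 3.0, 1.0).size()); // prints 2
    }
}
```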
When I added the sequential execution mode to canopy, I just used the
existing reference implementation. Now you have noticed that the results
are quite different when running the MR version beside the sequential
version.
I'm not sure which knob to turn here: A) try to modify the MR version to
perform a single pass; B) add another pass to the sequential version; or
C) just document the difference. A is a hard problem (maybe for 0.5) and
B is an easy change (OK for 0.4). Going for the "low-hanging fruit", I'm
inclined to do B for consistency.
Can we get some opinions on this from the other Mahouts?
Jeff
PS: On the usability of ClusterEvaluator.intraClusterDensity() (vs.
CDbwEvaluator.intraClusterDensity() I presume), I don't have an opinion.
Both are pretty experimental IMHO and I'd rather not use "should" for
either. It would be interesting to develop some standard data sets
against which to compare them both under all of the clustering
algorithms. Perhaps a nice wiki page or technical paper for someone to
write. I think both evaluators can give useful insight. Again, pick your
poison.
On 9/30/10 12:36 PM, Derek O'Callaghan wrote:
Thanks for the tip. I had been generating the representative points
sequentially but was still using the MR versions of the clustering
algorithms; I'll change that now.
:)
I just tried this, and there seems to be a difference in behaviour
between the sequential and MR versions of Canopy. With MR:
* The mapper is called for each point and calls
canopyClusterer.addPointToCanopies(point.get(), canopies); in my
case, 128 canopies are created.
* The reducer is called with the canopy centroid points and calls
canopyClusterer.addPointToCanopies(point, canopies); for each of
these centroids; I end up with 11 canopies.
And we end up with canopies of canopy centroids.
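The two-stage behaviour can be sketched as running the same single-pass assignment twice: once per mapper partition, then once over the concatenated mapper centers. This is only an illustration with made-up points, thresholds, and class name, not Mahout's actual CanopyMapper/CanopyReducer code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative two-stage canopy combine (not Mahout's code).
public class CanopyTwoPass {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Single-pass canopy seeding: a point becomes a new center only if it
    // is not within T2 of an existing center.
    static List<double[]> canopyCenters(List<double[]> points, double t2) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (double[] c : centers) {
                if (dist(p, c) < t2) { stronglyBound = true; break; }
            }
            if (!stronglyBound) centers.add(p);
        }
        return centers;
    }

    public static void main(String[] args) {
        double t2 = 1.0;
        // Two "mappers" see different partitions of the same two clusters.
        List<double[]> mapperA = List.of(new double[]{0, 0}, new double[]{10, 0});
        List<double[]> mapperB = List.of(new double[]{0.5, 0}, new double[]{10.3, 0});

        // Each mapper emits its own canopy centers (2 each, 4 in total).
        List<double[]> emitted = new ArrayList<>();
        emitted.addAll(canopyCenters(mapperA, t2));
        emitted.addAll(canopyCenters(mapperB, t2));

        // The "reducer" clusters the emitted centers again, merging the
        // near-duplicate centers produced by different mappers.
        List<double[]> combined = canopyCenters(emitted, t2);
        System.out.println(emitted.size() + " -> " + combined.size()); // prints "4 -> 2"
    }
}
```

The second pass is what collapses the 128 per-mapper canopies down to 11 in your run: centers that different mappers seeded from the same underlying cluster fall within T2 of each other and merge.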
However, the sequential version doesn't appear to have the equivalent
of the reducer step, which means that it retains the original number
of canopies. Should it also compute the "canopies of canopies"? At the
moment, the MR version is working much better for me with the second
canopy generation step, so I'll stick with this for now. I guess it
should be consistent between sequential and MR? I should probably
start a separate thread for this...
I guess I don't quite understand your question. Can you please
elaborate?
Sorry, what I wanted to ask was: is it okay to use
ClusterEvaluator.intraClusterDensity()? Or should only
ClusterEvaluator.interClusterDensity() be used?
I have to leave for the evening, but if you need me to check anything
further here re: canopy I can take a look tomorrow.