Derek,
The Canopy implementation was probably one of the first Mahout commits.
Its reference implementation performs a single pass over the data and,
in your case, produces 128 canopies. It is the correct, published Canopy
algorithm. To make it scalable, the MR version runs this pass in each
mapper, and then runs it again in the reducer to combine the canopies
produced by the mappers. This approach was taken from a Google presentation,
iirc, and it seems to produce good results. At least it has withstood
the test of time.
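For anyone following along, here's a minimal sketch of that single pass. The T1/T2 thresholds, the class name, and the points are illustrative only; this is not Mahout's actual CanopyClusterer, which additionally tracks canopy membership and running centroids:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of single-pass canopy seeding (not Mahout's code).
public class CanopySketch {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // For each point: it would be added to every canopy within T1 (member
    // bookkeeping omitted here); if it is not within T2 of any existing
    // canopy, it seeds a new canopy.
    static List<double[]> assign(List<double[]> points, double t1, double t2) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (double[] c : centers) {
                double d = dist(p, c);
                // d < t1 would add p to canopy c's member list (omitted)
                if (d < t2) stronglyBound = true;
            }
            if (!stronglyBound) centers.add(p); // p seeds a new canopy
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>();
        pts.add(new double[]{0, 0});
        pts.add(new double[]{0.1, 0});  // within T2 of the first: no new canopy
        pts.add(new double[]{10, 10});  // far from both thresholds: new canopy
        System.out.println(assign(pts, 3.0, 1.0).size()); // prints 2
    }
}
```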
When I added the sequential execution mode to canopy, I just used the
existing reference implementation. Now you have noticed that the results
are quite different when running the MR version beside the sequential
version.
I'm not sure which knob to turn here: A) try to modify the MR version to
perform a single pass; B) add another pass to the sequential version; or
C) just document the difference. A is a hard problem (maybe for 0.5) and
B is an easy change (OK for 0.4). Going for the "low-hanging fruit", I'm
inclined to do B for consistency.
Can we get some opinions on this from the other Mahouts?
Jeff
PS: On the usability of ClusterEvaluator.intraClusterDensity() (vs.
CDbwEvaluator.intraClusterDensity() I presume), I don't have an opinion.
Both are pretty experimental IMHO and I'd rather not use "should" for
either. It would be interesting to develop some standard data sets
against which to compare them both under all of the clustering
algorithms. Perhaps a nice wiki page or technical paper for someone to
write. I think both evaluators can give useful insight. Again, pick your
poison.
On 9/30/10 12:36 PM, Derek O'Callaghan wrote:
Thanks for the tip. I had been generating the representative points
sequentially but was still using the MR versions of the clustering
algorithms; I'll change that now.
:)
I just tried this, and there seems to be a difference in behaviour
between the sequential and MR versions of Canopy. With MR:
* The mapper is called for each point and calls
canopyClusterer.addPointToCanopies(point.get(), canopies); in my
case, 128 canopies are created.
* The reducer is called with the canopy centroid points and calls
canopyClusterer.addPointToCanopies(point, canopies); for each of
these centroids; I end up with 11 canopies.
And we end up with canopies of canopy centroids.
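The two-stage behaviour can be sketched as running the same single-pass assignment twice: once per mapper partition, then once over the concatenated mapper centers. This is only an illustration with made-up points, thresholds, and class name, not Mahout's actual CanopyMapper/CanopyReducer code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative two-stage canopy combine (not Mahout's code).
public class CanopyTwoPass {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Single-pass canopy seeding: a point becomes a new center only if it
    // is not within T2 of an existing center.
    static List<double[]> canopyCenters(List<double[]> points, double t2) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (double[] c : centers) {
                if (dist(p, c) < t2) { stronglyBound = true; break; }
            }
            if (!stronglyBound) centers.add(p);
        }
        return centers;
    }

    public static void main(String[] args) {
        double t2 = 1.0;
        // Two "mappers" see different partitions of the same two clusters.
        List<double[]> mapperA = List.of(new double[]{0, 0}, new double[]{10, 0});
        List<double[]> mapperB = List.of(new double[]{0.5, 0}, new double[]{10.3, 0});

        // Each mapper emits its own canopy centers (2 each, 4 in total).
        List<double[]> emitted = new ArrayList<>();
        emitted.addAll(canopyCenters(mapperA, t2));
        emitted.addAll(canopyCenters(mapperB, t2));

        // The "reducer" clusters the emitted centers again, merging the
        // near-duplicate centers produced by different mappers.
        List<double[]> combined = canopyCenters(emitted, t2);
        System.out.println(emitted.size() + " -> " + combined.size()); // prints "4 -> 2"
    }
}
```

The second pass is what collapses the 128 per-mapper canopies down to 11 in your run: centers that different mappers seeded from the same underlying cluster fall within T2 of each other and merge.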
However, the sequential version doesn't appear to have the equivalent
of the reducer step, which means that it retains the original number
of canopies. Should it also compute the "canopies of canopies"? At the
moment, the MR version is working much better for me with the second
canopy generation step, so I'll stick with this for now. I guess it
should be consistent between sequential and MR? I should probably
start a separate thread for this...
I guess I don't quite understand your question. Can you please
elaborate?
Sorry, what I wanted to ask was: is it okay to use
ClusterEvaluator.intraClusterDensity()? Or should only
ClusterEvaluator.interClusterDensity() be used?
I have to leave for the evening, but if you need me to check anything
further here re: canopy I can take a look tomorrow.