Re: Standard Deviation of a Set of Vectors

Derek O'Callaghan Fri, 01 Oct 2010 01:47:26 -0700

Hi Jeff,

Thanks for the info on Canopy. In my case, given that I'm seeing better results with the MR version, I'll stick with that for now. I'd also be inclined to have the B option for consistency, although I get the feeling that not too many people are using the sequential version, so perhaps just documenting it is enough for now if there are higher priorities for 0.4.


Derek

On 30/09/10 18:31, Jeff Eastman wrote:

 Derek,
The Canopy implementation was probably one of the first Mahout commits. Its reference implementation performs a single pass over the data and, in your case, produces 128 canopies. It is the correct, published Canopy algorithm. In order to become scalable, the MR version does this in each mapper, and then again in the reducer to combine the results of the mapper canopies. This approach was taken from a Google presentation, iirc, and it seems to produce good results. At least it has withstood the test of time.
When I added the sequential execution mode to canopy, I just used the existing reference implementation. Now you have noticed that the results are quite different when running the MR version beside the sequential version.
I'm not sure which knob to turn here: A) try to modify the MR version to perform a single pass; B) add another pass to the sequential version; or C) just document the difference. A is a hard problem (maybe 0.5) and B an easy change (ok for 0.4). Going for the "low hanging fruit", I'm inclined to do B for consistency.
Can we get some opinions on this from the other Mahouts?

Jeff
PS: On the usability of ClusterEvaluator.intraClusterDensity() (vs. CDbwEvaluator.intraClusterDensity() I presume), I don't have an opinion. Both are pretty experimental IMHO and I'd rather not use "should" for either. It would be interesting to develop some standard data sets against which to compare them both under all of the clustering algorithms. Perhaps a nice wiki page or technical paper for someone to write. I think both evaluators can give useful insight. Again, pick your poison.
On 9/30/10 12:36 PM, Derek O'Callaghan wrote:
Thanks for the tip, I had been generating the representative points sequentially but was still using the MR versions of the clustering algorithms, I'll change that now.
:)
I just tried this, and there seems to be a difference in behaviour between the sequential and MR versions of Canopy. With MR:
   * Mapper called for each point, which calls
     canopyClusterer.addPointToCanopies(point.get(), canopies); - in my
     case 128 canopies are created
   * Reducer called with the canopy centroid points, which then calls
     canopyClusterer.addPointToCanopies(point, canopies); for each of
     these centroids - and I end up with 11 canopies.

And we end up with canopies of canopy centroids.
However, the sequential version doesn't appear to have the equivalent of the Reducer steps, which means that it contains the original number of canopies. Should it also compute the "canopies of canopies"? At the moment, the MR version is working much better for me with the second canopy generation step, so I'll stick with this for now. I guess it should be consistent between sequential and MR? I should probably start a separate thread for this...
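To make the difference concrete, here is a rough sketch of the two behaviours described above. It is not the Mahout source: the class CanopySketch, the Euclidean distance, the thresholds, and the assumption of a single mapper that reuses the same T1/T2 in the second pass are all made up for illustration. Only the method name addPointToCanopies is borrowed from the calls quoted above, and its body is my reading of the canopy rule (observe a point when it is within T1, start a new canopy when it is not within T2 of any existing one).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; NOT the Mahout implementation.
// Class name, distance measure and thresholds are assumptions.
class CanopySketch {

    /** A canopy: a fixed center plus a running sum of every point it has observed. */
    static final class Canopy {
        final double[] center;
        final double[] sum;
        int count;

        Canopy(double[] point) {
            center = point.clone();
            sum = new double[point.length];
            observe(point);
        }

        void observe(double[] point) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += point[i];
            }
            count++;
        }

        /** Mean of the observed points; this is what the second pass clusters. */
        double[] centroid() {
            double[] c = new double[sum.length];
            for (int i = 0; i < sum.length; i++) {
                c[i] = sum[i] / count;
            }
            return c;
        }
    }

    /** Euclidean distance (an assumption; any distance measure could be plugged in). */
    static double distance(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            d += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(d);
    }

    /** Observe the point in every canopy within t1; open a new canopy if none is within t2. */
    static void addPointToCanopies(double[] point, List<Canopy> canopies, double t1, double t2) {
        boolean stronglyBound = false;
        for (Canopy canopy : canopies) {
            double dist = distance(canopy.center, point);
            if (dist < t1) {
                canopy.observe(point);
            }
            stronglyBound = stronglyBound || dist < t2;
        }
        if (!stronglyBound) {
            canopies.add(new Canopy(point));
        }
    }

    /** What the sequential version does: one pass over the data. */
    static List<Canopy> singlePass(List<double[]> points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) {
            addPointToCanopies(p, canopies, t1, t2);
        }
        return canopies;
    }

    /** What MR does with one mapper: a second pass over the first pass's centroids. */
    static List<Canopy> twoPass(List<double[]> points, double t1, double t2) {
        List<double[]> centroids = new ArrayList<>();
        for (Canopy c : singlePass(points, t1, t2)) {
            centroids.add(c.centroid());
        }
        return singlePass(centroids, t1, t2);
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0.0, 0.0});
        points.add(new double[] {1.5, 0.0});
        points.add(new double[] {10.0, 0.0});
        System.out.println("single pass: " + singlePass(points, 3.0, 1.0).size());
        System.out.println("two passes:  " + twoPass(points, 3.0, 1.0).size());
    }
}
```

With these toy points the single pass keeps three canopies, while the second pass merges the first two because their centroids (0.75 and 1.5) end up closer than T2 even though their centers were not. That is the same effect as the 128 vs. 11 above, just on a much smaller scale.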
I guess I don't quite understand your question. Can you please elaborate?
Sorry, what I wanted to ask was: is it okay to use ClusterEvaluator.intraClusterDensity()? Or should only ClusterEvaluator.interClusterDensity() be used?
I have to leave for the evening, but if you need me to check anything further here re: canopy I can take a look tomorrow.
