If there isn't much demand here, I would just document the difference rather than converge them.

On Fri, Oct 1, 2010 at 1:46 AM, Derek O'Callaghan <[email protected]> wrote:

> I'd also be inclined to have the B option for consistency, although I get
> the feeling that not too many people are using the sequential version, so
> perhaps just documenting it is enough for now if there are higher priorities
> for 0.4.
>
> Derek
>
> On 30/09/10 18:31, Jeff Eastman wrote:
>
>> Derek,
>>
>> The Canopy implementation was probably one of the first Mahout commits.
>> Its reference implementation performs a single pass over the data and, in
>> your case, produces 128 canopies. It is the correct, published Canopy
>> algorithm. In order to become scalable, the MR version does this in each
>> mapper, and then again in the reducer to combine the results of the mapper
>> canopies. This approach was taken from a Google presentation, iirc, and it
>> seems to produce good results. At least it has withstood the test of time.
>>
>> When I added the sequential execution mode to canopy, I just used the
>> existing reference implementation. Now you have noticed that the results are
>> quite different when running the MR version beside the sequential version.
>>
>> I'm not sure which knob to turn here: A) try to modify the MR version to
>> perform a single pass; B) add another pass to the sequential version; or C)
>> just document the difference. A is a hard problem (maybe 0.5) and B an easy
>> change (ok for 0.4). Going for the "low hanging fruit", I'm inclined to do B
>> for consistency.
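
For anyone following along, here is a rough sketch of why the two modes can disagree. It is not Mahout's code: the function names are made up for illustration, it keeps only the tighter distance threshold (called t2 here, an assumption borrowed from the published algorithm), and it uses the first point of each canopy as its center rather than computing a centroid. The "two pass" function imitates what the quoted message describes for the MR version: canopy generation per split, then again over the per-split centers.

```python
from math import dist


def canopy_centers(points, t2):
    """Single pass: a point starts a new canopy unless it lies
    within t2 of a center that was already chosen."""
    centers = []
    for p in points:
        if all(dist(p, c) >= t2 for c in centers):
            centers.append(p)
    return centers


def two_pass_centers(splits, t2):
    """Imitates the mapper/reducer shape: run the single pass on each
    split, then run it once more over the collected split centers."""
    per_split = [c for split in splits for c in canopy_centers(split, t2)]
    return canopy_centers(per_split, t2)


# Toy 1-D data, chosen by hand so the two modes disagree.
points = [(0.0,), (1.5,), (3.0,)]
sequential = canopy_centers(points, t2=2.0)
two_pass = two_pass_centers([points[:1], points[1:]], t2=2.0)
print(len(sequential), len(two_pass))  # prints "2 1"
```

In this toy case the second split picks 1.5 as its center and absorbs 3.0, and the final pass then folds 1.5 into the canopy at 0.0, so the point at 3.0 never gets a canopy of its own. Real data and the real implementation will behave differently in detail, but the effect of the split boundaries is the same kind of thing.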
