If there isn't much demand here, I would just document the difference rather
than converge them.

On Fri, Oct 1, 2010 at 1:46 AM, Derek O'Callaghan
<[email protected]> wrote:

> I'd also be inclined to have the B option for consistency, although I get
> the feeling that not too many people are using the sequential version, so
> perhaps just documenting it is enough for now if there are higher priorities
> for 0.4.
>
> Derek
>
> On 30/09/10 18:31, Jeff Eastman wrote:
>
>>  Derek,
>>
>> The Canopy implementation was probably one of the first Mahout commits.
>> Its reference implementation performs a single pass over the data and, in
>> your case, produces 128 canopies. It is the correct, published Canopy
>> algorithm. In order to become scalable, the MR version does this in each
>> mapper, and then again in the reducer to combine the results of the mapper
>> canopies. This approach was taken from a Google presentation, iirc, and it
>> seems to produce good results. At least it has withstood the test of time.
>>
>> When I added the sequential execution mode to canopy, I just used the
>> existing reference implementation. Now you have noticed that the results are
>> quite different when running the MR version beside the sequential version.
>>
>> I'm not sure which knob to turn here: A) try to modify the MR version to
>> perform a single pass; B) add another pass to the sequential version; or C)
>> just document the difference. A is a hard problem (maybe 0.5) and B an easy
>> change (ok for 0.4). Going for the "low hanging fruit", I'm inclined to do B
>> for consistency.
>
>
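
For anyone following the thread, here is a rough sketch of the two behaviours Jeff describes: a single pass over all points versus a per-split pass followed by a second pass over the resulting canopy centroids. This is not the Mahout code. The function names (single_pass, centroid, two_pass), the dict layout, the use of Euclidean distance and the way T1/T2 are applied are all assumptions made for illustration.

```python
import math


def single_pass(points, t1, t2):
    """One pass over the points, building canopies as it goes (illustrative only).

    A point joins every canopy whose center is closer than t1. If it is also
    closer than t2 to some center, it does not start a canopy of its own.
    """
    canopies = []
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = math.dist(p, c["center"])
            if d < t1:
                c["members"].append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append({"center": p, "members": [p]})
    return canopies


def centroid(members):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(members)
    return tuple(sum(xs) / n for xs in zip(*members))


def two_pass(splits, t1, t2):
    """Hypothetical stand-in for the MR flow: canopy each split, then canopy the centroids."""
    # "Mapper" step: run the single pass on each split and keep only the centroids.
    local = [
        centroid(c["members"])
        for split in splits
        for c in single_pass(split, t1, t2)
    ]
    # "Reducer" step: run the same pass again over the mapper centroids.
    return single_pass(local, t1, t2)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (5.0,), (6.0,), (10.0,)]
    print(len(single_pass(pts, t1=3.0, t2=2.0)))
    print(len(two_pass([pts[::2], pts[1::2]], t1=3.0, t2=2.0)))
```

Because the second pass sees centroids rather than the original points, and the outcome depends on how the data is split, the two routes are not guaranteed to produce the same canopies, which is the difference being discussed above.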
