Good to hear. The current implementation is actually the first one I did, so it was easy to revert to that model. It does require the mapper to retain all of the canopies, however, and this could cause an OOME (OutOfMemoryError) if the T values are poorly chosen. Doing the centroid calculation in the combiner removed this difficulty, but the Hadoop semantics change makes it a non-starter. If there were some globally unique way to create new cluster identifiers as they are needed, the centroid calculation could be moved to the reducer. There would still be a need to combine the clusters created by each of the mappers...
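
To make the memory concern concrete, here is a rough sketch of a mapper-side canopy builder. It is not the actual Mahout code: the class and method names (CanopySketch, addPoint, centroids) are made up for illustration, points are plain double arrays, and the distance is assumed to be Euclidean. Every canopy stays in memory until the end of the input, so a very small T2 ends up creating roughly one canopy per point.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a per-mapper canopy builder (not Mahout's classes). */
class CanopySketch {

    /** One canopy: a fixed center plus a running sum of the points assigned to it. */
    static final class Canopy {
        final double[] center;
        final double[] sum;
        int count;

        Canopy(double[] point) {
            center = point.clone();
            sum = new double[point.length];
            add(point);
        }

        void add(double[] point) {
            for (int i = 0; i < point.length; i++) {
                sum[i] += point[i];
            }
            count++;
        }

        double[] centroid() {
            double[] c = new double[sum.length];
            for (int i = 0; i < sum.length; i++) {
                c[i] = sum[i] / count;
            }
            return c;
        }
    }

    private final double t1;
    private final double t2;
    // Retained for the whole input split: this list is what can exhaust the heap.
    private final List<Canopy> canopies = new ArrayList<>();

    CanopySketch(double t1, double t2) {
        this.t1 = t1;
        this.t2 = t2;
    }

    /** Euclidean distance (an assumption; any distance measure would do). */
    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    /** Called once per input point, in the role of map(). */
    void addPoint(double[] point) {
        boolean stronglyBound = false;
        for (Canopy c : canopies) {
            double d = distance(c.center, point);
            if (d < t1) {
                c.add(point);          // within the loose threshold: contributes to the centroid
            }
            if (d < t2) {
                stronglyBound = true;  // within the tight threshold: no new canopy needed
            }
        }
        if (!stronglyBound) {
            canopies.add(new Canopy(point));
        }
    }

    /** Called once after the last point, in the role of close(): centroids are final only now. */
    List<double[]> centroids() {
        List<double[]> out = new ArrayList<>();
        for (Canopy c : canopies) {
            out.add(c.centroid());
        }
        return out;
    }

    int size() {
        return canopies.size();
    }
}
```

With the 1-D points 0, 1, 10, 11 and T1=3, T2=2 this keeps two canopies; with T1=0.2, T2=0.1 it keeps four, one per point, which is the pattern that leads to trouble on large inputs.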

Jeff


Adil Aijaz wrote:
Jeff,

Thanks for the quick turnaround on this issue. Just tested it and the canopy creation and kmeans both work now on syntheticcontroldata. I get 7 canopies and 7 clusters. The collection logic in close() is not pretty, but I can't think of a workaround myself.

adil

Jeff Eastman wrote:
r783617 removed the CanopyCombiner and refactored its semantics back into the reducer. Updated unit tests pass and Synthetic Control with Canopy produces 6 clusters. Kmeans also runs and produces 6 clusters. I really don't like doing stuff in close() but see no practical alternative. Ideas are still welcome.

Jeff


Jeff Eastman wrote:
Adil Aijaz wrote:
2. There is a bug in examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java that called runJob from the main function with my provided arguments transposed. So, my convergenceDelta was interpreted as t1, t1 as t2, and t2 as convergenceDelta. I will commit a patch as soon as I get approval for open-source commits from my employer; however, I thought I'd put it out there in case someone else is going through the same issue.

r783585 fixed the parameter ordering bug. Still working on the Combiner problem.
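
As an aside, this kind of transposition is easy to miss because all three values are doubles, so the compiler cannot help. One possible guard, sketched here with a hypothetical CanopyKMeansParams holder (this is not what r783585 does and not part of Mahout), is to validate the values where they are parsed. Canopy clustering needs T1 > T2, so the transposition described above fails the check whenever the intended convergenceDelta is not larger than the intended t1.

```java
/** Hypothetical parameter holder; the names are illustrative, not Mahout's API. */
final class CanopyKMeansParams {
    final double t1;
    final double t2;
    final double convergenceDelta;

    private CanopyKMeansParams(double t1, double t2, double convergenceDelta) {
        this.t1 = t1;
        this.t2 = t2;
        this.convergenceDelta = convergenceDelta;
    }

    /** Builds the holder, rejecting combinations that cannot be valid canopy thresholds. */
    static CanopyKMeansParams of(double t1, double t2, double convergenceDelta) {
        if (!(t2 > 0.0 && t1 > t2)) {
            throw new IllegalArgumentException(
                "expected t1 > t2 > 0 but got t1=" + t1 + ", t2=" + t2);
        }
        if (!(convergenceDelta > 0.0)) {
            throw new IllegalArgumentException(
                "expected convergenceDelta > 0 but got " + convergenceDelta);
        }
        return new CanopyKMeansParams(t1, t2, convergenceDelta);
    }
}
```

For example, with illustrative values t1=80, t2=55 and convergenceDelta=0.5, the transposed call of(0.5, 80, 55) is rejected because 0.5 is not greater than 80, while of(80, 55, 0.5) is accepted.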

Thanks Adil,
Jeff






