Good to hear. The current implementation is actually the first one I
did, so it was easy to revert to that model. It does require the mapper
to retain all of the canopies, however, and this could cause an OOME if
the T values are poorly chosen. Doing the centroid calculation in the
combiner removed this difficulty, but the Hadoop semantics change makes
it a non-starter. If there were some globally-unique way to create new
cluster identifiers as they are needed, the centroid calculation could
be moved to the reducer. There would still be a need to combine the
clusters created by each of the mappers...
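
For anyone following along, here is a rough sketch of why the mapper has
to hold every canopy until close(). It is plain Java, not the Mahout
classes: InMemoryCanopySketch, map(), close() and canopyCount() are
hypothetical names, Euclidean distance is assumed, and nothing here
touches the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the mapper; NOT the Mahout CanopyMapper. */
final class InMemoryCanopySketch {
    private final double t1; // loose threshold: points within t1 contribute to a canopy
    private final double t2; // tight threshold (t2 < t1): points within t2 start no new canopy
    // Everything below lives in memory until close(); this is what can grow without bound.
    private final List<double[]> centers = new ArrayList<>();
    private final List<double[]> sums = new ArrayList<>();
    private final List<Integer> counts = new ArrayList<>();

    InMemoryCanopySketch(double t1, double t2) {
        this.t1 = t1;
        this.t2 = t2;
    }

    /** Called once per input point; nothing is emitted here. */
    void map(double[] point) {
        boolean stronglyBound = false;
        for (int i = 0; i < centers.size(); i++) {
            double d = distance(centers.get(i), point);
            if (d < t1) {
                // The point contributes to this canopy's running sum.
                double[] sum = sums.get(i);
                for (int k = 0; k < sum.length; k++) {
                    sum[k] += point[k];
                }
                counts.set(i, counts.get(i) + 1);
            }
            if (d < t2) {
                stronglyBound = true;
            }
        }
        if (!stronglyBound) {
            // No existing canopy is close enough, so this point starts a new one.
            centers.add(point.clone());
            sums.add(point.clone());
            counts.add(1);
        }
    }

    /** Centroids can only be computed once all points have been seen. */
    List<double[]> close() {
        List<double[]> centroids = new ArrayList<>();
        for (int i = 0; i < sums.size(); i++) {
            double[] c = sums.get(i).clone();
            for (int k = 0; k < c.length; k++) {
                c[k] /= counts.get(i);
            }
            centroids.add(c);
        }
        return centroids;
    }

    int canopyCount() {
        return centers.size();
    }

    private static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            s += diff * diff;
        }
        return Math.sqrt(s);
    }
}
```

With t1=3.0 and t2=1.0, the points (0,0), (0.5,0), (10,10), (10.5,10)
end up in two canopies. With t1=0.2 and t2=0.1 every point starts its
own canopy, which is the kind of growth that could exhaust the heap on
a large input split.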
Jeff
Adil Aijaz wrote:
Jeff,
Thanks for the quick turnaround on this issue. Just tested it and the
canopy creation and kmeans both work now on syntheticcontroldata. I
get 7 canopies and 7 clusters. The collection logic in close() is not
pretty, but I can't think of a workaround myself.
adil
Jeff Eastman wrote:
r783617 removed the CanopyCombiner and refactored its semantics back
into the reducer. Updated unit tests pass and Synthetic Control with
Canopy produces 6 clusters. Kmeans also runs and produces 6 clusters.
I really don't like doing stuff in close() but see no practical
alternative. Ideas are still welcomed.
Jeff
Jeff Eastman wrote:
Adil Aijaz wrote:
2. There is a bug in
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
that calls runJob from the main function with my provided arguments
transposed. So, my convergenceDelta was interpreted as t1, t1 as
t2, and t2 as convergenceDelta. I will commit a patch as soon as I
get approval for open-source commits from my employer. In the
meantime, I thought I'd put it out there in case someone else is
running into the same issue.
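
One possible guard against this kind of mix-up, as a sketch only: read
the three values by name instead of by position, and reject t1 <= t2,
since canopy clustering expects T1 > T2. ClusterJobArgs and its
name=value syntax are made up for illustration; this is not how Job.java
parses its arguments, and the numbers in the usage note are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical argument holder; NOT part of Mahout's Job.java. */
final class ClusterJobArgs {
    final double t1;
    final double t2;
    final double convergenceDelta;

    private ClusterJobArgs(double t1, double t2, double convergenceDelta) {
        // A swap of t1 and t2 is caught here; other transpositions may still slip through.
        if (t1 <= t2) {
            throw new IllegalArgumentException(
                "expected t1 > t2 but got t1=" + t1 + ", t2=" + t2);
        }
        this.t1 = t1;
        this.t2 = t2;
        this.convergenceDelta = convergenceDelta;
    }

    /** Parses name=value pairs in any order, e.g. "t1=80". */
    static ClusterJobArgs parse(String... args) {
        Map<String, Double> values = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("expected name=value but got: " + arg);
            }
            values.put(arg.substring(0, eq), Double.parseDouble(arg.substring(eq + 1)));
        }
        return new ClusterJobArgs(
            require(values, "t1"),
            require(values, "t2"),
            require(values, "convergenceDelta"));
    }

    private static double require(Map<String, Double> values, String name) {
        Double v = values.get(name);
        if (v == null) {
            throw new IllegalArgumentException("missing argument: " + name);
        }
        return v;
    }
}
```

For example, ClusterJobArgs.parse("convergenceDelta=0.5", "t2=55",
"t1=80") yields t1=80 and t2=55 regardless of order, while
parse("t1=55", "t2=80", "convergenceDelta=0.5") throws an
IllegalArgumentException.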
r783585 fixed the parameter ordering bug. Still working on the
Combiner problem.
Thanks Adil,
Jeff