Hi All,

I'm back, very relaxed and tan. I was a bit intimidated by the 750+ postings, but I waded through them, and it looks like MAHOUT-479 is the thing for me to focus on in the short term. In that regard, given the recent refactoring of all the clustering data structures around AbstractCluster and the driver changes to unify under AbstractJob, can somebody net out the remaining areas for improvement?

Currently, I see an overlap between Model and Cluster: all Models support Cluster, and all AbstractClusters support observation methods similar to those in Model. Factoring an Observable interface out of Model and making AbstractCluster support it might clean this up, but it bears further discussion. Something like:

// unifies common operations relating to observing posterior data
interface Observable {
  void observe(Vector x, double weight);
  void observe(Vector x);
  void observe(ClusterObservations observations);
  void computeParameters();
  ClusterObservations getObservations();
}

// unifies common attributes needed by ClusterDumper and graphical display routines
interface Cluster {
  int getId();
  Vector getCenter();
  Vector getRadius();
  int getNumPoints();
  String asFormatString(String[] bindings);
  String asJsonString();
}

// specific to Dirichlet-style probabilistic clusters
interface Model extends Writable, Cluster, Observable {
  double pdf(Vector x);
}

// base class for non-Dirichlet clustering implementations
abstract class AbstractCluster implements Writable, Cluster, Observable {}
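
To make the split a bit more concrete, here is a rough, self-contained sketch of how callers might code against the two roles separately. It is not Mahout code: double[] stands in for Vector, the Writable and ClusterObservations parts are left out, and MeanCluster, train and describe are made-up names used only for illustration.

```java
import java.util.Arrays;

// Hypothetical stand-ins for the proposed interfaces: double[] replaces
// Mahout's Vector; Writable and ClusterObservations are omitted.
interface Observable {
  void observe(double[] x, double weight);
  void observe(double[] x);
  void computeParameters();
}

interface Cluster {
  int getId();
  double[] getCenter();
  int getNumPoints();
}

// Toy cluster (illustrative only): accumulates weighted sums and
// recomputes its center as the weighted mean when asked.
class MeanCluster implements Cluster, Observable {
  private final int id;
  private final double[] sum;
  private double[] center;
  private double totalWeight;
  private int numPoints;

  MeanCluster(int id, double[] initialCenter) {
    this.id = id;
    this.center = initialCenter.clone();
    this.sum = new double[initialCenter.length];
  }

  @Override
  public void observe(double[] x, double weight) {
    for (int i = 0; i < sum.length; i++) {
      sum[i] += weight * x[i];
    }
    totalWeight += weight;
    numPoints++;
  }

  @Override
  public void observe(double[] x) {
    observe(x, 1.0);
  }

  @Override
  public void computeParameters() {
    if (totalWeight == 0.0) {
      return; // nothing observed yet, so keep the current center
    }
    double[] updated = new double[sum.length];
    for (int i = 0; i < sum.length; i++) {
      updated[i] = sum[i] / totalWeight;
    }
    center = updated;
  }

  @Override public int getId() { return id; }
  @Override public double[] getCenter() { return center.clone(); }
  @Override public int getNumPoints() { return numPoints; }
}

class ObservableSketch {
  // Training code only needs the Observable role.
  static void train(Observable target, double[][] points) {
    for (double[] p : points) {
      target.observe(p);
    }
    target.computeParameters();
  }

  // Dumper/display code only needs the Cluster role.
  static String describe(Cluster c) {
    return "C-" + c.getId() + "{n=" + c.getNumPoints()
        + " c=" + Arrays.toString(c.getCenter()) + "}";
  }

  public static void main(String[] args) {
    MeanCluster cluster = new MeanCluster(0, new double[] {0.0, 0.0});
    train(cluster, new double[][] {{1.0, 2.0}, {3.0, 4.0}});
    System.out.println(describe(cluster));
  }
}
```

The point of the sketch is only that train never touches Cluster and describe never touches Observable, so a Model and an AbstractCluster could be passed to either one interchangeably.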

This refactoring would offer some improvement IMHO, but I'm not clear whether it would improve plug-n-playness with Classification. Is there a more theoretical "DistanceMeasure"-like discussion that would be germane?

Jeff
