Clustering Issues

Jeff Eastman Mon, 16 Aug 2010 15:49:05 -0700

Hi All,

I'm back, very relaxed and tan, and was a bit intimidated by the 750+ postings, but I waded through them and it looks like MAHOUT-479 is the thing for me to focus on in the short term. In that regard, and with the recent refactoring of all the clustering data structures around AbstractCluster, plus the driver changes to unify under AbstractJob, can somebody net out the remaining areas for improvement?

Currently, I see an overlap between Model and Cluster, which is evident when all Models support Cluster and all AbstractClusters support similar observation methods to those in Model. It might be that factoring out an Observable interface from Model and making AbstractCluster support it would clean this up, but it bears further discussion. Something like:


// unifies common operations relating to observing posterior data
interface Observable {
  void observe(Vector x, double weight);
  void observe(Vector x);
  void observe(ClusterObservations observations);
  void computeParameters();
  ClusterObservations getObservations();
}

// unifies common attributes needed by ClusterDumper and graphical display routines
interface Cluster {
  int getId();
  Vector getCenter();
  Vector getRadius();
  int getNumPoints();
  String asFormatString(String[] bindings);
  String asJsonString();
}

// specific to Dirichlet-style probabilistic clusters
interface Model extends Writable, Cluster, Observable {
  double pdf(Vector x);
}

// base class for non-Dirichlet clustering implementations
abstract class AbstractCluster implements Writable, Cluster, Observable {}

This would offer some improvement IMHO, but I'm not clear if this would improve plug-n-playness with Classification or not. Is there a more theoretical "DistanceMeasure"-like discussion that would be germane?
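
To make the proposed split a bit more concrete, here is a minimal, dependency-free sketch of how it might look in practice. It rests on several assumptions: a plain double[] stands in for Vector, Writable and ClusterObservations are left out entirely, and MeanCluster and ObservableSketchDemo are hypothetical names for illustration, not Mahout classes.

```java
// Stand-ins (assumptions): double[] replaces Mahout's Vector; Writable and
// ClusterObservations are omitted to keep the sketch dependency-free.

// Common operations for accumulating observed points.
interface Observable {
  void observe(double[] x, double weight);
  void observe(double[] x);
  void computeParameters();
}

// Common read-only attributes a dumper or display routine would need.
interface Cluster {
  int getId();
  double[] getCenter();
  int getNumPoints();
}

// Probabilistic clusters add a density on top of both interfaces.
interface Model extends Cluster, Observable {
  double pdf(double[] x);
}

// Hypothetical centroid-style cluster (not a Mahout class).
class MeanCluster implements Cluster, Observable {
  private final int id;
  private final double[] sum;   // weighted sum of observed points
  private double[] center;
  private double totalWeight;
  private int numPoints;

  MeanCluster(int id, int dimensions) {
    this.id = id;
    this.sum = new double[dimensions];
    this.center = new double[dimensions];
  }

  @Override
  public void observe(double[] x, double weight) {
    for (int i = 0; i < sum.length; i++) {
      sum[i] += weight * x[i];
    }
    totalWeight += weight;
    numPoints++;
  }

  @Override
  public void observe(double[] x) {
    observe(x, 1.0);
  }

  @Override
  public void computeParameters() {
    if (totalWeight == 0.0) {
      return; // nothing observed yet; keep the current center
    }
    double[] updated = new double[sum.length];
    for (int i = 0; i < sum.length; i++) {
      updated[i] = sum[i] / totalWeight;
    }
    center = updated;
  }

  @Override public int getId() { return id; }
  @Override public double[] getCenter() { return center.clone(); }
  @Override public int getNumPoints() { return numPoints; }
}

class ObservableSketchDemo {
  public static void main(String[] args) {
    MeanCluster c = new MeanCluster(7, 2);
    c.observe(new double[] {1.0, 2.0});        // weight 1.0
    c.observe(new double[] {3.0, 6.0}, 3.0);   // weight 3.0
    c.computeParameters();
    // Code written against Cluster needs no knowledge of MeanCluster.
    Cluster view = c;
    System.out.println(view.getId() + " " + java.util.Arrays.toString(view.getCenter()));
    // prints "7 [2.5, 5.0]"
  }
}
```

The point of the sketch is only that a non-probabilistic cluster can satisfy Cluster and Observable without knowing anything about pdf(), while a Model picks up both interfaces and adds the density on top.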


Jeff
