OK, so if a new ClusterClassifier could read a set of Clusters from
clusters-i storage into memory and do what is now being done in each
driver/clusterer, then this would facilitate the integration? It's just
a different slice of the current clusteredPoints output: a vector of
probabilities for the given input vector and Clusters. It sounds like
this would not need a command-line interface at all, just a Path
reference to the clusters to be read in at initialization time.
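Something like this minimal sketch is what I have in mind. All the names here (ClusterClassifier, the toy pdf) are hypothetical stand-ins, not existing Mahout code; a real version would deserialize Clusters from the clusters-i Path and delegate to each cluster's own model:

```java
import java.util.List;

// Hypothetical sketch: a classifier initialized from a set of in-memory
// clusters that returns a vector of membership probabilities for an input.
public class ClusterClassifier {

  // Stand-in for a deserialized Cluster: just a center here.
  static class Cluster {
    final double[] center;
    Cluster(double... center) { this.center = center; }
    // Toy pdf: inverse distance to the center (real clusters would use
    // their own model's pdf).
    double pdf(double[] x) {
      double d = 0;
      for (int i = 0; i < x.length; i++) {
        double diff = x[i] - center[i];
        d += diff * diff;
      }
      return 1.0 / (1.0 + Math.sqrt(d));
    }
  }

  private final List<Cluster> clusters;

  // In the real thing this would read the clusters-i Path at init time.
  ClusterClassifier(List<Cluster> clusters) { this.clusters = clusters; }

  // Returns one probability per cluster, normalized to sum to 1.
  double[] classify(double[] x) {
    double[] p = new double[clusters.size()];
    double total = 0;
    for (int i = 0; i < p.length; i++) {
      p[i] = clusters.get(i).pdf(x);
      total += p[i];
    }
    for (int i = 0; i < p.length; i++) {
      p[i] /= total;
    }
    return p;
  }
}
```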
Dirichlet models already implement pdf(Vector) methods which support
classification from their persistent state. The other types of Cluster
cannot since their persistent state does not include the DistanceMeasure
they used. Some further refactoring of Cluster, AbstractCluster and
Model along the lines I discussed earlier might make all this come
together. I think the current set of Dirichlet models needs to be
cleaned up anyway; AbstractClusters look way too much like
AsymmetricSampledNormalDistributions to ignore the redundancy. I will
noodle on this some more and see where it takes me.
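To illustrate the persistence gap (these are simplified stand-ins, not the real Mahout types): a Dirichlet-style model is self-contained and can compute pdf(Vector) directly from persistent state, while a distance-based cluster only persists its center and cannot classify until its DistanceMeasure is re-injected:

```java
// Illustrative sketch of the persistence gap; simplified stand-ins,
// not the actual Mahout interfaces.
interface DistanceMeasure {
  double distance(double[] a, double[] b);
}

// A Dirichlet-style model carries everything it needs: pdf works
// directly from persistent state.
class NormalModel {
  final double[] mean;
  final double stdDev;
  NormalModel(double[] mean, double stdDev) {
    this.mean = mean;
    this.stdDev = stdDev;
  }
  double pdf(double[] x) {
    double d2 = 0;
    for (int i = 0; i < x.length; i++) {
      double t = x[i] - mean[i];
      d2 += t * t;
    }
    return Math.exp(-d2 / (2 * stdDev * stdDev));
  }
}

// A distance-based cluster persists only its center; its measure must
// be re-injected before pdf can be computed.
class DistanceCluster {
  final double[] center;
  private DistanceMeasure measure; // not part of persistent state
  DistanceCluster(double[] center) { this.center = center; }
  void setMeasure(DistanceMeasure m) { this.measure = m; }
  double pdf(double[] x) {
    if (measure == null) {
      throw new IllegalStateException("DistanceMeasure was not persisted");
    }
    return 1.0 / (1.0 + measure.distance(center, x));
  }
}
```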
I get the real-time requirement.
On 8/17/10 2:30 PM, Ted Dunning wrote:
I think it is important to be able to load up a classification object
that implements something like AbstractVectorClassifier.
The use case I have in mind is real-time classification. Here, we would
need to accept input, convert it to vector form, and get a
classification output from a model for a single input at a time,
typically inside some kind of web service.
The model could come from supervised learning (classifier) or unsupervised
learning (clusterer).
Clusters are commonly used as features for classifiers. Classifiers trained
on some external result are also used as features. Thus we need to be able
to load several models, evaluate some on the raw input and then evaluate
others on the outputs of the first ones as well as the rest of the feature
vector.
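A sketch of that stacking, with made-up names and a trivial first-stage model: evaluate the cluster model on the raw vector, append its membership probabilities to the feature vector, and feed the augmented vector to the downstream classifier:

```java
// Toy sketch of model stacking: cluster memberships become extra
// features for a downstream classifier. All names are illustrative.
class ModelStack {

  // First-stage model: raw input -> cluster membership probabilities.
  static double[] clusterFeatures(double[] x) {
    // Pretend there are two clusters at 0 and 10 on the first coordinate.
    double a = 1.0 / (1.0 + Math.abs(x[0]));
    double b = 1.0 / (1.0 + Math.abs(x[0] - 10.0));
    double z = a + b;
    return new double[] { a / z, b / z };
  }

  // Append first-stage outputs to the raw feature vector; the result is
  // what the second-stage classifier would consume.
  static double[] augment(double[] x) {
    double[] memberships = clusterFeatures(x);
    double[] out = new double[x.length + memberships.length];
    System.arraycopy(x, 0, out, 0, x.length);
    System.arraycopy(memberships, 0, out, x.length, memberships.length);
    return out;
  }
}
```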
Model learning and clustering are typically done off-line and the current
very shiny and new command-line interface is probably fine for that.
Model deployment is another matter and there a real-time capability is a
must.
On Tue, Aug 17, 2010 at 2:06 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:
The clusterData() process for most algorithms produces a single,
most-likely cluster assignment, usually the closest cluster. For Dirichlet
and FuzzyK, the clustering can be specified to use the most-likely
assignment (the default) or a pdf threshold can be specified above which
multiple cluster assignments will be output. All clusterData() processes
produce WeightedVectorWritable objects in persistent storage which contain a
probability weight and the input vector. These sequence files are keyed by
the clusterId and are output to the clusteredPoints directory.
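The assignment policy amounts to this (a sketch, not the actual driver code): with no threshold, emit only the most-likely cluster; with a threshold, emit every cluster whose pdf exceeds it:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the clusterData() assignment policy described above:
// most-likely cluster by default, or all clusters above a pdf threshold.
class Assignment {

  // Returns the indices of the clusters the point is assigned to.
  // pdfs[i] is the membership probability for cluster i; a negative
  // threshold means "most-likely only".
  static List<Integer> assign(double[] pdfs, double threshold) {
    List<Integer> out = new ArrayList<>();
    if (threshold < 0) {
      int best = 0;
      for (int i = 1; i < pdfs.length; i++) {
        if (pdfs[i] > pdfs[best]) {
          best = i;
        }
      }
      out.add(best);
    } else {
      for (int i = 0; i < pdfs.length; i++) {
        if (pdfs[i] > threshold) {
          out.add(i);
        }
      }
    }
    return out;
  }
}
```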
The buildClusters() step is always run from the command line but the
clusterData step is optional (-cl flag). It would be straightforward to
support the other use case (clusterData only). Users who instantiate the
drivers from Java code can call either/both at their discretion now.
I've also implemented an execution method (-xm) parameter on all clustering
drivers which allows the sequential, in-memory reference implementation to
be invoked from the command line using the same arguments as the mapreduce
implementation. The display examples use these now, except Dirichlet,
which I didn't get to before I left.
Given this information, what do you now see as logical next steps?