OK, so if a new ClusterClassifier could read a set of Clusters from
clusters-i storage into memory and do what is now being done in each
driver/clusterer, then this would facilitate the integration? It's just
a different slice of the current clusteredPoints output: a vector of
probabilities for the given input vector and Clusters. It sounds like
this would not need a command-line interface at all, just a Path
reference to the clusters to be read in at initialization time.
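Something like this minimal sketch is what I have in mind. All the names here (ClusterClassifier, the toy pdf) are hypothetical stand-ins, not existing Mahout code; a real version would deserialize Clusters from the clusters-i Path and delegate to each cluster's own model:

```java
import java.util.List;

// Hypothetical sketch: a classifier initialized from a set of in-memory
// clusters that returns a vector of membership probabilities for an input.
public class ClusterClassifier {

  // Stand-in for a deserialized Cluster: just a center here.
  static class Cluster {
    final double[] center;
    Cluster(double... center) { this.center = center; }
    // Toy pdf: inverse distance to the center (real clusters would use
    // their own model's pdf).
    double pdf(double[] x) {
      double d = 0;
      for (int i = 0; i < x.length; i++) {
        double diff = x[i] - center[i];
        d += diff * diff;
      }
      return 1.0 / (1.0 + Math.sqrt(d));
    }
  }

  private final List<Cluster> clusters;

  // In the real thing this would read the clusters-i Path at init time.
  ClusterClassifier(List<Cluster> clusters) { this.clusters = clusters; }

  // Returns one probability per cluster, normalized to sum to 1.
  double[] classify(double[] x) {
    double[] p = new double[clusters.size()];
    double total = 0;
    for (int i = 0; i < p.length; i++) {
      p[i] = clusters.get(i).pdf(x);
      total += p[i];
    }
    for (int i = 0; i < p.length; i++) {
      p[i] /= total;
    }
    return p;
  }
}
```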
Dirichlet models already implement pdf(Vector) methods which support
classification from their persistent state. The other types of Cluster
cannot since their persistent state does not include the DistanceMeasure
they used. Some further refactoring of Cluster, AbstractCluster and
Model along the lines I discussed earlier might make all this come
together. I think the current set of Dirichlet models needs to be
cleaned up anyway; AbstractClusters look way too much like
AsymmetricSampledNormalDistributions to ignore the redundancy. I will
noodle on this some more and see where it takes me.
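To illustrate the persistence gap (these are simplified stand-ins, not the real Mahout types): a Dirichlet-style model is self-contained and can compute pdf(Vector) directly from persistent state, while a distance-based cluster only persists its center and cannot classify until its DistanceMeasure is re-injected:

```java
// Illustrative sketch of the persistence gap; simplified stand-ins,
// not the actual Mahout interfaces.
interface DistanceMeasure {
  double distance(double[] a, double[] b);
}

// A Dirichlet-style model carries everything it needs: pdf works
// directly from persistent state.
class NormalModel {
  final double[] mean;
  final double stdDev;
  NormalModel(double[] mean, double stdDev) {
    this.mean = mean;
    this.stdDev = stdDev;
  }
  double pdf(double[] x) {
    double d2 = 0;
    for (int i = 0; i < x.length; i++) {
      double t = x[i] - mean[i];
      d2 += t * t;
    }
    return Math.exp(-d2 / (2 * stdDev * stdDev));
  }
}

// A distance-based cluster persists only its center; its measure must
// be re-injected before pdf can be computed.
class DistanceCluster {
  final double[] center;
  private DistanceMeasure measure; // not part of persistent state
  DistanceCluster(double[] center) { this.center = center; }
  void setMeasure(DistanceMeasure m) { this.measure = m; }
  double pdf(double[] x) {
    if (measure == null) {
      throw new IllegalStateException("DistanceMeasure was not persisted");
    }
    return 1.0 / (1.0 + measure.distance(center, x));
  }
}
```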
I get the real-time requirement.
On 8/17/10 2:30 PM, Ted Dunning wrote:
I think it is important to be able to load up a classification object
that implements something like AbstractVectorClassifier.
The use case I have in mind is real-time classification. Here, we would
need to accept input, convert it to vector form, and get a
classification output from a model for a single input at a time,
typically inside some kind of web service.
The model could come from supervised learning (classifier) or unsupervised
learning (clusterer).
Clusters are commonly used as features for classifiers. Classifiers trained
on some external result are also used as features. Thus we need to be able
to load several models, evaluate some on the raw input and then evaluate
others on the outputs of the first ones as well as the rest of the feature
vector.
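A sketch of that stacking, with made-up names and a trivial first-stage model: evaluate the cluster model on the raw vector, append its membership probabilities to the feature vector, and feed the augmented vector to the downstream classifier:

```java
// Toy sketch of model stacking: cluster memberships become extra
// features for a downstream classifier. All names are illustrative.
class ModelStack {

  // First-stage model: raw input -> cluster membership probabilities.
  static double[] clusterFeatures(double[] x) {
    // Pretend there are two clusters at 0 and 10 on the first coordinate.
    double a = 1.0 / (1.0 + Math.abs(x[0]));
    double b = 1.0 / (1.0 + Math.abs(x[0] - 10.0));
    double z = a + b;
    return new double[] { a / z, b / z };
  }

  // Append first-stage outputs to the raw feature vector; the result is
  // what the second-stage classifier would consume.
  static double[] augment(double[] x) {
    double[] memberships = clusterFeatures(x);
    double[] out = new double[x.length + memberships.length];
    System.arraycopy(x, 0, out, 0, x.length);
    System.arraycopy(memberships, 0, out, x.length, memberships.length);
    return out;
  }
}
```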
Model learning and clustering are typically done off-line and the current
very shiny and new command-line interface is probably fine for that.
Model deployment is another matter and there a real-time capability is a
must.
On Tue, Aug 17, 2010 at 2:06 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:
The clusterData() process for most algorithms produces a single,
most-likely cluster assignment, usually the closest cluster. For Dirichlet
and FuzzyK, the clustering can be specified to use the most-likely
assignment (the default) or a pdf threshold can be specified above which
multiple cluster assignments will be output. All clusterData() processes
produce WeightedVectorWritable objects in persistent storage which contain a
probability weight and the input vector. These sequence files are keyed by
the clusterId and are output to the clusteredPoints directory.
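The assignment policy amounts to this (a sketch, not the actual driver code): with no threshold, emit only the most-likely cluster; with a threshold, emit every cluster whose pdf exceeds it:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the clusterData() assignment policy described above:
// most-likely cluster by default, or all clusters above a pdf threshold.
class Assignment {

  // Returns the indices of the clusters the point is assigned to.
  // pdfs[i] is the membership probability for cluster i; a negative
  // threshold means "most-likely only".
  static List<Integer> assign(double[] pdfs, double threshold) {
    List<Integer> out = new ArrayList<>();
    if (threshold < 0) {
      int best = 0;
      for (int i = 1; i < pdfs.length; i++) {
        if (pdfs[i] > pdfs[best]) {
          best = i;
        }
      }
      out.add(best);
    } else {
      for (int i = 0; i < pdfs.length; i++) {
        if (pdfs[i] > threshold) {
          out.add(i);
        }
      }
    }
    return out;
  }
}
```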
The buildClusters() step is always run from the command line but the
clusterData step is optional (-cl flag). It would be straightforward to
support the other use case (clusterData only). Users who instantiate the
drivers from Java code can call either/both at their discretion now.
I've also implemented an execution method (-xm) parameter on all clustering
drivers which allows the sequential, in-memory reference implementation to
be invoked from the command line using the same arguments as the mapreduce
implementation. The display examples use these now, except Dirichlet,
which I didn't get to before I left.
Given this information, what do you now see as logical next steps?