Hi Ted,
I've made significant progress on both of these issues in commits late
last month. At the command level, all the clustering drivers now inherit
from AbstractJob and have their common parameters factored into the
DefaultOptionCreator for API consistency. All drivers also support a
buildClusters() method which processes input vectors to produce their
respective Cluster models in persistent storage (clusters-i directory),
plus a clusterData() method that reads those models and performs actual
clustering of the input vectors. The sequence files in the clusters-i
directories can be read uniformly by the ClusterDumper and other
utilities as they all support the Cluster interface.
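In rough outline, the two-step driver contract described above looks something like the following sketch. The method names buildClusters() and clusterData() come from the text; the class name, signatures, and everything else are illustrative, not the actual Mahout driver API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the real Mahout classes) of the shared driver
// shape: buildClusters() learns the models, clusterData() applies them.
class ClusteringDriverSketch {
    final List<String> steps = new ArrayList<>();

    // Step 1: read input vectors and write Cluster models to the
    // clusters-i directories in persistent storage.
    void buildClusters(String input, String output) {
        steps.add("buildClusters");
    }

    // Step 2: read the Cluster models back, assign each input vector, and
    // write WeightedVectorWritables to the clusteredPoints directory.
    void clusterData(String input, String clustersDir, String output) {
        steps.add("clusterData");
    }

    // Callers from Java code can run either step, or both in sequence.
    void runBoth(String input, String output) {
        buildClusters(input, output);
        clusterData(input, output, output);
    }
}
```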
The clusterData() process for most algorithms produces a single,
most-likely cluster assignment, usually the closest cluster. For
Dirichlet and FuzzyK, the clustering can either use the most-likely
assignment (the default) or emit multiple cluster assignments above a
specified pdf threshold. All
clusterData() processes produce WeightedVectorWritable objects in
persistent storage which contain a probability weight and the input
vector. These sequence files are keyed by the clusterId and are output
to the clusteredPoints directory.
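The two assignment modes can be sketched in a few lines of plain Java. This is a conceptual illustration of the logic only, not the actual Mahout code: given the per-cluster pdf values for one input vector, emit either the single most-likely cluster or every cluster above the threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of clusterData()'s assignment choice (not the actual
// Mahout implementation): most-likely assignment vs. pdf-threshold mode.
class AssignmentSketch {

    // Returns the indices of the clusters the vector is assigned to.
    // threshold == null selects the default, most-likely-only behavior.
    static List<Integer> assign(double[] pdfs, Double threshold) {
        List<Integer> assigned = new ArrayList<>();
        if (threshold == null) {
            // Default: single most-likely assignment (argmax of the pdfs).
            int best = 0;
            for (int i = 1; i < pdfs.length; i++) {
                if (pdfs[i] > pdfs[best]) {
                    best = i;
                }
            }
            assigned.add(best);
        } else {
            // Threshold mode: every cluster whose pdf exceeds the threshold.
            for (int i = 0; i < pdfs.length; i++) {
                if (pdfs[i] > threshold) {
                    assigned.add(i);
                }
            }
        }
        return assigned;
    }
}
```

In either mode each emitted assignment would carry its probability weight along with the input vector, which is what the WeightedVectorWritable records.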
The buildClusters() step is always run from the command line, but the
clusterData() step is optional (-cl flag). It would be straightforward
to support the other use case (clusterData() only). Users who
instantiate the drivers from Java code can now call either or both at
their discretion.
I've also implemented an execution method (-xm) parameter on all
clustering drivers which allows the sequential, in-memory reference
implementation to be invoked from the command line using the same
arguments as the mapreduce implementation. The display examples use
these now, except Dirichlet which I didn't get to before I left.
Given this information, what do you now see as logical next steps?
Jeff
On 8/17/10 9:31 AM, Ted Dunning wrote:
Jeff,
You asked about clustering things to do.
In my mind, there are two clustering issues. One is unification at the
command level where clusters are learned. The other is unification in
subsequent steps where somebody might want to use a clustering. The second
issue actually seems a bit more pressing to me.
That second issue concerns the ability to have a model that is the output of
the clustering. That model should support:
- reading the model from persistent storage
- classifying new vectors to get either a single best-fit cluster or a score
vector.
In my view, this should apply equally to all classifiers and the models
produced by classifier learning algorithms should be the same at the
interface level as the models produced by cluster learning algorithms.
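A minimal sketch of what such a unified model might look like at the interface level, with a toy nearest-centroid implementation to show both classification styles. All names here are hypothetical, not an existing Mahout API; a real version would also support reading the model from persistent storage.

```java
// Hypothetical sketch of a model interface shared by classifier learning
// and cluster learning algorithms (names are illustrative only).
interface ScoringModel {
    int classify(double[] vector);    // single best-fit cluster/category
    double[] score(double[] vector);  // one score per cluster/category
}

// Toy implementation: score by proximity to fixed centroids.
class NearestCentroidModel implements ScoringModel {
    private final double[][] centroids;

    NearestCentroidModel(double[][] centroids) {
        this.centroids = centroids;
    }

    @Override
    public double[] score(double[] vector) {
        // Score each cluster by negated squared distance (higher is better).
        double[] scores = new double[centroids.length];
        for (int c = 0; c < centroids.length; c++) {
            double d2 = 0;
            for (int i = 0; i < vector.length; i++) {
                double diff = vector[i] - centroids[c][i];
                d2 += diff * diff;
            }
            scores[c] = -d2;
        }
        return scores;
    }

    @Override
    public int classify(double[] vector) {
        // Best-fit cluster is the argmax of the score vector.
        double[] scores = score(vector);
        int best = 0;
        for (int c = 1; c < scores.length; c++) {
            if (scores[c] > scores[best]) {
                best = c;
            }
        }
        return best;
    }
}
```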
On Tue, Aug 17, 2010 at 9:26 AM, Ted Dunning (JIRA) <j...@apache.org> wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899452#action_12899452]
Ted Dunning commented on MAHOUT-479:
------------------------------------
I just moved the encoding objects associated with MAHOUT-228 to
org.apache.mahout.vectors to provide a nucleus for feature encoding.
There are also a fair number of things in oam.text and oam.utils that are
related. Since those are in the utils module, however, I couldn't leverage
them. We may want to consider moving some of them to core to allow wider
use.
Streamline classification/clustering data structures
-----------------------------------------------------
Key: MAHOUT-479
URL: https://issues.apache.org/jira/browse/MAHOUT-479
Project: Mahout
Issue Type: Improvement
Components: Classification, Clustering
Affects Versions: 0.1, 0.2, 0.3, 0.4
Reporter: Isabel Drost
Opening this JIRA issue to collect ideas on how to streamline our
classification and clustering algorithms to make integration for users
easier as per mailing list thread
http://markmail.org/message/pnzvrqpv5226twfs
{quote}
Jake and Robin and I were talking the other evening and a common lament
was that our classification (and clustering) stuff was all over the map in
terms of data structures. Putting that to rest and getting those components
even vaguely as plug-and-play as our much more advanced recommendation
components would be very, very helpful.
{quote}
This issue probably also relates to MAHOUT-287 (the intention there is to
make naive bayes run on vectors as input).
Ted, Jake, Robin: Would be great if someone of you could add a comment on
some of the issues you discussed "the other evening" and (if applicable) any
minor or major changes you think could help solve this issue.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.