[ 
https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020727#comment-13020727
 ] 

Jeff Eastman commented on MAHOUT-479:
-------------------------------------

Here's an interesting piece of code. It nets out the clustering process in
terms of the ClusterClassifier, an AbstractVectorClassifier that implements
OnlineLearner and Writable. The policy for kmeans just returns the index of
the max pdf element, and the policy for dirichlet returns the multinomial of
the mixture times the pdf. The really exciting thing to me is that, for
clustering, we classify and then train, whereas for classification the order
is reversed. Very symmetric and, uh, now rather obvious.

{code}
  private ClusteringPolicy policy;

  /**
   * Iterate over data using a prior-trained classifier, for a number of iterations
   * @param data a List<Vector> of input vectors
   * @param prior the prior-trained ClusterClassifier
   * @param numIterations the int number of iterations to perform
   * @return the posterior ClusterClassifier
   */
  public ClusterClassifier iterate(List<Vector> data, ClusterClassifier prior, int numIterations) {
    for (int iteration = 1; iteration <= numIterations; iteration++) {
      for (Vector vector : data) {
        // classification yields probabilities
        Vector pdfs = prior.classify(vector);
        // policy selects a model given those probabilities
        int selected = policy.select(pdfs);
        // training causes all models to observe data
        prior.train(selected, vector);
      }
      // compute the posterior models
      prior.close();
      // update the policy
      policy.update(prior);
    }
    return prior;
  }
{code}
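
To make the two policies described above concrete, here is a minimal sketch of what each select step might look like. It is not the Mahout implementation: the Policy interface, the class names, and the use of double[] in place of Mahout's Vector are all assumptions made so the example stands alone, and the dirichlet variant assumes "the multinomial of the mixture times the pdf" means drawing one index with probability proportional to mixture[i] * pdfs[i].

```java
import java.util.Random;

public class PolicySketch {

  /** Hypothetical stand-in for ClusteringPolicy; uses double[] rather than Mahout's Vector. */
  public interface Policy {
    int select(double[] pdfs);
  }

  /** k-means style: select the index of the largest pdf element. */
  public static class MaxPdfPolicy implements Policy {
    @Override
    public int select(double[] pdfs) {
      int best = 0;
      for (int i = 1; i < pdfs.length; i++) {
        if (pdfs[i] > pdfs[best]) {
          best = i;
        }
      }
      return best;
    }
  }

  /** Dirichlet style (assumed): sample an index weighted by mixture[i] * pdfs[i]. */
  public static class MixtureSamplingPolicy implements Policy {
    private final double[] mixture;
    private final Random random;

    public MixtureSamplingPolicy(double[] mixture, Random random) {
      this.mixture = mixture;
      this.random = random;
    }

    @Override
    public int select(double[] pdfs) {
      // unnormalized multinomial weights: mixture times pdf, element by element
      double[] weights = new double[pdfs.length];
      double total = 0.0;
      for (int i = 0; i < pdfs.length; i++) {
        weights[i] = mixture[i] * pdfs[i];
        total += weights[i];
      }
      // draw one index by walking the cumulative weights
      double threshold = random.nextDouble() * total;
      double cumulative = 0.0;
      for (int i = 0; i < weights.length; i++) {
        cumulative += weights[i];
        if (threshold < cumulative) {
          return i;
        }
      }
      // fallback for rounding error or all-zero weights
      return pdfs.length - 1;
    }
  }
}
```

Either policy could be dropped into the loop above in place of policy.select(pdfs). The update step is left out here because the comment does not say what state each policy carries between iterations.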

> Streamline classification/ clustering data structures
> -----------------------------------------------------
>
>                 Key: MAHOUT-479
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-479
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.1, 0.2, 0.3, 0.4
>            Reporter: Isabel Drost
>            Assignee: Isabel Drost
>
> Opening this JIRA issue to collect ideas on how to streamline our 
> classification and clustering algorithms to make integration for users easier 
> as per mailing list thread http://markmail.org/message/pnzvrqpv5226twfs
> {quote}
> Jake and Robin and I were talking the other evening and a common lament was 
> that our classification (and clustering) stuff was all over the map in terms 
> of data structures.  Driving that to rest and getting those comments even 
> vaguely as plug and play as our much more advanced recommendation components 
> would be very, very helpful.
> {quote}
> This issue probably also relates to MAHOUT-287 (intention there is to make 
> naive bayes run on vectors as input).
> Ted, Jake, Robin: Would be great if one of you could add a comment on 
> some of the issues you discussed "the other evening" and (if applicable) any 
> minor or major changes you think could help solve this issue.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
