Andrew Palumbo created MAHOUT-1564:
--------------------------------------

             Summary: Naive Bayes Classifier for New Text Documents
                 Key: MAHOUT-1564
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1564
             Project: Mahout
          Issue Type: Improvement
    Affects Versions: 0.9
            Reporter: Andrew Palumbo
             Fix For: 1.0


The MapReduce Naive Bayes implementation currently cannot classify a new
document (one outside the training/holdout corpus).  I've begun work on a
"ClassifyNew" job that will do the following:

1. Vectorize a new text document using the dictionary and document frequencies
from the training/holdout corpus.
    - Assuming the original corpus was vectorized using `seq2sparse`, step
        (1) will use all of the same parameters.
2. Score and label the new document using a previously trained model.
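
To make the two steps concrete, here is a minimal plain-Java sketch. It is not the planned job and uses no Mahout classes: the class name `ClassifyNewSketch`, both method names, the in-memory maps and arrays, and the `tf * ln(N / df)` weighting are all assumptions made for illustration. The weighting that `seq2sparse` actually applies, and the way a trained model stores its weights, may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.List;

/** Hypothetical illustration only; none of these names are Mahout APIs. */
public class ClassifyNewSketch {

    /**
     * Step 1: build a sparse TF-IDF vector (term index -> weight) for a new
     * document using only the training corpus dictionary and document
     * frequencies. Terms that are not in the dictionary are dropped.
     */
    public static Map<Integer, Double> vectorize(List<String> tokens,
                                                 Map<String, Integer> dictionary,
                                                 Map<Integer, Long> docFreq,
                                                 long numDocs) {
        // Count term frequencies for tokens the dictionary knows about.
        Map<Integer, Double> tf = new HashMap<>();
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index != null) {
                tf.merge(index, 1.0, Double::sum);
            }
        }
        // Assumed weighting: tf * ln(numDocs / df). This is a textbook form,
        // not necessarily the formula seq2sparse uses.
        Map<Integer, Double> tfidf = new HashMap<>();
        for (Map.Entry<Integer, Double> e : tf.entrySet()) {
            long df = docFreq.getOrDefault(e.getKey(), 1L);
            tfidf.put(e.getKey(), e.getValue() * Math.log((double) numDocs / df));
        }
        return tfidf;
    }

    /**
     * Step 2: score the vector against every label and return the index of
     * the best one. logWeights[label][termIndex] is an assumed layout holding
     * the log-weight of a term given a label; logPriors[label] is the log prior.
     */
    public static int classify(Map<Integer, Double> vector,
                               double[][] logWeights,
                               double[] logPriors) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int label = 0; label < logWeights.length; label++) {
            double score = logPriors[label];
            for (Map.Entry<Integer, Double> e : vector.entrySet()) {
                score += e.getValue() * logWeights[label][e.getKey()];
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Both methods are pure functions with no MapReduce dependency, so a mapper could call `vectorize` and then `classify` once per input document after loading the dictionary, document frequencies, and model during setup.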

I think it will be a useful addition to the NB package.  Unfortunately,
this is going to be mostly MR workhorse code and won't introduce much
new logic. I will try to keep any new logic separate from the MR code so that
it can be used by MAHOUT-1493.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
