[jira] [Updated] (MAHOUT-1564) Naive Bayes Classifier for New Text Documents

Andrew Palumbo (JIRA) Fri, 19 Dec 2014 14:20:46 -0800

     [ https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andrew Palumbo updated MAHOUT-1564:
-----------------------------------
    Description: 
The MapReduce and DSL Naive Bayes implementations currently lack the ability to 
classify a new document (one outside the training/holdout corpus).  This new 
feature will do the following:

1. Vectorize a new text document using the dictionary and document frequencies 
from the training/holdout corpus 
    - assume the original corpus was vectorized using `seq2sparse`; step (1) 
will use all of the same parameters. 

2. Score and label a new document using a previously trained model.
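
As a rough illustration of the two steps above, here is a minimal Python sketch. 
It is not Mahout code: the names `vectorize` and `classify`, the toy dictionary, 
the document frequencies, and the model parameters are all hypothetical, and the 
weighting is a plain tf * log(N/df) with a textbook multinomial Naive Bayes score, 
which may differ from what `seq2sparse` and the Mahout trainers actually compute.

```python
import math
from collections import Counter

def vectorize(text, dictionary, doc_freq, num_docs):
    """Step 1: map a new document to a sparse TF-IDF vector {term_index: weight}.

    Only terms already in the training dictionary are kept, because the
    trained model has no weights for unseen terms.
    """
    counts = Counter(tok for tok in text.lower().split() if tok in dictionary)
    vector = {}
    for term, tf in counts.items():
        idx = dictionary[term]
        # Assumed weighting: tf * log(N / df); every dictionary term has df >= 1.
        vector[idx] = tf * math.log(num_docs / doc_freq[idx])
    return vector

def classify(vector, log_prior, log_likelihood):
    """Step 2: score each label and return (best_label, scores)."""
    scores = {}
    for label, prior in log_prior.items():
        weights = log_likelihood[label]
        scores[label] = prior + sum(w * weights[idx] for idx, w in vector.items())
    return max(scores, key=scores.get), scores

# Hypothetical artifacts saved from the training/holdout corpus.
dictionary = {"goal": 0, "match": 1, "election": 2, "vote": 3}
doc_freq = {0: 2, 1: 2, 2: 2, 3: 2}
num_docs = 4

# Hypothetical trained model: log priors and per-term log likelihoods.
log_prior = {"sports": math.log(0.5), "politics": math.log(0.5)}
log_likelihood = {
    "sports": [math.log(p) for p in (0.4, 0.4, 0.1, 0.1)],
    "politics": [math.log(p) for p in (0.1, 0.1, 0.4, 0.4)],
}

vector = vectorize("Late goal decides the match", dictionary, doc_freq, num_docs)
label, scores = classify(vector, log_prior, log_likelihood)
print(label, scores)
```

The point of the sketch is that the new document must be vectorized with the same 
dictionary and document frequencies as the training corpus, so that its term 
indices line up with the weights in the trained model.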

This effort will need to be done in parallel for MRLegacy and DSL 
implementations.  Neither should be too much work.

  was:
MapReduce Naive Bayes implementation currently lacks the ability to classify a 
new document (outside of the training/holdout corpus).  I've begun some work on 
a "ClassifyNew" job which will do the following:

1. Vectorize a new text document using the dictionary and document frequencies 
from the training/holdout corpus 
    - assume the original corpus was vectorized using `seq2sparse`; step (1) 
will use all of the same parameters. 

2. Score and label a new document using a previously trained model.

I think that it will be a useful addition to the NB package.  Unfortunately, 
this is going to be mostly MR workhorse code and doesn't really introduce much 
new logic. I will try to keep any new logic separate from MR code so that it 
can be called from Scala for MAHOUT-1493.


> Naive Bayes Classifier for New Text Documents
> ---------------------------------------------
>
>                 Key: MAHOUT-1564
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1564
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.9
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> The MapReduce and DSL Naive Bayes implementations currently lack the ability to 
> classify a new document (one outside the training/holdout corpus).  This new 
> feature will do the following:
> 1. Vectorize a new text document using the dictionary and document 
> frequencies from the training/holdout corpus 
>     - assume the original corpus was vectorized using `seq2sparse`; step (1) 
> will use all of the same parameters. 
> 2. Score and label a new document using a previously trained model.
> This effort will need to be done in parallel for MRLegacy and DSL 
> implementations.  Neither should be too much work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
