[
https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388995#comment-14388995
]
ASF GitHub Bot commented on MAHOUT-1564:
----------------------------------------
GitHub user andrewpalumbo opened a pull request:
https://github.com/apache/mahout/pull/91
MAHOUT-1564 Naive Bayes Classifier for New Text Documents
I've decided to add this as a spark-shell script example. I haven't heard
much of a call for doing this at large scale, so it can serve as an example
of running Mahout spark-shell scripts, and it can be easily adapted to an
application.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1564-example
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/mahout/pull/91.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #91
----
commit a87bbc1b309d1d952da1cb7a7a141dd95b542e9f
Author: Andrew Palumbo <[email protected]>
Date: 2015-03-31T17:40:16Z
add NB document classifier script to the examples dir
----
> Naive Bayes Classifier for New Text Documents
> ---------------------------------------------
>
> Key: MAHOUT-1564
> URL: https://issues.apache.org/jira/browse/MAHOUT-1564
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.9
> Reporter: Andrew Palumbo
> Assignee: Andrew Palumbo
> Labels: DSL, legacy, scala, spark
> Fix For: 0.10.1, 0.10.0
>
>
> MapReduce and DSL Naive Bayes implementations currently lack the ability to
> classify a new document (outside of the training/holdout corpus). This new
> feature will do the following:
> 1. Vectorize a new text document using the dictionary and document
> frequencies from the training/holdout corpus
> - assume the original corpus was vectorized using `seq2sparse`; step (1)
> will use all of the same parameters.
> 2. Score and label a new document using a previously trained model.
> This effort will need to be done in parallel for MRLegacy and DSL
> implementations. Neither should be too much work.
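
The two steps in the issue description can be sketched roughly as follows. This is a minimal illustration in plain Python, not Mahout's API and not the script from the pull request: the function names (`vectorize`, `classify`), the toy dictionary, document frequencies and model weights are all hypothetical, and the IDF formula is an assumed common form rather than the exact weighting `seq2sparse` applies. In practice the weighting and tokenization would have to match whatever was used on the training corpus.

```python
import math
from collections import Counter


def vectorize(text, dictionary, doc_freq, num_docs):
    """Step 1: map a new document to a sparse TF-IDF vector {term_index: weight}.

    Only terms present in the training dictionary are kept; unseen terms are dropped.
    """
    counts = Counter(tok for tok in text.lower().split() if tok in dictionary)
    vector = {}
    for term, tf in counts.items():
        idx = dictionary[term]
        # Assumed IDF form for illustration; match the training-time weighting in practice.
        idf = math.log(num_docs / (1.0 + doc_freq[idx])) + 1.0
        vector[idx] = tf * idf
    return vector


def classify(vector, log_weights, log_priors):
    """Step 2: score the vector against each label of a multinomial NB model.

    log_weights[label][i] is the log-probability of term i under that label.
    Returns (best_label, scores).
    """
    scores = {}
    for label, weights in log_weights.items():
        scores[label] = log_priors[label] + sum(
            weight * weights[idx] for idx, weight in vector.items()
        )
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical artifacts standing in for the training corpus outputs.
dictionary = {"spark": 0, "scala": 1, "goal": 2, "match": 3}
doc_freq = [3, 2, 4, 3]   # number of training documents containing each term
num_docs = 10             # size of the training corpus

# Hypothetical trained model: per-label log term probabilities and log priors.
log_weights = {
    "tech": [math.log(p) for p in (0.4, 0.4, 0.1, 0.1)],
    "sport": [math.log(p) for p in (0.1, 0.1, 0.4, 0.4)],
}
log_priors = {"tech": math.log(0.5), "sport": math.log(0.5)}

vector = vectorize("Spark and Scala spark", dictionary, doc_freq, num_docs)
label, scores = classify(vector, log_weights, log_priors)
print(label)  # tech
```

The key point is that the new document is never used to update the dictionary or the document frequencies: it is projected into the feature space fixed at training time, so the trained model's weights line up with the vector's indices.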