[ 
https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615040#action_12615040
 ] 

Ted Dunning commented on MAHOUT-60:
-----------------------------------

Classifying a single document isn't a particularly interesting task to 
parallelize since it is already so fast.

The interesting parallel tasks are training and batch classification.  This is 
pretty much as you say.  For batch classification, I would find it tempting to 
have each map do a single document classification and emit the result.  At that 
point, you have trivial parallelism and no need for a reduce.  You need to have 
a bit of a lookup table on each mapper, but this isn't usually all that big 
(typically only thousands of interesting term weights, possibly hundreds of 
thousands for some kinds of application).  Not only do you not need the reduce, 
but you don't need three phases of map-reduce either.
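
As a rough illustration of the map-only batch job described above, here is a 
minimal Python sketch. The names (TERM_WEIGHTS, classify, map_document, 
run_map_only), the weights, and the plain highest-summed-weight scoring rule 
are all assumptions made for the example; this is not Mahout's API and not 
the complementary naive Bayes rule from the paper. It only shows the shape of 
the job: each map call looks up term weights in a small in-memory table and 
emits one result per document, with no reduce step.

```python
# Hypothetical lookup table held by every mapper: category -> {term: weight}.
# Real tables would be loaded from the trained model; these values are made up.
TERM_WEIGHTS = {
    "sports": {"goal": 1.2, "match": 0.8, "team": 0.5},
    "finance": {"stock": 1.5, "market": 0.9, "goal": 0.1},
}


def classify(text, weights=TERM_WEIGHTS):
    """Score a document against each category and return the best category.

    Assumption: the score is a simple sum of per-term weights.
    """
    terms = text.lower().split()
    scores = {
        category: sum(table.get(term, 0.0) for term in terms)
        for category, table in weights.items()
    }
    return max(scores, key=scores.get)


def map_document(doc_id, text):
    """Map function: one document in, one (doc_id, category) pair out."""
    yield doc_id, classify(text)


def run_map_only(documents):
    """Stand-in for the framework: apply the mapper to each record.

    There is no shuffle or reduce, because each output depends on one input.
    """
    results = []
    for doc_id, text in documents:
        results.extend(map_document(doc_id, text))
    return results
```

Since each map call is independent, the documents can be split across any 
number of mappers without coordination; the only shared state is the 
read-only weight table.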

Training is a different matter since it involves data that arrives organized 
one way (terms by document) but needs to be aggregated another way (terms by 
category).  That regrouping is a natural fit for map-reduce as well.
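
The regrouping described above can be sketched in a few lines of Python. 
This is only an illustration under stated assumptions: the function names 
(map_training_doc, shuffle, reduce_counts, train) are hypothetical, the 
shuffle is simulated in memory, and the output is raw per-category term 
counts rather than the normalized weights a real trainer would go on to 
compute.

```python
from collections import Counter, defaultdict


def map_training_doc(category, text):
    """Map: input is keyed by document; output is keyed by (category, term)."""
    for term, count in Counter(text.lower().split()).items():
        yield (category, term), count


def shuffle(pairs):
    """Simulated shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce: sum the counts for one (category, term) key."""
    return key, sum(values)


def train(labeled_docs):
    """Run map, shuffle, and reduce over (category, text) training records."""
    pairs = [
        pair
        for category, text in labeled_docs
        for pair in map_training_doc(category, text)
    ]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())
```

The map step sees terms one document at a time, and the shuffle is what 
brings together every count for the same category and term, which is why a 
reduce is needed here but not for batch classification.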



> Complementary Naive Bayes
> -------------------------
>
>                 Key: MAHOUT-60
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-60
>             Project: Mahout
>          Issue Type: Sub-task
>          Components: Classification
>            Reporter: Robin Anil
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.1
>
>         Attachments: MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch, 
> twcnb.jpg
>
>
> The focus is to implement an improved text classifier based on this paper 
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
