[
https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615040#action_12615040
]
Ted Dunning commented on MAHOUT-60:
-----------------------------------
Classifying a single document isn't a particularly interesting task to
parallelize since it is already so fast.
The interesting parallel tasks are training and batch classification, pretty
much as you say. For batch classification, I would be tempted to have each map
call classify a single document and emit the result. At that point, you have
trivial parallelism and no need for a reduce. Each mapper needs a lookup table
of term weights, but this isn't usually all that big (typically only thousands
of interesting term weights, possibly hundreds of thousands for some kinds of
application). Not only do you not need the reduce, but you don't need three
phases of map-reduce either.
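As a rough illustration of the map-only idea, here is a minimal Python sketch
rather than Hadoop or Mahout code. The WEIGHTS table, the classify and
map_document names, and the weight values are all hypothetical, and the scorer
is a generic linear one, not the complement naive Bayes weighting from the
paper. Each document is scored independently against a small in-memory table,
so the results need no reduce step.

```python
from collections import Counter

# Hypothetical lookup table held by every mapper: per-category term weights.
# The values are made up for illustration only.
WEIGHTS = {
    "sports": {"goal": 2.0, "match": 1.5, "team": 1.0},
    "tech": {"code": 2.0, "server": 1.5, "team": 0.5},
}


def classify(text, weights=WEIGHTS):
    """Score one document against every category and return the best label."""
    counts = Counter(text.lower().split())
    scores = {
        label: sum(n * table.get(term, 0.0) for term, n in counts.items())
        for label, table in weights.items()
    }
    return max(scores, key=scores.get)


def map_document(doc_id, text):
    """Map step: one document in, one (doc_id, label) pair out."""
    return doc_id, classify(text)


docs = [
    ("d1", "the team scored a goal in the match"),
    ("d2", "the team deployed code to the server"),
]

# Every document is handled independently, so this loop could be split
# across any number of mappers with no aggregation afterwards.
results = [map_document(doc_id, text) for doc_id, text in docs]
print(results)
```

Because map_document depends only on its own input and the read-only table,
the only shared state a real job would have to ship to each mapper is the
table itself.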
Training is a different matter since it involves data that arrives organized
one way (terms in documents) but needs to be aggregated another way (terms for
different categories). That regrouping is a natural fit for map-reduce as well.
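The regrouping in training can be sketched the same way, again in plain Python
with hypothetical names (map_training_doc, reduce_counts) and toy data. It
only shows the count aggregation, not the full weight computation: the map
step reads terms per document and emits them keyed by (category, term), and
the reduce step sums the counts that share a key.

```python
from collections import Counter, defaultdict


def map_training_doc(category, text):
    """Map step: emit ((category, term), count) for one labeled document."""
    for term, n in Counter(text.lower().split()).items():
        yield (category, term), n


def reduce_counts(pairs):
    """Reduce step: sum the counts that share a (category, term) key."""
    totals = defaultdict(int)
    for key, n in pairs:
        totals[key] += n
    return dict(totals)


# Toy labeled documents; the input is organized by document.
labeled = [
    ("sports", "goal goal match"),
    ("tech", "code server"),
    ("sports", "match team"),
]

emitted = [
    pair
    for category, text in labeled
    for pair in map_training_doc(category, text)
]

# The output is organized by category and term.
term_counts = reduce_counts(emitted)
print(term_counts)
```

In a real job the framework's shuffle would do the grouping by key that
reduce_counts does here with a dictionary.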
> Complementary Naive Bayes
> -------------------------
>
> Key: MAHOUT-60
> URL: https://issues.apache.org/jira/browse/MAHOUT-60
> Project: Mahout
> Issue Type: Sub-task
> Components: Classification
> Reporter: Robin Anil
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.1
>
> Attachments: MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch,
> twcnb.jpg
>
>
> The focus is to implement an improved text classifier based on this paper
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.