Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Bayesian (https://cwiki.apache.org/confluence/display/MAHOUT/Bayesian)


Edited by Grant Ingersoll:
---------------------------------------------------------------------
h1. Intro

Mahout currently has two implementations of Bayesian classifiers.  One is the 
traditional Naive Bayes approach, and the other is called Complementary Naive 
Bayes.

h1. Implementations

[NaiveBayes] ([MAHOUT-9|http://issues.apache.org/jira/browse/MAHOUT-9])

[Complementary Naive Bayes] 
([MAHOUT-60|http://issues.apache.org/jira/browse/MAHOUT-60])

The Naive Bayes implementations in Mahout follow the paper 
[http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf]. Before we get to the 
actual algorithm, let's discuss the terminology.

Given an input set of classified documents with: 
# N features, indexed by j 
# L labels, indexed by k

Then:

# The Normalized Frequency for a term (feature) in a document is calculated by 
dividing the term frequency by the root mean square of the term frequencies in 
that document.
# The Weight Normalized Tf for a given feature in a given label is the sum of 
the Normalized Frequency of the feature across all the documents in the label. 
# The Weight Normalized Tf-Idf for a given feature in a label is the Weight 
Normalized Tf multiplied by the standard idf.
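
The three definitions above can be sketched in Java as follows. This is a minimal illustration, not Mahout's actual code: the class and method names are hypothetical, documents are assumed to be dense term-frequency vectors, the root mean square is assumed to run over every entry of that vector (zeros included), and "standard idf" is assumed to mean log(totalDocuments / documentFrequency).

```java
import java.util.Arrays;

final class WeightNormalizedTfIdfSketch {

    // Normalized Frequency: each term frequency divided by the root mean
    // square of the document's term frequencies (assumed: mean over all entries).
    static double[] normalizedFrequencies(double[] termFrequencies) {
        double sumOfSquares = 0.0;
        for (double tf : termFrequencies) {
            sumOfSquares += tf * tf;
        }
        double rms = Math.sqrt(sumOfSquares / termFrequencies.length);
        double[] normalized = new double[termFrequencies.length];
        for (int j = 0; j < termFrequencies.length; j++) {
            normalized[j] = rms == 0.0 ? 0.0 : termFrequencies[j] / rms;
        }
        return normalized;
    }

    // Weight Normalized Tf: per-feature sum of Normalized Frequency over
    // all documents that carry one label.
    static double[] weightNormalizedTf(double[][] documentsInLabel, int numFeatures) {
        double[] sums = new double[numFeatures];
        for (double[] document : documentsInLabel) {
            double[] normalized = normalizedFrequencies(document);
            for (int j = 0; j < numFeatures; j++) {
                sums[j] += normalized[j];
            }
        }
        return sums;
    }

    // Weight Normalized Tf-Idf: Weight Normalized Tf times idf, where idf is
    // assumed to be log(totalDocuments / documentFrequency).
    static double[] weightNormalizedTfIdf(double[] weightNormalizedTf,
                                          int[] documentFrequency,
                                          int totalDocuments) {
        double[] result = new double[weightNormalizedTf.length];
        for (int j = 0; j < result.length; j++) {
            double idf = documentFrequency[j] == 0
                    ? 0.0
                    : Math.log((double) totalDocuments / documentFrequency[j]);
            result[j] = weightNormalizedTf[j] * idf;
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] documentsInLabel = {{3.0, 4.0}, {1.0, 0.0}};
        double[] tf = weightNormalizedTf(documentsInLabel, 2);
        double[] tfIdf = weightNormalizedTfIdf(tf, new int[] {2, 1}, 4);
        System.out.println(Arrays.toString(tfIdf));
    }
}
```

With this normalization, the squared Normalized Frequencies of a document sum to the vector length, so long documents do not dominate a label's totals simply by containing more terms.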

Once the Weight Normalized Tf-Idf (W-N-Tf-Idf) is calculated, the final weight 
matrices for Bayes and CBayes are calculated as follows.

For each label, we calculate the sum of W-N-Tf-Idf over all the features in 
that label, called Sigma_k or sumLabelWeight.

For Bayes
{noformat}
Weight = Log [ ( W-N-Tf-Idf + alpha_i ) / ( Sigma_k + N  ) ]
{noformat}
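
As a rough sketch of the Bayes formula above, the following Java computes Sigma_k and the resulting weight for one (label, feature) cell. The names are hypothetical and not taken from Mahout; the matrix `w[k][j]` is assumed to hold W-N-Tf-Idf for label k and feature j, and alpha is assumed to be a single smoothing constant shared by all features.

```java
final class BayesWeightSketch {

    // Sigma_k (sumLabelWeight): sum of W-N-Tf-Idf over all features of each label.
    static double[] sumLabelWeights(double[][] w) {
        double[] sigmaK = new double[w.length];
        for (int k = 0; k < w.length; k++) {
            for (double value : w[k]) {
                sigmaK[k] += value;
            }
        }
        return sigmaK;
    }

    // Weight = log((W-N-Tf-Idf + alpha) / (Sigma_k + N)), with N the number of features.
    static double bayesWeight(double wnTfIdf, double sigmaK, double alpha, int numFeatures) {
        return Math.log((wnTfIdf + alpha) / (sigmaK + numFeatures));
    }

    public static void main(String[] args) {
        double[][] w = {{1.0, 2.0, 3.0}}; // one label, three features (made-up values)
        double[] sigmaK = sumLabelWeights(w);
        for (int j = 0; j < w[0].length; j++) {
            System.out.println(bayesWeight(w[0][j], sigmaK[0], 1.0, w[0].length));
        }
    }
}
```

When alpha is 1, the exponentiated weights of a label sum to 1 over its features, which is a quick sanity check on an implementation.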
For CBayes

We calculate the sum of W-N-Tf-Idf across all labels for a given feature. We 
call this sumFeatureWeight or Sigma_j.
We also sum the W-N-Tf-Idf weights over all (feature, label) pairs in the 
training set. Call this Sigma_jSigma_k.

Final Weight is calculated as
{noformat}
Weight = Log [ ( Sigma_j - W-N-Tf-Idf + alpha_i ) / ( Sigma_jSigma_k - Sigma_k + N  ) ]
{noformat}
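
The CBayes formula above can be sketched the same way. This is an illustrative Java fragment with hypothetical names, not Mahout's implementation; `w[k][j]` is assumed to be a rectangular matrix of W-N-Tf-Idf values (label k, feature j) and alpha a single smoothing constant.

```java
final class ComplementaryBayesWeightSketch {

    // Computes the CBayes weight for every (label, feature) pair:
    // log((Sigma_j - w[k][j] + alpha) / (Sigma_jSigma_k - Sigma_k + N)).
    static double[][] complementWeights(double[][] w, double alpha) {
        int numLabels = w.length;
        int numFeatures = w[0].length;
        double[] sigmaJ = new double[numFeatures]; // sumFeatureWeight per feature
        double[] sigmaK = new double[numLabels];   // sumLabelWeight per label
        double sigmaJSigmaK = 0.0;                 // sum over all pairs
        for (int k = 0; k < numLabels; k++) {
            for (int j = 0; j < numFeatures; j++) {
                sigmaJ[j] += w[k][j];
                sigmaK[k] += w[k][j];
                sigmaJSigmaK += w[k][j];
            }
        }
        double[][] weights = new double[numLabels][numFeatures];
        for (int k = 0; k < numLabels; k++) {
            double denominator = sigmaJSigmaK - sigmaK[k] + numFeatures;
            for (int j = 0; j < numFeatures; j++) {
                weights[k][j] = Math.log((sigmaJ[j] - w[k][j] + alpha) / denominator);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        double[][] w = {{1.0, 2.0}, {3.0, 4.0}}; // two labels, two features (made-up values)
        double[][] weights = complementWeights(w, 1.0);
        System.out.println(weights[0][0] + " " + weights[0][1]);
    }
}
```

The numerator and denominator use only the mass outside label k, so each weight describes how strongly a feature is associated with every label other than k.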

h1. Examples

In Mahout's example code, there are two samples that can be used:

# [Wikipedia Bayes Example] - Classify Wikipedia data.

# [Twenty Newsgroups] - Classify the classic Twenty Newsgroups data.

