Hi Loek,

The n-gram option considers each sequence of n words as a single token in the traditional naive Bayes sense. Therefore it gives a boost to those word sequences.
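As a rough illustration of what that tokenization looks like (a hypothetical sketch of the idea, not the actual NGrams.java code):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: emit every run of up to maxN consecutive words
    // as one token. Not Mahout's NGrams.java; just n-grams-as-tokens.
    public class WordNGramSketch {
      public static List<String> ngrams(String text, int maxN) {
        String[] words = text.split("\\s+");
        List<String> tokens = new ArrayList<String>();
        for (int n = 1; n <= maxN; n++) {
          for (int start = 0; start + n <= words.length; start++) {
            StringBuilder gram = new StringBuilder(words[start]);
            for (int k = 1; k < n; k++) {
              gram.append(' ').append(words[start + k]);
            }
            // Each n-gram becomes an ordinary "word" for the classifier to count.
            tokens.add(gram.toString());
          }
        }
        return tokens;
      }
    }

So ngrams("the quick brown fox", 2) would yield the four unigrams plus "the quick", "quick brown", and "brown fox", each counted as a feature like any other word.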
The current implementation is based on Jason Rennie's paper on Complementary Naive Bayes. The optimisations he used (other than the complementary class) are also applied in the NaiveBayes implementation; a rough sketch of the complement-class weighting appears after the quoted message below.

Robin

On Tue, Jan 26, 2010 at 4:35 PM, Loek Cleophas <[email protected]> wrote:

> Hi
>
> I was looking at the naive Bayes classifier's implementation, due to my
> surprise at the n-gram parameter being used.
>
> My understanding of 'traditional' naive Bayes is that it only considers
> probabilities related to single words/tokens, independent of context. Is
> that not what the Mahout implementation does? Are the N-grams used to also
> model N-sequences of tokens as "words" to be dealt with in the algorithm? Or
> are they used as input in some other way?
>
> It seems it uses "N-grams" of N tokens, not N characters, from what I
> gather from NGrams.java. Or are they not related to token sequences but to
> character sequences somehow?
>
> Any help or pointers to materials the implementation is based on would be
> appreciated. (I know that the Complementary Naive Bayes implementation is
> quite different and based on a paper introducing that method - but I'm
> wondering about the 'normal' Naive Bayes implementation.)
>
> Regards,
> Loek
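(For reference, here is a rough sketch of the complement-class weighting from Rennie et al., "Tackling the Poor Assumptions of Naive Bayes Text Classifiers". The smoothing constant and the weight normalization are two of the optimisations mentioned above. This is an illustration under my own assumptions, not Mahout's actual code.)

    // Hypothetical sketch of complement-class weights from Rennie et al. (2003);
    // not Mahout's implementation. termCounts[c][i] = count of term i in class c.
    public class CnbWeightSketch {
      public static double[][] complementWeights(double[][] termCounts, double alphaI) {
        int numClasses = termCounts.length;
        int numTerms = termCounts[0].length;
        double[] termTotals = new double[numTerms];   // count of term i over all classes
        double[] classTotals = new double[numClasses]; // total count in class c
        double grandTotal = 0;
        for (int c = 0; c < numClasses; c++) {
          for (int i = 0; i < numTerms; i++) {
            termTotals[i] += termCounts[c][i];
            classTotals[c] += termCounts[c][i];
            grandTotal += termCounts[c][i];
          }
        }
        double[][] w = new double[numClasses][numTerms];
        for (int c = 0; c < numClasses; c++) {
          double complementTotal = grandTotal - classTotals[c];
          double norm = 0;
          for (int i = 0; i < numTerms; i++) {
            // Count of term i in every class EXCEPT c, with Laplace-style smoothing.
            double nci = termTotals[i] - termCounts[c][i];
            w[c][i] = Math.log((nci + alphaI) / (complementTotal + alphaI * numTerms));
            norm += Math.abs(w[c][i]);
          }
          // Weight normalization, one of the paper's fixes for naive Bayes's
          // poor assumptions; a document is assigned the class c minimizing
          // sum_i f_i * w[c][i], where f_i is the term frequency in the document.
          for (int i = 0; i < numTerms; i++) {
            w[c][i] /= norm;
          }
        }
        return w;
      }
    }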
