Hi,

On May/07/2010, Thomas Dunham wrote:
> Thanks Carles, will try to force some of this into my head this
> weekend....

Good! For me, one of the keys is in:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier

Just above "Using the Bayesian result". I read the formulas like this:
-The probability of the document being spam is the product of the
 probabilities of each individual word in the document being spam

Also interesting:
http://en.wikipedia.org/wiki/Bayesian_spam_filtering

See where it talks about "Combining individual probabilities" (it
discusses the assumptions and links to the previous Wikipedia article).

The other key is in the file reverend/thomas.py, in buildCache, where it
computes the probability of each token belonging to each group. The
thing is that it is doing some "magic" with the metrics that, at the
moment, I'm not following very well (what it does and why it is needed).

So, at a very high level, it does:

Training:
-Tokenizes the input
-Saves how many times each word appears in the corpus

buildCache (so, part of guessing if no more training is done):
-Calculates, per token, how likely it is to be in each category (and
 something else with the good and badMetric that I'm not following)

Guessing:
-Tokenizes the new input
-Combines the probabilities of each token of the input, using the cache
 to know how likely each token is to belong to each category.

I think this is a very high-level picture, surely with some mistakes. If
someone can calculate one example by hand and gets the same result as
Reverend, they would get some extra points :-D

I'm only quite confused by some things in buildCache...

-- 
Carles Pina i Estany
	http://pinux.info
_______________________________________________
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk
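
PS: a minimal Python sketch of the three high-level steps above
(training, building a cache, guessing), small enough to check by hand.
Everything in it is an assumption made for illustration: TinyGuesser,
tokenize and build_cache are hypothetical names, the smoothing is plain
add-one, equal priors are assumed, and the combination is the simple
naive Bayes product from the Wikipedia article. It does not reproduce
whatever reverend/thomas.py does with the good and bad metrics, so its
numbers should not be expected to match Reverend's.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lower-case word tokens; a stand-in for a real tokenizer.
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyGuesser:
    """Hypothetical toy classifier, NOT Reverend's implementation."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # category -> token -> count
        self.cache = {}  # token -> {category: P(category | token)}

    def train(self, category, text):
        # Training: tokenize and count how often each word appears.
        self.counts[category].update(tokenize(text))
        self.cache = {}  # new training data invalidates the cache

    def build_cache(self):
        # Per token, estimate how likely it is to belong to each category.
        categories = list(self.counts)
        vocabulary = set().union(*self.counts.values())
        totals = {c: sum(self.counts[c].values()) for c in categories}
        for token in vocabulary:
            # P(token | category) with add-one smoothing (an assumption).
            likelihood = {
                c: (self.counts[c][token] + 1) / (totals[c] + len(vocabulary))
                for c in categories
            }
            norm = sum(likelihood.values())
            # Equal priors assumed, so normalising gives P(category | token).
            self.cache[token] = {c: likelihood[c] / norm for c in categories}

    def guess(self, text):
        categories = list(self.counts)
        if not categories:
            return {}
        if not self.cache:
            self.build_cache()
        scores = {c: 1.0 for c in categories}
        for token in tokenize(text):
            if token not in self.cache:
                continue  # tokens never seen in training are ignored
            for c in categories:
                scores[c] *= self.cache[token][c]
        total = sum(scores.values())
        return {c: scores[c] / total for c in categories}


guesser = TinyGuesser()
guesser.train("spam", "buy cheap pills")
guesser.train("ham", "meeting at noon")
print(guesser.guess("cheap pills"))  # spam ~0.8, ham ~0.2
```

By hand: the vocabulary has 6 words and each category has 3 tokens, so
P(cheap | spam) = (1+1)/(3+6) = 2/9 and P(cheap | ham) = (0+1)/(3+6) =
1/9, which normalises to 2/3 and 1/3. The same holds for "pills". The
products are 4/9 for spam and 1/9 for ham, so after normalising, spam
gets 0.8 and ham gets 0.2.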