I would really appreciate some help understanding the file format that IMail uses for its Bayesian implementation.

It's fairly clear that the value following each word is the probability, but I don't understand the two columns after it.  Maybe one or both are counts of occurrences in the seed data (ham or spam, depending on the column), and the first line gives the ham and spam totals used to generate the probabilities?

I am also not sure what the range of the probabilities is.  They are clearly implying a decimal value with absolute ham being 0, but I'm not sure whether they would consider absolute spam to be 0.5 or some other value.
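
One way to explore the range question is to pick a candidate scale and see where the sample values land. The sketch below assumes, purely as a guess from the sample, that the second column is an unsigned 32-bit integer, so the full range would be 2^32. Nothing here is documented IMail behavior, and `as_fraction` is a made-up helper name.

```python
# ASSUMPTION: the second column is an unsigned 32-bit integer.
# This is a guess from the sample values, not documented IMail behavior.
UINT32_RANGE = 2 ** 32


def as_fraction(raw_value: int, scale: int = UINT32_RANGE) -> float:
    """Map a raw column value onto [0, 1) under the assumed scale."""
    return raw_value / scale


if __name__ == "__main__":
    # Smallest and largest values taken from the sample in this post.
    for word, raw in [("specializing", 10264), ("httpwwwasandoxc", 4191261084)]:
        print(f"{word}: {as_fraction(raw):.6f}")
```

If a different ceiling is suspected, passing another `scale` shows how the same values would read under that assumption.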

Here's a sample of the file, showing the very top and very bottom of the ~275,000 entries:
263615,731600
specializing,10264,362,1254
graciously,583326,2063,93
bringing,616976,2631,3128
mbps,1633464,367,321
tantra,2811106,96,38
expiration,3319151,712,1504
exports,3327519,283,173
windowsnt,3735177,178,2
matching,3743773,1767,3397
...
colorredour,3290975262,0,13
attentionhello,3300283604,0,13
styleborderbott,3340227216,0,25
danjohnsonilsen,3421842793,0,24
nextpartccafcbc,3651767529,0,48
swingilsenet,3916557290,0,24
zimbabwan,4042354847,0,13
httpwwwasandoxc,4191261084,0,14
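
To check the guess about the columns, the sample can be parsed and a count-based estimate compared with the second column. This is only a sketch under stated assumptions: that the first line is (ham total, spam total), that the third and fourth columns are ham and spam counts in that order, and that the second column is scaled by 2^32. None of these are confirmed, and the estimate is a plain relative-frequency ratio, not necessarily what IMail computes.

```python
import csv
import io

# A few rows copied from the sample in this post.
SAMPLE = """263615,731600
specializing,10264,362,1254
windowsnt,3735177,178,2
zimbabwan,4042354847,0,13
httpwwwasandoxc,4191261084,0,14
"""


def parse(text):
    """Return ((first, second) header totals, list of (word, value, col3, col4))."""
    rows = csv.reader(io.StringIO(text))
    first, second = (int(v) for v in next(rows))
    entries = [(w, int(v), int(a), int(b)) for w, v, a, b in rows]
    return (first, second), entries


def count_based_estimate(ham_count, spam_count, ham_total, spam_total):
    """Relative-frequency spam estimate.

    ASSUMPTION: column 3 is a ham count, column 4 a spam count, and the
    header line holds the matching totals. Returns None if both are zero.
    """
    ham_rate = ham_count / ham_total
    spam_rate = spam_count / spam_total
    if ham_rate + spam_rate == 0:
        return None
    return spam_rate / (ham_rate + spam_rate)


if __name__ == "__main__":
    totals, entries = parse(SAMPLE)
    for word, value, ham_count, spam_count in entries:
        estimate = count_based_estimate(ham_count, spam_count, *totals)
        # ASSUMPTION: column 2 is scaled by 2**32; printed for comparison only.
        print(word, value / 2 ** 32, estimate)
```

If the two printed numbers do not track each other across the file, that would suggest one of the assumptions is wrong, for example that the second column is something other than a probability.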


Thanks,

Matt
-- 
=====================================================
MailPure custom filters for Declude JunkMail Pro.
http://www.mailpure.com/software/
=====================================================

