Hello Eric,

>> Anyway, this little Bash script can make manual mail
>> sorting for ASSP learning corpus less effort-intensive.
>>
>> Since sorting moves mail to folders sorted/spam,
>> sorted/notspam, etc, one can stop sorting at any time by
>> pressing Ctrl-C and not lose any sorting results.

> I haven't looked at your script but was wondering if you could give a brief
> explanation as to how it worked? From my understanding of your msg, you are
> using the bayesian filter to determine if it is spam/notspam, in which case,
> wouldn't it already be sorted properly?

Not really: false positives and false negatives happen, and in the
initial phases of learning you sometimes (meaning: in some
organizations) really have to sift through all that mail manually.
False negatives are not a big deal, but false positives are.

> Furthermore, why do you need to
> connect to the HTTP interface for every msg? Are you using the Mail
> Analyzer test page to determine if it is spam or not spam?

Yes. My reasoning is that if the spam probability is really high, one
can accept a small risk of deleting a false positive. That obviously
amounts to blackholing a legitimate message, but the risk is
relatively low, while the number of spams left for manual sorting is
drastically reduced.

> In which case,
> what difference is there from the time the email actually passed through the
> bayesian test on its own? Do you find a lot of email misclassified?

Quite a number actually, though this obviously depends on the
definition - after training the filter with several hundred mails I
(or rather: my coworkers) still found a few dozen false positives
in some 1000 mails altogether (more or less 300 notspam and more or
less 500 spam).

This may not look like much, but some of those mails are really
important. Sometimes they come from business partners or potential
business partners (purchase inquiries etc.) and one _really_ cannot
afford to lose them. The "automatic whitelisting" feature alleviates
this later on, but it does not help with people emailing you out of
the blue about your product.

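To make the threshold reasoning above concrete, here is a minimal
sketch of the decision step only. It is not the actual script: the
function names, the 0.98 cutoff and the "review" label are assumptions
made for illustration, and how the spam probability is obtained from
ASSP is deliberately left out. High-probability spam is moved to
sorted/spam rather than deleted, so a mistake can still be undone.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: names and the 0.98 cutoff are illustrative
# assumptions, not taken from the real script.

SPAM_CUTOFF="${SPAM_CUTOFF:-0.98}"   # assumed threshold; tune per site

# Print "spam" if the probability reaches the cutoff, else "review".
# awk does the comparison because Bash has no floating-point arithmetic.
triage() {
    local prob="$1" cutoff="${2:-$SPAM_CUTOFF}"
    awk -v p="$prob" -v c="$cutoff" \
        'BEGIN { if (p + 0 >= c + 0) print "spam"; else print "review" }'
}

# Move a message to sorted/spam when its (already obtained) probability
# is above the cutoff; otherwise leave it in place for manual sorting.
sort_message() {
    local file="$1" prob="$2"
    if [ "$(triage "$prob")" = "spam" ]; then
        mkdir -p sorted/spam && mv -- "$file" sorted/spam/
    fi
}
```

Usage would be along the lines of `sort_message msg123.eml 0.995`,
where both the file name and the probability are made-up values. The
higher the cutoff, the fewer legitimate mails are at risk, but the more
spam remains to be looked through by hand.
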
I may be attempting the "premature optimization" that, as we know, is
the root of all evil, but I found that training the Bayesian filter
with as few as 50 spams and 50 notspams already gives pretty good
results. With a carefully chosen spam probability cutoff one can then
throw away a good portion of the spam right away and not have to look
through it manually. The script is really just meant to make the
filter's learning curve steeper.

-- 
Best regards,
 mrkafk                          mailto:[EMAIL PROTECTED]

_______________________________________________
Assp-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/assp-user
