Hello Eric,

>> Anyway, this little Bash script can make manual mail
>> sorting for ASSP learning corpus less effort-intensive.
>>
>> Since sorting moves mail to folders sorted/spam,
>> sorted/notspam, etc, one can stop sorting at any time by
>> pressing Ctrl-C and not lose any sorting results.

> I haven't looked at your script but was wondering if you could give a brief
> explanation as to how it worked?  From my understanding of your msg, you are
> using the Bayesian filter to determine if it is spam/notspam, in which case,
> wouldn't it already be sorted properly?

Not really: false positives and false negatives do happen,
and in the initial phases of learning you sometimes (that
is, in some organizations) really have to sift through all
of the mail manually. False negatives are not a big deal,
but false positives are.

> Furthermore, why do you need to
> connect to the HTTP interface for every msg?  Are you using the Mail
> Analyzer test page to determine if it is spam or not spam?

Yes. My reasoning is that if the spam probability is really
high, one can accept a small risk of deleting a false
positive. That obviously amounts to blackholing a legitimate
message, but the risk is relatively low, while the number of
spams left for manual sorting is drastically reduced.
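
To illustrate, that triage could look roughly like the
sketch below. It is Python, not the Bash script itself, and
it assumes the spam probability has already been obtained
from the analyzer. The function names and the 0.99 cutoff
are made up for the example.

```python
# Hypothetical triage rule: the names and the 0.99 cutoff are
# illustrative, not taken from the actual script or from ASSP.
def triage(spam_probability, discard_above=0.99):
    """Return 'discard' for near-certain spam, else 'manual'."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam probability must be between 0 and 1")
    return "discard" if spam_probability >= discard_above else "manual"


def split_for_review(scored_mails, discard_above=0.99):
    """Split (filename, probability) pairs into discarded and to-review lists."""
    discarded, to_review = [], []
    for filename, probability in scored_mails:
        if triage(probability, discard_above) == "discard":
            discarded.append(filename)
        else:
            to_review.append(filename)
    return discarded, to_review
```

Only the mails in the second list still need a human to look
at them.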

> In which case,
> what difference is there from the time the email actually passed through the
> Bayesian test on its own?  Do you find a lot of email misclassified?

Quite a number actually, though this obviously depends on
the definition. After training the filter with several
hundred mails I (or rather, my coworkers) still found a few
dozen false positives among some 1000 mails altogether
(roughly 300 notspam and roughly 500 spam). This may not
look like much, but some of those mails are really
important.

Sometimes those are mails from business partners or
potential business partners (purchase inquiries etc.) and
one _really_ cannot afford to lose them. This is later
alleviated by the "automatic whitelisting" feature, but that
does not help with people emailing you out of the blue about
your product.

I may be attempting the "premature optimization" that is the
root of all evil, as we know, but I found that training the
Bayesian filter with as few as 50 spams and 50 notspams
already gives pretty good results. With a carefully chosen
spam probability threshold one can throw away a good portion
of the spam right away and not have to look through it
manually.
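
One possible way to choose that threshold, sketched in
Python: take the scores of a hand-sorted sample and put the
cutoff just above the highest-scoring notspam. This rule,
the function name and the 0.01 margin are assumptions for
the example, not what the script does, and a small sample
gives no guarantee about future mail.

```python
# Hypothetical cutoff selection from hand-sorted, already-scored mail.
def pick_cutoff(notspam_scores, spam_scores, margin=0.01):
    """Pick a discard cutoff above every hand-sorted notspam score.

    Returns (cutoff, fraction of the sampled spam at or above it).
    """
    # Stay above the worst-scoring legitimate mail, capped at 1.0.
    cutoff = min(1.0, max(notspam_scores, default=0.0) + margin)
    if not spam_scores:
        return cutoff, 0.0
    caught = sum(1 for score in spam_scores if score >= cutoff)
    return cutoff, caught / len(spam_scores)


cutoff, fraction = pick_cutoff([0.10, 0.62], [0.20, 0.70, 0.95, 0.99])
print(f"cutoff={cutoff:.2f}, spam discarded automatically={fraction:.0%}")
```

The second number shows how much of the sampled spam would
no longer need manual sorting at that cutoff.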

The script is really just designed to make the filter's
learning curve steeper.

-- 
Best regards,
mrkafk
mailto:[EMAIL PROTECTED]

_______________________________________________
Assp-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/assp-user