John,

On 5/5/21 4:55 PM, John Hardin wrote:
> That said, what we really need is ham in non-English languages. If
> there's any way you can get more good (accurately classified)
> non-English ham, that would be the greatest benefit.

My ham is mostly German, I would say 90%+. Spam is mostly English, but also quite a bit of Italian, Spanish and French. From time to time Russian or Chinese :-)

> Do you know anyone (perhaps family members) who would trust you with a
> copy of their ham emails to add to your corpus?

Sure there are, but I'm not so sure that their judgement related to spam/phish can be trusted without massive manual intervention ;-)

> Is your ham corpus limited to what you've used to train Bayes? Or do
> you really get that little email? Put more in. About the only
> properly-classified ham I *wouldn't* put into masscheck corpora would
> be emails discussing spam (e.g. the SA users list is a big no-no).

My ham is what ended up in my inbox and has not been sorted out as spam. Most of the mail I get daily is from mailing lists, but those get moved automatically (thanks to sieve) into subfolders which do not end up in my ham corpus. My inbox contains 1:1 mail and quite a bunch of newsletters (which I registered for). Also, all bounces and stuff like that go directly into subfolders and are therefore **not** in my corpus.

I could put much more ham in if I dig deeper into my archive folders, but I thought mail that is too old is not good for masschecks. For the spam corpus, I delete everything older than 30 days before running masscheck.

Cheers and have a good one

tobi
