On Wed, 5 May 2021, [email protected] wrote:

John,

On 5/5/21 4:55 PM, John Hardin wrote:

That said, what we really need is ham in non-English languages. If
there's any way you can get more good (accurately classified)
non-English ham, that would be the greatest benefit.

My ham is mostly German, I would say about 90%+.

Yay!

Spam is mostly English, but there is also quite a bit of Italian, Spanish and French, and from time to time Russian or Chinese :-)

Do you know anyone (perhaps family members) who would trust you with a
copy of their ham emails to add to your corpus?

Sure there are, but I'm not so sure that their judgement regarding
spam/phish can be trusted without massive manual intervention ;-)

That is certainly part of it if anyone other than you is contributing to the corpora. You need to verify the correct classification of the messages they provide. It's just like vetting Bayes training messages (FPs and FNs) provided by users if you're an admin.

Is your ham corpus limited to what you've used to train Bayes? Or do
you really get that little email? Put more in. About the only
properly-classified ham I *wouldn't* put into masscheck corpora would
be emails discussing spam (e.g. the SA users list is a big no-no).

My ham is what ended up in my inbox and has not been sorted out as spam.

Ah. It's a BAD idea to train Bayes from or run masschecks directly against your inbox, because if you happen to fall behind for any reason then spams may be learned/scanned as ham.

It's better to set up separate email folders for messages that you have actually seen and confirmed as ham, then train/masscheck those folders.

Most of the mail I get daily is from mailing lists, but those messages
are moved automatically (thanks to sieve) into subfolders which do not
end up in my ham corpus. My inbox contains 1:1 mail and quite a few
newsletters (which I registered for). Also, all bounces and similar
messages go directly into subfolders and are therefore **not** in my corpus.

If you know it's ham, it should be in your corpus. (Except, again, for something like the SA users list where we discuss spam signs and post examples, and things like non-delivery notices if you get backscatter.)
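
As a rough illustration of that exception, here is a minimal Python sketch that drops messages from spam-discussion lists before they reach a ham corpus. It is not part of masscheck. The function names and the `List-Id` value are assumptions for illustration only, so check the headers of your own list mail before relying on it.

```python
from email import message_from_string

# Assumption: List-Id substrings identifying lists where spam is discussed.
# Verify the real value in your own mail; this one is only a guess.
EXCLUDED_LIST_IDS = ("users.spamassassin.apache.org",)


def is_spam_discussion(msg, excluded=EXCLUDED_LIST_IDS):
    """Return True if the message's List-Id matches an excluded list."""
    list_id = str(msg.get("List-Id") or "").lower()
    return any(marker in list_id for marker in excluded)


def filter_corpus(messages, excluded=EXCLUDED_LIST_IDS):
    """Keep only messages that do not come from an excluded list."""
    return [m for m in messages if not is_spam_discussion(m, excluded)]
```

Messages without a `List-Id` header (ordinary 1:1 mail) pass through untouched. Non-delivery notices would need a separate check, which this sketch does not attempt.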

I could put much more ham in if I dig deeper into my archive folders

Good!

but I thought mail that is too old is not good for masschecks. For the spam corpus I delete everything older than 30 days before running masscheck.

No. Ham a couple of years old is still useful, as the character of ham changes much more slowly than it does for spam.

The masscheck process has inherent - and different - age limits for ham and spam corpora, coded into the distributed script. Let those limits take care of it and feed it whatever you can get. I wouldn't *manually* filter by date until it's five years old, and that's only to reduce the amount of stuff the script needs to discard.
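
If you do want to pre-trim by hand at the five-year mark, a minimal Python sketch could look like the following. It is not the masscheck script's own logic. The function names and the 5 * 365 day cutoff are assumptions for illustration, and messages with a missing or unparseable Date header are deliberately kept so the script's own limits can deal with them.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

FIVE_YEARS_DAYS = 5 * 365  # assumption: rough five-year cutoff in days


def is_older_than(msg, max_age_days=FIVE_YEARS_DAYS, now=None):
    """Return True if the Date header is more than max_age_days in the past."""
    now = now or datetime.now(timezone.utc)
    raw = msg.get("Date")
    if raw is None:
        return False  # no date: keep the message
    try:
        sent = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        return False  # unparseable date: keep the message
    if sent.tzinfo is None:
        # Naive result (e.g. a "-0000" zone): treat it as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return now - sent > timedelta(days=max_age_days)


def split_by_age(messages, max_age_days=FIVE_YEARS_DAYS, now=None):
    """Split messages into (keep, discard) lists by age."""
    keep, discard = [], []
    for msg in messages:
        target = discard if is_older_than(msg, max_age_days, now) else keep
        target.append(msg)
    return keep, discard
```

The messages can come from Python's standard `mailbox` module (for example by iterating over a `mailbox.Maildir`), which yields objects with the same `get()` interface used here.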

Cheers and have a good one

Likewise!

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 [email protected]                         pgpk -a [email protected]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
 3 days until the 76th anniversary of VE day
