John,

On 5/5/21 4:55 PM, John Hardin wrote:
>
>
> That said, what we really need is ham in non-English languages. If
> there's any way you can get more good (accurately classified)
> non-English ham, that would be the greatest benefit.

My ham is mostly German, I would say 90% or more. My spam is mostly
English, but a fair amount is Italian, Spanish and French, and from time
to time Russian or Chinese :-)


>
> Do you know anyone (perhaps family members) who would trust you with a
> copy of their ham emails to add to your corpus?

Sure, there are, but I'm not so sure that their judgement about
spam/phish can be trusted without massive manual intervention ;-)


>
> Is your ham corpus limited to what you've used to train Bayes? Or do
> you really get that little email? Put more in. About the only
> properly-classified ham I *wouldn't* put into masscheck corpora would
> be emails discussing spam (e.g. the SA users list is a big no-no).

My ham is whatever ended up in my inbox and has not been sorted out as
spam. Most of the mail I get daily comes from mailing lists, but those
messages are moved automatically (thanks to sieve) into subfolders, which
do not end up in my ham corpus. My inbox contains 1:1 mail and quite a
few newsletters (which I signed up for). All bounces and similar messages
also go directly into subfolders and are therefore **not** in my corpus.

I could put in much more ham if I dug deeper into my archive folders, but
I thought mail that is too old is not good for masschecks. For the spam
corpus, I delete everything older than 30 days before running masscheck.
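
A minimal sketch of that kind of age-based pruning, not necessarily how it
is done here. It assumes the corpus is a directory tree with one message
per file (for example a Maildir) and that each file's modification time
approximates when the message arrived. The function name
prune_old_messages and its parameters are hypothetical.

```python
import time
from pathlib import Path


def prune_old_messages(corpus_dir, max_age_days=30, now=None, dry_run=True):
    """Return files older than max_age_days; delete them unless dry_run."""
    # Assumption: file mtime stands in for the message's arrival time.
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in sorted(Path(corpus_dir).rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Running it with dry_run=True first only lists what would be deleted, which
is a cheap safeguard before pruning a hand-sorted corpus.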

Cheers and have a good one


tobi



