On Wed, 5 May 2021, [email protected] wrote:
> Hello
> I'm new to masscheck, nothing uploaded yet, and have two questions
Welcome aboard!
> As my spam corpus comes from my traps and my ham "just" from my personal
> addresses, there is quite an imbalance between my spam and ham corpora
> (300 ham and several thousand spam). Is such an imbalance a problem for
> reliable masscheck?
"Reliable"? No, the balance doesn't affect reliability. What affects
reliability is the accuracy of the classification of the messages in your
corpora - ham really needs to be *ham*. Misclassification has a greater
impact than a poor ratio. Spend some time making sure it's correctly
classified.
That said, what we really need is ham in non-English languages. If there's
any way you can get more good (accurately classified) non-English ham,
that would be the greatest benefit.
Your masscheck corpora don't leave your machine; only the rule hit stats
get uploaded, so it's not a potential privacy violation (or not much of
one). Do you know anyone (perhaps family members) who would trust you with
a copy of their ham emails to add to your corpus?
Is your ham corpus limited to what you've used to train Bayes? Or do you
really get that little email? Put more in. About the only
properly-classified ham I *wouldn't* put into masscheck corpora would be
emails discussing spam (e.g. the SA users list is a big no-no).
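If it helps, here is a minimal Python sketch of one way to screen
spam-discussion list traffic out of a ham corpus before running masscheck.
It is not part of masscheck itself. The function name, the tuple of list
IDs, and the idea of matching on the List-Id header are all assumptions
made for illustration; check the List-Id values your own list mail
actually carries before relying on it.

```python
from email import message_from_string

# Assumption: substrings of List-Id values for lists that discuss spam.
# Adjust to whatever your own list mail actually contains.
SPAM_DISCUSSION_LIST_IDS = ("users.spamassassin.apache.org",)


def is_spam_discussion(raw_message, list_ids=SPAM_DISCUSSION_LIST_IDS):
    """Return True if the message's List-Id header matches an excluded list."""
    msg = message_from_string(raw_message)
    # A missing header yields None; normalise to an empty lowercase string.
    list_id = str(msg.get("List-Id") or "").lower()
    return any(lid in list_id for lid in list_ids)
```

Messages for which this returns True would be left out of the ham
corpus. Anything without a matching List-Id header still needs checking
by hand.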
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
[email protected] pgpk -a [email protected]
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79