On 8/13/2012 3:28 AM, Daniel Lemke wrote:
There isn't actually a wiki article for the ruleqa list, is there?
Not yet, no.
I hope it's ok to continue posting questions regarding the mass check setup
stuff here, otherwise please let me know so I can move the discussion to users
or dev list.
Mass check is definitely within the purpose of the ruleqa list. However,
RuleQA is a bit broader than just masscheck and will hopefully expand in
the future.
So back to topic: I think the script is basically running now, but I've got
some more questions before moving on:
The CorpusCleaning article in the Wiki (the one I shall read before continuing
;-)) says one must not use data that has been collected from third-party
accounts.
Does that mean I should only feed the corpus with mails from my own personal
mail account, or is it ok if I also add mails from our sales, marketing, ...,
departments?
Those departments are relatively small, as is our company, and I'd call them a
trustworthy source since I personally 'advise' our personnel on handling their
mailboxes.
If the data is hand-sorted and trustworthy, I would say a resounding yes.
The point we are trying to make comes from something I see a lot: users
report XYZ mailing list as spam ALL the time.
These are companies that are clearly not sending unsolicited mail, yet
we see their mailing lists reported as spam.
This is because end users, in practice, treat spam filters as a "what I
want to see in my inbox" tool rather than as a filter for purely
unsolicited email.
It sounds to me like you'll explain what this means to the people
helping, which should lead to a great source of corpora!
Secondly, I'd like to know how frequently the corpus should be fed with fresh
data. Is it sufficient to do this once a week or does the nightly masscheck
rely on fresh data on a daily basis?
As often as you can, but we'll take what we can get. Week-old corpus
data is still very useful, and we use many years' worth of ham data.
Plus, spammers would just recycle their tricks if we arbitrarily decided
not to use old spam.
Small side note:
Point 11 in the NightlyMassCheck article says you need to check the ham-*.log
and spam-*.log files in ~/masscheckwork/nightly_mass_check/.
I found them in ~/masscheckwork/nightly_mass_check/masses. Is that a small
mistake in the guide?
Likely, yes. I'll edit it!
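In the meantime, a minimal Python sketch for checking both locations
mentioned above. The helper name find_masscheck_logs is hypothetical and
not part of the masscheck scripts. It only assumes that the logs match
ham-*.log and spam-*.log and live either in the working directory itself
or in its masses subdirectory.

```python
from pathlib import Path


def find_masscheck_logs(base):
    """Return ham-*.log and spam-*.log files found in base or base/masses.

    Hypothetical helper for illustration; both directory locations are
    taken from the discussion above, nothing else is assumed.
    """
    base = Path(base).expanduser()
    found = []
    for directory in (base, base / "masses"):
        if not directory.is_dir():
            continue  # skip locations that do not exist on this machine
        for pattern in ("ham-*.log", "spam-*.log"):
            found.extend(sorted(directory.glob(pattern)))
    return found


if __name__ == "__main__":
    # Print each log with its size so empty logs are easy to spot.
    for log in find_masscheck_logs("~/masscheckwork/nightly_mass_check"):
        print(f"{log}\t{log.stat().st_size} bytes")
```

If nothing is printed, neither directory contains matching logs, which
probably means the masscheck run has not written its output yet.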
Regards,
KAM