On 8/13/2012 3:28 AM, Daniel Lemke wrote:
There isn't actually a wiki article for the ruleqa list, is there?
Not yet, no.
I hope it's ok to continue posting questions regarding the mass check setup 
stuff here, otherwise please let me know so I can move the discussion to users 
or dev list.
Mass check is definitely within the purpose of the ruleqa list. However, RuleQA is a bit broader than just masscheck and will hopefully expand in the future.

So back to topic: I think the script is basically running now, but I've got 
some more questions before moving on:

The CorpusCleaning article in the wiki (the one I should read before continuing 
;-)) says that one must not use data that has been collected from third-party 
accounts.
Does that mean I should only feed the corpus with mail from my own personal 
mail account, or is it ok if I also add mail from our sales, marketing, ..., 
departments?
Those departments are relatively small, as is our company, and I'd call them a 
trustworthy source, since I personally 'advise' our personnel on handling their 
mailboxes.
If the data is hand-sorted and trustworthy, I would say a resounding yes. The point the article is trying to make concerns something I see a lot: users report legitimate mailing lists as spam ALL the time.

There are companies that are clearly not sending unsolicited mail, yet we still see their mailing lists reported as spam.

This is because end users, in practice, treat a spam filter as a "what I want to see in my inbox" tool rather than as a filter for purely unsolicited email.

It sounds to me like you'll explain what this means to the people helping, and that should lead to a great source of corpus data!
Secondly, I'd like to know how frequently the corpus should be fed with fresh 
data. Is it sufficient to do this once a week, or does the nightly masscheck 
rely on fresh data on a daily basis?
As often as you can, but we'll take what we can get. Week-old corpus data is still very useful, and we use many years' worth of ham data. Besides, spammers would just recycle their old tricks if we arbitrarily decided not to use old spam.
Small side note:
Point 11 in the NightlyMassCheck article says you need to check the ham-*.log 
and spam-*.log files in ~/masscheckwork/nightly_mass_check/.
I found them in ~/masscheckwork/nightly_mass_check/masses instead. Is that a 
small mistake in the guide?
Likely, yes. I'll edit it!
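
In the meantime, for anyone unsure where the logs landed on their own setup, here is a minimal sketch that searches the working directory and everything below it for the ham-*.log and spam-*.log files. The function name find_masscheck_logs is made up for illustration and is not part of the masscheck scripts; the only thing assumed is the directory mentioned above.

```python
from pathlib import Path


def find_masscheck_logs(base):
    """Return ham-*.log and spam-*.log files under base, including subdirectories."""
    base = Path(base).expanduser()
    if not base.is_dir():
        # Nothing to search if the working directory does not exist.
        return []
    found = []
    for pattern in ("ham-*.log", "spam-*.log"):
        # rglob also matches files directly in base, not just in subdirectories.
        found.extend(base.rglob(pattern))
    return sorted(found)


if __name__ == "__main__":
    for path in find_masscheck_logs("~/masscheckwork/nightly_mass_check"):
        print(path)
```

Searching recursively means it should find the logs whether they sit in the top-level directory or in masses.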

Regards,
KAM
