Re: collecting corpora

2009-08-27 Thread Mark Martinec
>  One useful factor of ham is that it's not time-sensitive; a mail that
>  was ham in 2003 would still be ham today.  So we can collect old ham
>  mail archives, or submissions of relatively old mail, if necessary.
> >>>
> >>> This may be a false assumption.  A spamvertised or spam sending
> >>> domain from 2003 could have expired and been re-registered by
> >>> a different organization.  Same for ham.  Both ham and spam
> >>> should have expiration times.  1 year would probably be good,
> >>> since spamvertised domains probably don't get renewed.
> >>
> >> yep, I was talking with a SURBLer about this last week I think.  we
> >> should probably add meta conditions to the URIBL ruleset to ensure
> >> they don't fire at all on old messages.
>
> if we had enough ham to get useful results with that limit, sure.  As
> it is, I'm not sure that's the case.

Btw, I just came across this article (from CEAS 2009):

Jose-Marcio Martins da Cruz, Gordon V. Cormack:
  Using old Spam and Ham Samples to Train Email Filters

http://www.j-chkmail.org/ceas/ceas09-gvcjm.pdf


  Mark
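
For illustration, the age cutoff discussed in the quoted thread (skip
time-sensitive checks such as URIBL lookups on old mail) could be sketched
roughly as below.  This is not SpamAssassin code: the function name
is_fresh and the MAX_AGE constant are hypothetical, and the one-year limit
is simply the figure suggested above.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Assumption: the "1 year" expiration suggested in the thread.
MAX_AGE = timedelta(days=365)


def is_fresh(raw_message, now, max_age=MAX_AGE):
    """Return True if the message's Date header is within max_age of now.

    `now` must be a timezone-aware datetime.
    """
    msg = message_from_string(raw_message)
    date_header = msg.get("Date")
    if date_header is None:
        return False  # no usable date: treat the message as too old
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # unparseable date: same conservative choice
    if sent.tzinfo is None:
        # A "-0000" zone yields a naive datetime; assume UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return now - sent <= max_age
```

A caller would run the time-sensitive rules only when is_fresh() returns
True.  Trusting the Date header is itself an assumption; a real
implementation might prefer the date of the first trusted Received header.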


[Bug 6114] SpamCop top spammers and top spamming networks

2009-08-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

--- Comment #16 from Warren Togami 2009-08-27 07:57:33 PST ---
Ultimately, isn't this just a really tiny DNSBL?  That might be more
appropriate than a rule that must be updated very frequently.

Also, how does the source of this data feel about us copying it?
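
As a rough sketch of what the DNSBL approach means in practice: a client
reverses the octets of the IPv4 address, appends the list's zone, and does
an ordinary DNS lookup; an answer means "listed", NXDOMAIN means "not
listed".  The zone name below is made up for illustration, and the helper
names are hypothetical, not part of SpamAssassin.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the DNSBL zone has an A record for this address.

    `resolve` is injectable so the logic can be exercised without network
    access.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # lookup failed (e.g. NXDOMAIN): not listed
    return True


# Hypothetical zone name, for illustration only.
print(dnsbl_query_name("192.0.2.99", "top-spammers.dnsbl.example"))
```

With a zone like this, the list can change daily on the server side
without any rule file being republished.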

-- 
Configure bugmail: 
https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug.


[Bug 6114] SpamCop top spammers and top spamming networks

2009-08-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

--- Comment #15 from Justin Mason 2009-08-27 07:44:39 PST ---
(In reply to comment #14)
> > Keep in mind that this is using data that is 57 days old (May 19, new
> > version attached) for a data set that is very time-specific.  You can see
> > this impact in the hit-rate over time graph, best illustrated by
> > KHOP_SC_TOP_CIDR8, http://tinyurl.com/ksc3wa  (that's a shot of what it
> > looks like now) - there were almost zero hams on May 19, but the hams
> > spiked up a week later and again for this week.  Who's to say that the
> > problematic entries were present at those times?  We know only that the
> > ham count was best on the day it was released.
> 
> Do we have a means to automatically update these rules on a regular basis?

not unless Adam fancies getting himself an SVN commit bit ;)



[Bug 6114] SpamCop top spammers and top spamming networks

2009-08-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

--- Comment #14 from Warren Togami 2009-08-27 06:56:11 PST ---
> Keep in mind that this is using data that is 57 days old (May 19, new version
> attached) for a data set that is very time-specific.  You can see this impact
> in the hit-rate over time graph, best illustrated by KHOP_SC_TOP_CIDR8, 
> http://tinyurl.com/ksc3wa  (that's a shot of what it looks like now) - there
> were almost zero hams on May 19, but the hams spiked up a week later and again
> for this week.  Who's to say that the problematic entries were present at
> those times?  We know only that the ham count was best on the day it was
> released.

Do we have a means to automatically update these rules on a regular basis?



fwd: False Detection

2009-08-27 Thread Justin Mason
oops, accidentally bounced this mail from a non-subscriber...

Thorsten: SpamAssassin does not operate a blacklist, so we can't remove
you from anything.  If you can send along a message modified by
SpamAssassin, which includes the reason that the recipient's version of
SpamAssassin marked it as spam, we may be able to help figure out why and
provide useful advice.  For what it's worth, I can't see any issues with
those domains when I run your mail through SpamAssassin here.

http://wiki.apache.org/spamassassin/AvoidingFpsForSenders may also be
useful.

--j.

From: "Benz, Thorsten" 
To: "secur...@spamassassin.apache.org" 
Date: Thu, 27 Aug 2009 11:16:19 +0200
Subject: False Detection

Hello,

You are detecting our Email Addresses technidata.com/technidata.de as SPAM.
We are definitely sending NO spam emails.

Could you please remove us from your blacklist?

With kind regards

Thorsten  Benz

TechniData AG
Environmental Compliance Solutions
Thorsten Benz
BITS
Dornierstr. 3
88677 Markdorf, Germany
Tel. +49 (0) 75 44 / 9 70-3 26
Fax +49 (0) 75 44 / 9 70-1 11 3 26
mailto:thorsten.b...@technidata.com
http://www.technidata.de
Events: www.technidata.de/events
_

TechniData is the leading provider for environmental compliance
_
The information contained in this message may be CONFIDENTIAL
and is intended for the addressee only. Any unauthorised use,
dissemination of the information or copying of this message is prohibited.
If you are not the addressee, please notify the sender immediately
by return e-mail and delete this message.  Thank you for your cooperation.
We do not accept any warranty concerning errors, viruses,
interception or interference.

Sitz/Office: Markdorf
Vorstand/Executive Board:
Jürgen Schwab (CEO), Thomas Wienke (COO), Dr. Thomas Wrede (CFO)
Vorsitzender des Aufsichtsrats/Chairman of Supervisory Board: Kurt Kaiser
Amtsgericht/District Court: Freiburg im Breisgau, HRB 581415
_




[Bug 6155] generate new scores for 3.3.0 release

2009-08-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #20 from Justin Mason 2009-08-27 03:52:29 PST ---
(In reply to comment #19)
> I feel like we have too little diversity in the type and number of ham
> contributors.  This rescoring would be a big improvement from our scores from
> two years ago and we definitely should do it.

yes.

> But after 3.3.0 I would like to learn how I can become more involved in order
> to revamp the score update process.
> 
> * I'd like to learn how to operate the GA.
> * I want to continue recruiting other nightly masscheck participants.  I want
> to recruit contributors of non-English languages and non-technical users. 

Great!  As long as they keep the ham out of the spam and vice versa, and we can
occasionally get in touch for eyeball-verification of odd-looking FPs, that'll
be very useful ;)

> * I am thinking about writing a toolkit (in RPM and DEB packages) that would
> make it easier for participants to join masschecks.  The current documented
> process is very unclear and confusing, and I want to clean this up as well.

It certainly is.

We've been meaning to improve this for several _years_ now, but it's never been
a high enough priority.  mass-check is very dev-oriented, and it should be
something bundled (and documented) at a similar level to the sa-compile or
sa-update scripts.

Here's the history of the earlier attempts, which ran out of steam halfway
through:

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3096
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=2853

BTW please ensure that changes in SA (which there will definitely need to be)
are submitted back upstream; IMO this functionality should be part of the core
package. ;)

> With more diversity in masscheck participants, perhaps we can do complete
> rescoring more often than 2 years.

Yes.



Re: [auto] bad sandbox rules report

2009-08-27 Thread Justin Mason
ah, noted.

--j.

On Thu, Aug 27, 2009 at 01:23, Warren Togami wrote:
> On 08/26/2009 04:31 AM, Rules Report Cron wrote:
>>
>> rulesrc/sandbox/jm/20_khop_sc_bug_6114.cf (10 rules, 10 bad):
>>
>>   KHOP_SC_CIDR16:  no hits at all
>>   KHOP_SC_CIDR24:  no hits at all
>>   KHOP_SC_CIDR8:  no hits at all
>>   KHOP_SC_TOP10:  no hits at all
>>   KHOP_SC_TOP100:  no hits at all
>>   KHOP_SC_TOP20:  no hits at all
>>   KHOP_SC_TOP200:  no hits at all
>>   KHOP_SC_TOP_CIDR16:  no hits at all
>>   KHOP_SC_TOP_CIDR24:  no hits at all
>>   KHOP_SC_TOP_CIDR8:  no hits at all
>
> khopesh in #spamassassin mentioned that these rules in the sandbox broke a
> few weeks ago when the sandbox moved.  He hasn't had time to follow up.  I
> don't know the details myself.
>
> Warren


[Bug 6114] SpamCop top spammers and top spamming networks

2009-08-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

--- Comment #13 from Justin Mason 2009-08-27 02:20:33 PST ---
Warren Togami to SpamAssassin (8 hours ago)

On 08/26/2009 04:31 AM, Rules Report Cron wrote:

rulesrc/sandbox/jm/20_khop_sc_bug_6114.cf (10 rules, 10 bad):

  KHOP_SC_CIDR16:  no hits at all
  KHOP_SC_CIDR24:  no hits at all
  KHOP_SC_CIDR8:  no hits at all
  KHOP_SC_TOP10:  no hits at all
  KHOP_SC_TOP100:  no hits at all
  KHOP_SC_TOP20:  no hits at all
  KHOP_SC_TOP200:  no hits at all
  KHOP_SC_TOP_CIDR16:  no hits at all
  KHOP_SC_TOP_CIDR24:  no hits at all
  KHOP_SC_TOP_CIDR8:  no hits at all


khopesh in #spamassassin mentioned that these rules in the sandbox broke a few
weeks ago when the sandbox moved.  He hasn't had time to follow up.  I don't
know the details myself.

Warren
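
For anyone who wants to track reports like the one quoted above over
time, a small parser might look like the sketch below.  The line formats
are inferred only from the single report shown here, so the regular
expressions are assumptions, and parse_bad_rules is a hypothetical helper
rather than part of the rules-report tooling.

```python
import re

# Assumed formats, taken from the one report quoted in this thread:
#   "<path>.cf (<n> rules, <m> bad):"  followed by  "  RULE_NAME:  reason"
FILE_RE = re.compile(r"^(?P<path>\S+\.cf) \((?P<total>\d+) rules, (?P<bad>\d+) bad\):$")
RULE_RE = re.compile(r"^(?P<rule>[A-Z0-9_]+):\s+(?P<reason>.+)$")


def parse_bad_rules(report):
    """Map each .cf path in the report to a list of (rule, reason) pairs."""
    results = {}
    current = None
    for line in report.splitlines():
        line = line.strip()
        file_match = FILE_RE.match(line)
        if file_match:
            current = results.setdefault(file_match.group("path"), [])
            continue
        rule_match = RULE_RE.match(line)
        if rule_match and current is not None:
            current.append((rule_match.group("rule"), rule_match.group("reason")))
    return results
```

Comparing the parsed output of consecutive reports would show when a
whole file such as 20_khop_sc_bug_6114.cf stops hitting, as happened
after the sandbox move.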
