[Bug 5896] New: try out enemieslist

bugzilla-daemon Thu, 01 May 2008 04:07:37 -0700

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896


           Summary: try out enemieslist
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Rules
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]


Steven Champeon has been in touch regarding 'testing my enemieslist rDNS
patterns data against the SpamAssassin spam/ham corpus(es) to see if there's a
reason for us to collaborate.'

I think this could be very useful.

He says:

'As you may or may not know, enemieslist is my dataset of regular
expressions of rDNS naming conventions, classified by various things
like assignment type/duration (dynamic/static/generic provider-assigned,
and so forth), tech in use (cable/dialup/dsl/wireless/etc), and also by
resnet (.edu residential networks), webhost (mass virtual hosting), and
the like. More can be found here:

 http://enemieslist.com/how/use.html

The basic idea is that EL generic/dynamic/static pats are often bots;
webhost suggests higher risk of phish attacks; outmx suggests that an
outright rejection might be ill-advised, and so forth for the other
classifications. The stats differ between classifications and for PTR as
opposed to HELO; generic HELO of most types often indicates bots,
whereas dynamic/generic PTR is merely suggestive but useful in a scoring
context in my experience with the sendmail package I developed that uses
the EL data. There are currently almost 29K patterns in the dataset. I
ran a list of 100K known Storm bot IPs against it a few weeks ago,
courtesy Randy Vaughn at Baylor, and EL matched > 99.998% of those that
had rDNS. I ran the CBL against it back in late December, and got about
a 94.7% match rate against those IPs that had rDNS. It's pretty
comprehensive. All patterns are fully qualified and organized by
domain; it's not just a big ugly single regex.

I'm curious to see how incorporating EL DNSBL lookups into SpamAssassin
might be useful; we have a DNSBL mirror network (currently three hosts,
with more on the way) or I can talk about how to use it with a patched
rbldnsd if you wanted to do some local testing. It'd be really
interesting to see how the various classifications compared and how to
best score them (for both PTR and HELO string) as a module in SA. I'm
also looking to see what sort of scaling I'd need to have the DNSBLs
support if we were to introduce an SA module.'
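To make the "organized by domain" point concrete, here is a minimal sketch of
how per-domain rDNS patterns might be looked up and matched. Everything in it
is an assumption for illustration: the patterns, the example.net/example.edu
names, the PATTERNS layout, and the helper names are hypothetical and are not
taken from the enemieslist data.

```python
import re

# Hypothetical patterns for illustration only; NOT real enemieslist data.
# Each registered domain maps to a list of (regex, classification, technology).
PATTERNS = {
    "example.net": [
        (re.compile(r"^dsl-\d+-\d+-\d+-\d+\.dyn\.example\.net$"), "dynamic", "dsl"),
        (re.compile(r"^static-\d+-\d+-\d+-\d+\.example\.net$"), "static", "cable"),
    ],
    "example.edu": [
        (re.compile(r"^resnet-\d+\.example\.edu$"), "resnet", "generic"),
    ],
}


def registered_domain(hostname):
    """Naive guess at the registered domain: the last two labels.

    A real implementation would need a public-suffix list; this is a shortcut.
    """
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def classify(hostname):
    """Return (classification, technology) for the first matching pattern, else None."""
    host = hostname.lower().rstrip(".")
    # Only the patterns filed under this host's domain are tried,
    # rather than one large regex covering every provider.
    for regex, klass, tech in PATTERNS.get(registered_domain(host), []):
        if regex.match(host):
            return klass, tech
    return None
```

Keying the lookup on the domain means only a handful of fully qualified
patterns are tried per host, which is one plausible reason such matching can
be fast.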

Also, in response to a mail from me:

> We already have a rudimentary set of the ~20 most common rDNS naming schemes
> for dynamic hosts, but EL sounds a lot more exhaustive, and I suspect
> there'll be good correlation between EL rules and other rules in our
> ruleset.  It should be quite easy to figure that out.

OK, sounds good. I'm really interested in seeing what the various FP
rates would be for both the HELO and PTR for the various return values;
I'm also interested in seeing what rates are for the different
subclasses (as formed by the combination of A response and TXT response
for the same lookup, so "static/cable" or "dynamic/dsl" or
"natproxy/vpn"). Basically, I'm using these today as very blunt hammers,
and I want to make sure I have a good sense of how to better tune the
scoring. And you guys have such great stats, so I came to you :)
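As a rough sketch of the subclass idea above, the following combines an A
response and a TXT response into a label such as "static/cable" and tallies
spam and ham hits per label. The A_CODES values are invented placeholders (the
real return values are documented by enemieslist and are not reproduced here),
and it is only an assumption that the A response carries the classification
while the TXT response carries the technology.

```python
from collections import defaultdict

# Hypothetical A-record return codes; placeholders, not the real EL values.
A_CODES = {
    "127.0.0.2": "dynamic",
    "127.0.0.3": "static",
    "127.0.0.4": "natproxy",
}


def subclass(a_response, txt_response):
    """Combine the two DNS answers into a 'class/tech' label, or None if unknown."""
    klass = A_CODES.get(a_response)
    if klass is None:
        return None
    tech = txt_response.strip().lower()
    return f"{klass}/{tech}" if tech else klass


def tally(results):
    """Count hits per subclass.

    results is an iterable of (label, is_spam) pairs; the return value maps
    each label to a [spam_hits, ham_hits] list.
    """
    counts = defaultdict(lambda: [0, 0])
    for label, is_spam in results:
        counts[label][0 if is_spam else 1] += 1
    return dict(counts)
```

The ham count per label, divided by the total ham in the corpus, would give
the per-subclass FP rate being asked about.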

> So, these are generally run against the SMTP connecting host's
> rDNS, right?

Both PTR and HELO/EHLO string, yes. We've found that PTR is a good
indicator, but when the HELO string is a match for some EL pattern it's
a very reliable indicator of bot activity with a very low FP rate, so we
test both when available. Of course, this differs between the various
types, so I wouldn't assume webhost or outmx or static PTR are
necessarily bad, just indicative. But we'll see what the numbers
look like after we run some tests, I suppose :)
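One way to picture "HELO match is strong, PTR match is suggestive" in a
scoring context is a small lookup table keyed on where the match occurred. The
numbers below are invented purely to show the shape of the idea; they are not
measured values, and the SCORES table and score() helper are hypothetical
names rather than anything from SpamAssassin or enemieslist.

```python
# Hypothetical scores, chosen only for illustration; real values would come
# from running the rules against the spam/ham corpora.
SCORES = {
    ("helo", "dynamic"): 3.0,
    ("helo", "generic"): 3.0,
    ("ptr", "dynamic"): 1.0,
    ("ptr", "generic"): 1.0,
    # webhost, outmx and static are treated as indicative only, so they
    # are deliberately absent here and contribute nothing.
}


def score(ptr_class, helo_class):
    """Sum the contributions of the PTR and HELO classifications.

    Either argument may be None when there was no rDNS or no pattern match.
    """
    total = 0.0
    if ptr_class is not None:
        total += SCORES.get(("ptr", ptr_class), 0.0)
    if helo_class is not None:
        total += SCORES.get(("helo", helo_class), 0.0)
    return total
```

For example, score("dynamic", "dynamic") adds both contributions, while
score("outmx", None) returns 0.0 under these placeholder values.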

> By the way, do you mind if we conduct this conversation on a public
> Bugzilla entry?  That's generally how we do it.  Doing that in the
> open is also more likely to get useful info on how other hosts
> have found the increased load from SpamAssassin lookups, too.

No, not at all, though I definitely want to know how adding this to
SA would affect our load; and give me time to throw a few more rbldnsd
mirrors into the rotation if required. (Running lookups against the
patterns is very fast, 75K/s here on my macbook, but once you add
logging and DNS overhead it slows down considerably :-/)

So, what next? Should we look at setting up a local rbldnsd instance
to isolate testing from our production machines? Was the doc I sent
a URL for in my last email sufficient to tweak whatever SA rules
you need to test? I'm here to answer any questions you have :)



Anyway, usage details are here:  http://enemieslist.com/how/use.html -- we'd
need to add some rules to do this.  I've been meaning to do this for several
weeks(!) but things have been busy :( so here's a new ticket.


-- 
Configure bugmail: 
https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
