RE: TF wish list

Randy Brukardt Wed, 19 Nov 2003 22:12:16 -0500

Chairman said, replying to me:

> i was thinking just an address to forward spam to. if it comes through the
> server, count it good, if it hits that box, count it double bad..


I've done that since I started using TF. For now, I've been hand-analyzing
the messages that I send back to junk, but the next version of TF will
(semi-)automate that.

There's no real reason that these functions couldn't be completely
automated, but I just don't like getting junk into the
filters/dictionary/etc. YMMV.

> not really... all the other crap we are doing to filter it uses a lot
> of proc too.

The problem is the design of plug-ins for IMS. Each execution requires
reloading everything. A Bayesian filter database is going to be very large
(or is going to require sophisticated filtering somewhere) because of all of
the gibberish in current spam. Loading it all from disk each time it is used
is going to be prohibitive, and leaving it on disk is going to make it
pretty slow and possibly rather disk intensive. (There will be a lot of
lookups for the typical message.)

Your mail client filter doesn't have to deal with that, because it can load
the DB once and forget it until you close the client. (Or more likely, write
it out once in a while.)
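
To put the reload cost in concrete terms, here is a rough sketch of the two
loading patterns. It is illustration only: the class names, the JSON file
format, and the 0.5 default for unknown tokens are all assumptions made for
the example, not how IMS, SCSMFilter, or any mail client actually stores its
data.

```python
import json


class ReloadingDB:
    """Plug-in style: re-read the whole token table for every message."""

    def __init__(self, path):
        self.path = path
        self.loads = 0  # how many times the file was read

    def score_tokens(self, tokens):
        with open(self.path) as f:
            table = json.load(f)
        self.loads += 1
        # Unknown tokens get a neutral 0.5 (an assumption for this sketch).
        return [table.get(t, 0.5) for t in tokens]


class ResidentDB:
    """Client/service style: read the table once, then keep it in memory."""

    def __init__(self, path):
        self.path = path
        self.loads = 0
        self._table = None

    def score_tokens(self, tokens):
        if self._table is None:
            with open(self.path) as f:
                self._table = json.load(f)
            self.loads += 1
        return [self._table.get(t, 0.5) for t in tokens]
```

After N messages, ReloadingDB has read the file N times and ResidentDB once.
With a small table the difference hardly matters; with a very large one, the
per-message read dominates.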

The dictionary check is much more practical in this environment, because the
dictionary is much smaller, so loading the whole thing each time isn't too
expensive.

Of course, replacing SCSMFilter completely with a TF service is possible,
but that's a bigger job than I want to undertake right now. (Admittedly, it
would be a good idea from a resource usage standpoint.)

> >(I read an article last year that said that the Bayesian filter equations
> >are numerically unstable. The net effect is that such filters work more
> >because of the techniques for deciding what to include and what to ignore
> >(the "art" part of them) than the science of the probability equations.
> >That implies that they work (if they work at all) because of the skill of
> >the programmer and because of the stupidity of most spammers, not because
> >of any real edge over other types of filters.)
>
> smoke and mirror bullshit.
> go try eudora pro 6 for a week and you'll see... and that's a very simple
> version of the beyesian filter.

No, it's a fact. The problem is that the more detail you use in a Bayesian
filter, the worse it works. (The paper ran a bunch of experiments which
showed that in practice as well.) That to me says that it's factors other
than the filter algorithm itself that are making it work so well. The "very
simple filter" in Eudora Pro is probably as good as such a filter can be
written. Essentially, you have to score a message on a carefully selected
(small) set of items. The quality of the filter depends critically on how
that set is constructed, and I'd expect spammers to do much better at
evading them over time.
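
As a rough illustration of scoring on a small, selected set of items, here
is a minimal sketch. It assumes per-token spam probabilities are already
available, keeps only the tokens farthest from neutral, and combines them
with the usual naive Bayes product. The function name, the cutoff of 15, the
0.4 default for unknown tokens, and the clamping range are all assumptions
made for the example; this is not Eudora's or TF's actual method.

```python
from math import prod


def spam_score(tokens, token_probs, keep=15, unknown=0.4):
    """Combine the `keep` most extreme token probabilities into one score."""
    # One probability per distinct token; unseen tokens get a default.
    probs = [token_probs.get(t, unknown) for t in set(tokens)]
    # Clamp so that a single 0.0 or 1.0 cannot zero out both products.
    probs = [min(max(p, 0.01), 0.99) for p in probs]
    # Keep only the tokens whose probability is farthest from neutral (0.5).
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    chosen = probs[:keep]
    spam = prod(chosen)
    ham = prod(1.0 - p for p in chosen)
    # With no tokens, both products are 1 and the score is a neutral 0.5.
    return spam / (spam + ham)
```

Everything interesting is in how `token_probs` gets built and in which
tokens survive the cut, which is the "art" part; the arithmetic itself is a
few lines.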

After all, the reason that spam is so easy to catch is that most spammers are
dumb and send the same stuff over and over. Even a simple-minded filter can
get a lot of it. Only about 0.2% of the spam I receive here ends up getting
through my Trash Finder filters (augmented with SpamCop's blocklist). No
filter is likely to do much better than that. My main concern with TF going
forward is simply to make it much easier to maintain, and to improve
performance by getting stuff out of the registry.

TF would be a lousy platform for a Bayesian filter anyway. The point of the
article that I mentioned above is that the reason that a traditional
Bayesian filter works is that it takes the raw message and sticks it into a
blender. By the time TF is even thinking about filtering, it has already
decoded the message, separated it into its parts, cleaned junk out of the
headers, etc. A lot of the stuff that Bayesian filters trigger on is gone by
the time TF is ready to start filtering. It would of course be possible to
put the message back together (the TF Viewer does that, for instance), but
it still wouldn't have quite everything that the whole thing has.

In any case, the biggest appeal of a Bayesian filter is that it is
automated. But I'll never trust any fully automated filter to delete
incoming mail, and TF already traps virtually all of the spam anyway (and
SpamCop gets much of the rest). What I need is to be able to delete more
spam without having to look at it. So it makes the most sense to concentrate
on making TF easier (including possible full automation and filter sharing
if there is sufficient demand). (More pass/no_delete overrides would help,
as well, as it would allow making the other filters more aggressive.)

                Randy.

This is the discussion list for the IMS Free email server software.
  To unsubscribe send mailto:[EMAIL PROTECTED]

            Delivered by Rockliffe MailSite
           http://www.rockliffe.com/mailsite
                Rock Solid Software (tm)
