Re: Google Summer of Code

Mark Martinec Mon, 23 Mar 2009 11:48:04 -0700

> I think there may still be a meta bug in the bugzilla... worth
> checking it for ideas.


All I could find was:
  https://issues.apache.org/SpamAssassin/show_bug.cgi?id=4917
but it is empty and closed.

Some ideas can be found as enhancement requests in the bugzilla.


Here are some others that come to mind:

- 'a bugathlon': there are many bugs open, and some of these are
rather small things to fix. Some may even be just forgotten and
already fixed. It would be nice to go systematically through the
list, doing some triage, and fix the more straightforward ones.

- the M::SA::Message::Metadata::Received::parse_received_line
looks like one big ad-hoc mess of exceptions. I'd imagine that
a general (but permissive) parser of the syntax as prescribed
in RFC 2821 could cover 2/3 of the cases, leaving only the
remaining exceptions to be dealt with individually.

- there is basic IPv6 support in SA, but it seems there are
several corner cases where IPv6 addresses are not recognized or
supported. Likely (just guessing) in RBL lookups, in Received header
field parsing, some DNS lookups in plugins, querying for AAAA in
addition to A, and in .ip6.arpa for reverse queries, maybe in
spamc/spamd. It would be nice to go systematically across features,
checking or fixing their IPv6 support.

- my personal pet peeve: cleanly separating checking of a message
from score generation and from reporting. This would make it possible
(when using SA at a MTA level) to run a multi-recipient message
through checks once, then produce a per-recipient score and/or
per-recipient report individually for each recipient without having
to re-run the rules. Most rules are already compatible with this:
checking could just collect the set of rule names that fire, and
assigning and summing up scores could be done as a separate step.
Details still to be worked out: skipping rules which have a zero
score for all recipients of a message, short-circuiting, and
per-recipient bayes.
Some stats indicate that a message has 1.5 recipients on average,
which means saving about a third of the checking time (one run
instead of 1.5) almost for free when running in the MTA integration
mode, while still preserving many per-recipient features.

- dealing with arbitrary size mail messages: the rules and plugins
which need it could have access to a complete message kept on a file
(like checking DKIM signatures, processing of large attached pictures
or documents, ...), while the rest can continue to work with an
in-memory copy, but truncated to a manageable size if necessary.
The spamc could for example pass a file name to spamd (when both
are running on the same host), instead of having to feed mail contents
through a pipe/socket.
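
To illustrate the Received-parsing idea above, here is a rough sketch
in Python (not SA code). It assumes the general clause order
from/by/via/with/id/for followed by ';' and a date, and it simply
returns None for anything that does not fit, so that such lines can
fall through to hand-written exceptions. The function name and the
returned structure are made up for the example; clause keywords that
appear inside comments are not handled.

```python
import re

# Clause keywords of the general Received syntax (assumed subset).
CLAUSE = re.compile(r'\b(from|by|via|with|id|for)\s+', re.IGNORECASE)

def parse_received(value):
    """Return (clauses, date), or None if the line needs special handling."""
    # Unfold the header field and normalize whitespace.
    text = ' '.join(value.split())
    # The date-time follows the last ';', when there is one.
    body, sep, date = text.rpartition(';')
    if not sep:
        body, date = text, ''
    matches = list(CLAUSE.finditer(body))
    # A line that does not start with a known clause is an exception case.
    if not matches or matches[0].start() != 0:
        return None
    clauses = {}
    for m, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(body)
        # Keep the first occurrence of each clause keyword.
        clauses.setdefault(m.group(1).lower(), body[m.end():end].strip())
    return clauses, date.strip()

example = ('from mail.example.com (mail.example.com [192.0.2.1]) '
           'by mx.example.org with ESMTP id ABC123 '
           'for <user@example.org>; Mon, 23 Mar 2009 11:48:04 -0700')
parsed = parse_received(example)
```

A line such as "(qmail 1234 invoked by uid 0); ..." does not start
with a clause keyword, so it is rejected and left to the exceptions.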
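
For the IPv6 item, one of the pieces that tends to be IPv4-only is
building the reversed query name. A small Python sketch of what the
IPv6 case looks like (nibble-reversed, as used under .ip6.arpa); the
helper name and the 'dnsbl.example' zone are hypothetical, and I am
assuming a list zone would take the same nibble format:

```python
import ipaddress

def reverse_name(addr, zone=None):
    """Build a reversed DNS query name for an IPv4 or IPv6 address."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        # Reverse the four decimal octets.
        labels = str(ip).split('.')[::-1]
        suffix = 'in-addr.arpa'
    else:
        # Expand to 32 hex nibbles, then reverse them one per label.
        labels = list(ip.exploded.replace(':', ''))[::-1]
        suffix = 'ip6.arpa'
    return '.'.join(labels + [zone or suffix])

v4_ptr = reverse_name('192.0.2.1')
v6_ptr = reverse_name('2001:db8::1')
v4_list = reverse_name('192.0.2.1', 'dnsbl.example')
```

Without a zone argument the result should match what Python's own
ipaddress reverse_pointer attribute gives for the same address.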
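
The check-once, score-per-recipient idea can be sketched like this
(Python, purely illustrative; the toy rules, the score tables and the
function names are invented for the example and do not reflect how SA
is structured internally):

```python
def check_message(message, rules):
    """Run every rule once; return the set of rule names that fired."""
    return {name for name, test in rules.items() if test(message)}

def score_for(hits, scores):
    """Sum one recipient's scores over the rules that fired."""
    return sum(scores.get(name, 0.0) for name in hits)

# Toy rules, evaluated once per message.
RULES = {
    'TOY_SUBJ_CAPS': lambda m: m['subject'].isupper(),
    'TOY_BODY_WORD': lambda m: 'pills' in m['body'].lower(),
}

# Per-recipient score tables; a zero score disables a rule for that user.
SCORES = {
    'a@example.org': {'TOY_SUBJ_CAPS': 1.5, 'TOY_BODY_WORD': 2.0},
    'b@example.org': {'TOY_SUBJ_CAPS': 1.5, 'TOY_BODY_WORD': 0.0},
}

message = {'subject': 'BUY NOW', 'body': 'Cheap pills here'}

hits = check_message(message, RULES)          # expensive step, done once
per_recipient = {rcpt: score_for(hits, table) # cheap step, per recipient
                 for rcpt, table in SCORES.items()}
```

Only the second step depends on the recipient, so a rule would need
to be skipped up front only when its score is zero for all recipients.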

  Mark
