Re: svn commit: r169334 - in /spamassassin/trunk: MANIFEST lib/Mail/SpamAssassin/Conf.pm lib/Mail/SpamAssassin/HTML.pm lib/Mail/SpamAssassin/PerMsgStatus.pm lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm lib/Mail/SpamAssassin/Util.pm rules/20_uri_tests.cf t/uri.t t/uri_html.t

2005-05-09 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Theo Van Dinter writes:
 Sorry to be a killjoy here.
 
 On Mon, May 09, 2005 at 03:55:07PM -, [EMAIL PROTECTED] wrote:
  +  my $redirector_patterns = $self->{conf}->{redirector_patterns};
  +  @uris = Mail::SpamAssassin::Util::uri_list_canonify($redirector_patterns, @uris);
 
 Do we consider uri_list_canonify() to be a public function?  If so, there
 needs to be some form of backward compatibility maintained.  Since there
 seems to be no POD for Util.pm at all, one could read that to mean it's
 all considered private, but we never did finish going through and changing
 the private function names so it's not clear now.

I don't think it's a particularly public API, since it's already
called by public APIs.  In other words, I wouldn't worry about
it too much.

As for the rest -- let's just check it in and hack away as Sidney
suggested. ;)

- --j.

  -  my @parsed = $scanner->get_parsed_uri_list();
  +  my @parsed = $scanner->get_uri_list();
 
 -0.7
 
 URIBL now considers the full list of URIs as having been parsed out of
 the rendered text, which messes up the priority levels somewhat.  A case can
 be made that the higher priority domains will already be on the list, but it
 does mean more work for the plugin (more URIs to go through).
 
 If we're not changing get_uri_list() (or more likely making a new
 function) to return a combined uri_detail-esque dataset, then I'd like
 to see get_parsed_uri_list() left alone (ie: let it do the canonification
 and get_uri_list() can skip doing it), and just add a call into URIBL to
 get_uri_list (we don't care about the output) to do the canonification
 of the HTML bits.
 
 Arguably, we could just have a new _canonify_uri_detail() call in
 PMS and avoid the rest of the get_uri_list() stuff, but ...  We know
 get_uri_list() is being called elsewhere anyway, so it's not a big deal IMO.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFCgCxiMJF5cimLx9ARApgKAJ99n5wSZoEJB+AI9qHEZBmd46KVkgCfaPOg
sNXOp1xY1o04TA6c3VQfhmE=
=9Jfb
-END PGP SIGNATURE-



Re: new tool for website -- versioned configuration reference

2004-09-07 Thread Justin Mason

Daniel Quinlan writes:
 [EMAIL PROTECTED] (Justin Mason) writes:
 
  I'm not entirely sure how useful this is -- on one hand, it's
  a great way to point to a single configuration setting.  OTOH,
  it's another thing that can break, so I want to make sure >1
  people think it's useful first ;)
 
 Interesting.
 
 Why not have different subdirectories for each version?  Easier and less
 apt to break.  Maybe have a version selector from a menu like CPAN.

That would be easier to do.  Probably a good idea, sth like:

/ref/3.0.x/use_auto_whitelist.html

displays 3.0.x doco

/ref/2.6x/use_auto_whitelist.html

displays 2.6x doco

/ref/use_auto_whitelist.html

displays latest stable doco, 3.0.x

and pages have the selector to display other versions, if they exist.
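
A minimal sketch of that URL scheme, for illustration only. The
resolve_ref() helper and the hard-coded version list are assumptions,
not part of any existing website code:

```perl
use strict;
use warnings;

# Hypothetical: versions for which reference pages have been generated.
my @VERSIONS      = ('3.0.x', '2.6x');
my $LATEST_STABLE = '3.0.x';

# Map a request path to a (version, option) pair.  An unversioned path
# falls back to the latest stable version; an unknown version or a
# malformed path yields an empty list.
sub resolve_ref {
  my ($path) = @_;
  if ($path =~ m{^/ref/([^/]+)/([A-Za-z0-9_]+)\.html$}) {
    my ($version, $option) = ($1, $2);
    return unless grep { $_ eq $version } @VERSIONS;
    return ($version, $option);
  }
  if ($path =~ m{^/ref/([A-Za-z0-9_]+)\.html$}) {
    return ($LATEST_STABLE, $1);
  }
  return;
}

for my $path ('/ref/2.6x/use_auto_whitelist.html',
              '/ref/use_auto_whitelist.html') {
  my ($version, $option) = resolve_ref($path);
  print "$path => $option ($version)\n";
}
```

The version selector on each page would then only need the list of
versions in which the option exists.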

 Also, does this handle the configuration documentation changing slightly
 from version to version?  They all appear to be the same.

It just uses the latest, if the =item exists in >1 version.

- --j.



Re: new tool for website -- versioned configuration reference

2004-09-07 Thread Justin Mason

Daniel Quinlan writes:
 [EMAIL PROTECTED] (Justin Mason) writes:
 
  That would be easier to do.  Probably a good idea, sth like:
  
  /ref/3.0.x/use_auto_whitelist.html
 
 I'd strongly advocate providing versioned links for the current
 documentation *first* and afterwards we can worry about option links.
 
   /ref/3.0.0/Mail::SpamAssassin::Conf.html

what, like what's already there?

http://spamassassin.apache.org/doc.html

- --j.



Re: Why does SA 3.0 require Perl 5.6.1?

2004-09-10 Thread Justin Mason

Barry Jaspan writes:
 The two main things I seem to remember are that perl 5.6.1 fixes a bunch
 of bugs from 5.6.0
 
 If anyone can remember what bugs were fixed that affect SpamAssassin, I'd 
 appreciate it.

if I recall correctly, all of them were ExtUtils::MakeMaker-related...
some crawling through the bugzilla ;) would probably show up the
details.

- --j.

 FYI: MacOS X default perl has several issues.
 
 No kidding.  :-)  Luckily, the setuid() and setgid() problem is not an 
 issue for me.
 
 Thanks,
 
 Barry
 
 Note:  This message was dictated using voice recognition software.  Please 
 excuse any errors I missed.
 
The main (and rather
 disturbing) one that I can recall is that it has absolutely no support
 for setuid() and setgid(), which we came across when working on spamd.
 
 --
 Randomly Generated Tagline:
 Well, we're safe for now.  Thank goodness we're in a bowling alley.
   - From the movie Pleasantville



Re: BAYES_* scores - non-monotonic?

2004-09-10 Thread Justin Mason

Hi Alan!

Alan Schwartz writes:
 Lately, running SA 3.0.0 with no rule or score tuning, I have been
 noticing that my false negatives tend to have BAYES_99 matched.
 
 The scores file lists the following scores for Bayes:
 
 50_scores.cf:score BAYES_00 0 0 -1.665 -2.599
 50_scores.cf:score BAYES_05 0 0 -0.925 -0.413
 50_scores.cf:score BAYES_20 0 0 -0.730 -1.951
 50_scores.cf:score BAYES_40 0 0 -0.276 -1.096
 50_scores.cf:score BAYES_50 0 0 1.567 0.001
 50_scores.cf:score BAYES_60 0 0 3.515 0.372
 50_scores.cf:score BAYES_80 0 0 3.608 2.087
 50_scores.cf:score BAYES_95 0 0 3.514 2.063
 50_scores.cf:score BAYES_99 0 0 4.070 1.886
 
 I realize that these scores come out of the automated algorithm,
 but they are not sensible on their face, and suggest a potential
 problem with the Bayesian classifier's operation or the mass
 check.
 
 Note that even without network tests, BAYES_95 < BAYES_80, BAYES_60.
 With network tests, BAYES_05 is > BAYES_20, BAYES_40, and 
 BAYES_99 < BAYES_95 < BAYES_80.
 
 It would not be unreasonable to constrain the BAYES_* scores
 so that they are always monotonic in the predicted probability of
 spam. This constraint would likely cause the scores associated with
 other rules to change slightly, but might not reduce the overall
 accuracy of SA in the mass check corpus (perhaps you're in some
 kind of local minimum?)

Yeah, we've noticed that  -- if I recall correctly, generally it
*doesn't* seem to work out better to constrain them; possibly
because the BAYES_99 spam is already hitting many other rules.
The score generation tries to minimise rule scores without
losing hits, to avoid FPs having major effects.

I think we tried locked-down BAYES scores, and found *lower* overall
accuracy figures.

I'm not certain, though...
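
For reference, a small sketch that checks the quoted scores for
monotonicity.  It assumes, as Alan's note implies, that the third and
fourth score columns are the Bayes-enabled sets without and with network
tests; non_monotonic_pairs() is a hypothetical helper, not part of the
score-generation tools:

```perl
use strict;
use warnings;

# Scores copied from the 50_scores.cf excerpt quoted above:
# [ rule, score without network tests, score with network tests ]
my @bayes = (
  [ BAYES_00 => -1.665, -2.599 ],
  [ BAYES_05 => -0.925, -0.413 ],
  [ BAYES_20 => -0.730, -1.951 ],
  [ BAYES_40 => -0.276, -1.096 ],
  [ BAYES_50 =>  1.567,  0.001 ],
  [ BAYES_60 =>  3.515,  0.372 ],
  [ BAYES_80 =>  3.608,  2.087 ],
  [ BAYES_95 =>  3.514,  2.063 ],
  [ BAYES_99 =>  4.070,  1.886 ],
);

# Return the adjacent pairs where the score drops although the Bayes
# probability bucket rises.
sub non_monotonic_pairs {
  my ($col) = @_;
  my @bad;
  for my $i (1 .. $#bayes) {
    push @bad, join(' > ', $bayes[$i-1][0], $bayes[$i][0])
      if $bayes[$i][$col] < $bayes[$i-1][$col];
  }
  return @bad;
}

print "no net:   $_\n" for non_monotonic_pairs(1);
print "with net: $_\n" for non_monotonic_pairs(2);
```

Run as is, this flags one adjacent drop without network tests (BAYES_80
to BAYES_95) and three with them.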

- --j.

 I hope this makes sense. I'd be very interested in hearing about
 other experiences with this.
 
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Alan Schwartz [EMAIL PROTECTED]
 Author/Co-author of: Managing Mailing Lists, SpamAssassin, 
 Stopping Spam, and Practical Unix & Internet Security, 3rd Ed
Published by O'Reilly Media, Inc. (http://www.oreilly.com)
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



Re: svn commit: rev 43640 - spamassassin/trunk

2004-09-10 Thread Justin Mason

[EMAIL PROTECTED] writes:
 +If you use Debian, you can get Storable from the libstorable-perl
 +package.

might be better to just have:

Debian: apt-get install libstorable-perl
Fedora: apt-get install perl-Storable

since I can see other OS equivalents getting added there...

- --j.



Re: svn commit: rev 43688 - spamassassin/trunk/spamd

2004-09-13 Thread Justin Mason

Isn't the SYNOPSIS used for spamd's usage message?   If so, I 
think removing the defaults from that message is a bad thing...

- --j.

[EMAIL PROTECTED] writes:
 Author: mss
 Date: Fri Sep 10 13:11:49 2004
 New Revision: 43688
 
 Modified:
spamassassin/trunk/spamd/spamd.raw
 Log:
 POD fix:  Truncated SYNOPSIS lines to 80 chars.  Mostly by removing
 the default values -- they are in the long description of the option
 and adding default values for *all* description would make the
 SYNOPSIS long and complex.  So everybody who wants to know the
 defaults should read the man page.  (I'm thinking about breaking the
 SYNOPSIS into several parts to make it less confusing.)
 
 Modified: spamassassin/trunk/spamd/spamd.raw
 ==
 --- spamassassin/trunk/spamd/spamd.raw(original)
 +++ spamassassin/trunk/spamd/spamd.rawFri Sep 10 13:11:49 2004
 @@ -1912,23 +1912,25 @@
  
   -c, --create-prefs Create user preferences files
   -C path, --configpath=path Path for default config files
 - --siteconfigpath=path  Path for site configs (def: 
 /etc/mail/spamassassin)
 + --siteconfigpath=path  Path for site configs
   -d, --daemonizeDaemonize
   -h, --help Print usage message.
 - -i [ipaddr], --listen-ip=ipaddrListen on the IP ipaddr (default: 
 127.0.0.1)
 - -p port, --portListen on specified port (default: 783)
 - -m num, --max-children=num Allow maximum num children (default: 5)
 - --max-conn-per-child=numMaximum connections accepted by child 
 before exiting
 + -i [ipaddr], --listen-ip=ipaddrListen on the IP ipaddr
 + -p port, --portListen on specified port
 + -m num, --max-children=num Allow maximum num children
 + --max-conn-per-child=numMaximum connections accepted by child 
 +before it is respawned
   -q, --sql-config   Enable SQL config (only useful with -x)
   -Q, --setuid-with-sql  Enable SQL config (only useful with -x,
  enables use of -H)
   --ldap-config  Enable LDAP config (only useful with -x)
   --setuid-with-ldap Enable LDAP config (only useful with -x,
  enables use of -a and -H)
 - --virtual-config-dir=dir   Enable pattern based Virtual configs 
 (needs -x)
 + --virtual-config-dir=dir   Enable pattern based Virtual configs
 +(needs -x)
   -r pidfile, --pidfile  Write the process id to pidfile
 - -s facility, --syslog=facility Specify the syslog facility (default: 
 mail)
 - --syslog-socket=type   How to connect to syslogd (default: unix)
 + -s facility, --syslog=facility Specify the syslog facility
 + --syslog-socket=type   How to connect to syslogd
   -u username, --username=username   Run as username
   -v, --vpopmail Enable vpopmail config
   -x, --nouser-configDisable user config files
 @@ -1938,7 +1940,7 @@
   -D, --debugPrint debugging messages
   -L, --localUse local tests only (no DNS)
   -P, --paranoid Die upon user errors
 - -H dir, --helper-home-dir=dir   Specify a different HOME directory, 
 path optional
 + -H [dir], --helper-home-dir[=dir]  Specify a different HOME directory
   --ssl  Run an SSL server
   --server-key keyfile   Specify an SSL keyfile
   --server-cert certfile Specify an SSL certificate



class renaming

2004-09-26 Thread Justin Mason
So, thinking about the class cleanup we've been wanting to do for a
while, I think the top items on the list (my list at least) are:

  - rename the Mail::SpamAssassin::PerMsgStatus class

  - break it up into multiple, smaller classes

So here's what I propose for the first one.


Rename the Mail::SpamAssassin::PerMsgStatus class
-------------------------------------------------

Initially its purpose was to be a per-message status object, describing the
results of a scan of one message -- in other words, that message's spam
status: is it spam or not?  I think we all now agree the name is
not so hot ;)

Its purpose has eventually turned out to be:

  - (public) methods that actually cause the scan to happen
  - (public) the results of a scan operation
  - (public) message rewriting functionality

  - (internally, plugins) state for a scan operation; plugins and our code
can store state on this object for the duration of the scan
  - (plugins) a set of APIs to access aspects of the message being
scanned, and plugin support APIs

  - (internally) methods that implement Eval tests
  - (internally) methods that control how tests are run,
their ordering etc.
  - (internally) methods that implement the DNS event-driven algorithm
  - (internally) methods that perform auto-learning
  - (internally) methods that compile tests into perl bytecode at runtime
  - (internally) methods that parse aspects of the message
  - (internally) the tests compiled as perl bytecode


So in my opinion, in cases like this where there are lots of internal and
external APIs, it's more sensible to name the class after what its
external APIs do.   (In fact, most OO design would indicate that this
means you need to refactor out into >1 class.  I'm getting to that ;)

So, I think Mail::SpamAssassin::Scan is a better name -- the object
returned from Mail::SpamAssassin::check() holds the results of a scan of a
single message.

(Scanner is another poss, but I think Scan is better because we aren't
returning the object that *did* the scan, we're returning the *results* of
the scan.)


The next thing is backwards compatibility.   We can only do this if we
don't break third-party code.   We *can* rename *this* class without
breaking backwards compatibility, thankfully.  Our requirements here are:

- 1. plugins and third-party perl code will very likely contain
  "use Mail::SpamAssassin::PerMsgStatus;" lines, so having some kind of
  usable file there is a MUST.

- 2. there are possibly locations in third-party code where a
  Mail::SpamAssassin::PerMsgStatus object is created other than through
  the Mail::SpamAssassin::check() API, so being able to support that is a
  SHOULD.   
  
- 3. However, the majority of callers should not be creating PerMsgStatus
  objects directly, or depending in any way on the object being of
  that specific type.  (hooray: perl's not strongly typed! ;)

Here's how I propose to do that:

  - rename the current Mail::SpamAssassin::PerMsgStatus class to
Mail::SpamAssassin::Scan

  - create a Facade Mail::SpamAssassin::PerMsgStatus class that is a
sub-class of ::Scan, with no additional methods or data.  In other
words, all method calls and member var accesses will fall through into
::Scan.

  - If we deprecate any 3.0.0 APIs in the 3.1.0 cycle, we can move
their backwards compatibility methods into that facade class,
because 3.1.0 code will be Scan-native.

  - keep the facade object around for a while, at least until the next
major cycle, because it's super-cheap; we won't even have a "use" line
for it in our code, so it'll take up roughly 200 bytes on disk and
that's it.
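
The facade idea can be sketched as below.  The two packages here are
stand-ins defined in a single file purely for illustration; the new()
arguments and the body of is_spam are assumptions, not the real
PerMsgStatus code:

```perl
use strict;
use warnings;

# Stand-in for the renamed class; the real one would hold all of the
# code currently in PerMsgStatus.
package Mail::SpamAssassin::Scan;

sub new {
  my ($class, %args) = @_;
  return bless { %args }, $class;
}

sub is_spam {
  my ($self) = @_;
  return $self->{score} >= $self->{required_score};
}

# Backwards-compatibility facade: an empty subclass with no methods or
# data of its own, so every call falls through to ::Scan.
package Mail::SpamAssassin::PerMsgStatus;

our @ISA = ('Mail::SpamAssassin::Scan');

package main;

# Old-style third-party code keeps working unchanged.
my $old = Mail::SpamAssassin::PerMsgStatus->new(
  score => 7.2, required_score => 5);
print ref($old), " is_spam: ", ($old->is_spam ? "yes" : "no"), "\n";
```

In a real tree the facade would live in its own PerMsgStatus.pm file, so
that existing "use" lines still find something to load.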

Sound useful?

in my opinion this is definitely useful in the 3.1.0 tree.




Break it up into multiple, smaller classes
------------------------------------------

OK, part the second.  In my opinion, this is also a very good idea -- as
the XP guys say, PerMsgStatus has a "bad code smell" -- it's a big file
with lots of totally different functionality mushed into one class.  In
fact, it even loads methods from *multiple* files, which is totally
nasty. ;)

Here are some more details about what APIs are on the ::PerMsgStatus object
(or the ::Scan object as it may be renamed):
 
  - (public) methods that actually cause the scan to happen

check

This should be left as a public API, but its code moved to a new
class.   See "(internally) methods that control how tests are run,
their ordering etc." below.

  - (public) the results of a scan operation

is_spam
get_names_of_tests_hit / get_names_of_subtests_hit
get_score / get_hits
get_required_score / get_required_hits
get_autolearn_status / get_report
get_content_preview
finish

These are the main things that the ::Scan object does, so they stay.

  - (public) message rewriting functionality

rewrite_mail

move code into another class; leave this public API on the ::Scan
object 

Re: [Bug 3825] Unescaped '#' in rawbody causes havoc

2004-09-28 Thread Justin Mason

Daniel Quinlan writes:
 I think it would be better if we did not allow end-of-line comments and
 required all comments to match:
 
   /^\s*#/
 
 Then comments don't need to be escaped.  I think that would involve less
 surprise and also solves the problem.  I don't think this is purely a
 documentation problem.

That would be a major change in how our configuration files are parsed,
breaking a documented (although not particularly clearly) convention
that's been there since the project began.   It's also inconsistent with
the convention for this configuration file format.

the escaped-hash thing works fine (and I've used it myself at times),
and just needs to be documented.

I'm not keen on that at all: -1

- --j.



Re: [Bug 3825] Unescaped '#' in rawbody causes havoc

2004-09-28 Thread Justin Mason

Daniel Quinlan writes:
 Justin Mason [EMAIL PROTECTED] writes:
 
  That would be a major change in how our configuration files are parsed,
  breaking a documented (although not particularly clearly) convention
  that's been there since the project began.   It's also inconsistent with
  the convention for this configuration file format.
 
 It would be better if our parser detected invalid lines rather than
 outputting perl errors due to us parsing garbage.  That's my main
 concern.  I am actually fine with requiring # to be escaped.  My main
 concern is the non-clarity of the error statements, documentation does
 not fix that.

ah, gotcha.   sorry, I'm a bit tired today so comprehension's not quite
at full speed.  (just back from Toorcon.)

 (Side note: although not a requirement for this, getting rid of EOL
 comments would make this easier if it was coupled with a requirement
 that # be escaped.)

y'see, I think that's the red herring -- there are many other ways
to screw up the syntax of rules, e.g.

  rawbody HAS_RED_BODY_BG  /body bgcolor=[']/f/i

would similarly produce horrible perlish syntax errors, and there's
no hashes involved there at all.

BTW if you can wait a little bit, I have a patch from McAfee's tree that
does this nicely if I recall correctly -- it catches compile-time errors
in the rules and outputs a decent error message warning of a syntax error
in that rule, by name. ;)
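
This is not the McAfee patch itself, just a sketch of the general idea:
compile each pattern inside an eval and report failures against the
rule's name.  The %rules table and compile_rules() helper are
hypothetical, and the real rule compiler works differently:

```perl
use strict;
use warnings;

# Hypothetical rule table: rule name => pattern source text.
my %rules = (
  GOOD_RULE => 'body\s+bgcolor',
  BAD_RULE  => 'body bgcolor=[',    # unterminated character class
);

# Compile each pattern inside eval, so that a syntax error is reported
# against the rule's name instead of surfacing as a raw perl error.
sub compile_rules {
  my ($rules) = @_;
  my (%compiled, @errors);
  for my $name (sort keys %$rules) {
    my $pat = $rules->{$name};
    my $re  = eval { qr/$pat/ };
    if (defined $re) {
      $compiled{$name} = $re;
    } else {
      chomp(my $why = $@);
      push @errors, "rule $name: syntax error in pattern: $why";
    }
  }
  return (\%compiled, \@errors);
}

my ($compiled, $errors) = compile_rules(\%rules);
print "$_\n" for @$errors;
```

Rules that fail to compile are simply left out of the compiled set, so
one broken rule does not take the rest down with it.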

- --j.



Re: class renaming

2004-09-28 Thread Justin Mason

Malte S. Stretz writes:
 On Sunday 26 September 2004 10:42 CET Daniel Quinlan wrote:
 [...]
  Do we really need to do this now?  This is not going to significantly
  help performance, accuracy, or memory usage, is it?
 
 As much as I'd love to have this thing renamed, why didn't we do this 
 *before* we released 3.0?  Or to quote you from bug 3668: "there's *no way* 
 I'd be happy making any of these changes before 4.0.0" ;)  (Actually, the 
 "no way" is exaggerated but I don't like the idea at this point).

Well, that's a different kettle of fish -- bug 3668 is changing
configuration file paths, this is changing a class name, and ensuring
that backwards compatibility is preserved for that change.

- --j.



Re: [Bug 3821] scores are overoptimized for training set

2004-09-28 Thread Justin Mason

Loren Wilton writes:
  BTW, this is the rule reliability tflag idea again; basically provide a
  way to hint that this rule is reliable, and this rule should not be
  considered reliable -- no matter what their hit-rates in mass-checks were.
 
  I agree it may have good effects as a hint to the Perceptron, so it may
  now be time to do this.  what d'you think, Henry?
 
 Note that Bob M. has a hint comment of his own that gives several levels of
 hint, not just a binary value.  He uses this for his own scoring tool with
 good results.
 
 I think that the idea of a multi-level hint is a good one and should be
 considered.  I don't know if that concept will fit in tflags.  If not,
 perhaps some other (scorehint) could be considered.

yeah -- definitely -- I was thinking that, although I didn't mention
it. ;)   imo a new config command (I was thinking "reliability"
or similar) would be good.

- --j.



Re: class renaming

2004-09-28 Thread Justin Mason

Daniel Quinlan writes:
 [EMAIL PROTECTED] (Justin Mason) writes:
 
- (public) message rewriting functionality
  
  rewrite_mail
  
  move code into another class; leave this public API on the ::Scan
  object which calls into that class.
  Proposed class name: Mail::SpamAssassin::Scan::Rewriter?
 
 Rewriting should not be part of the Scan object.

yes, actually, you're right.

 I'd propose that
 rewriting be part of a Mail::SpamAssassin::Format class.

any particular reason for that name?

- (internally) methods that implement Eval tests
  
  [entire contents of EvalTests.pm which
  do this horrible hack of putting themselves into
  the PerMsgStatus namespace]
  
  move code into another namespace.  Eval tests use the
  PerMsgStatus object as $self, and since they're just
  functions, not objects themselves, that doesn't need to
  change -- they'd still get the ::Scan object as their
  first arg.
  
  Proposed namespace: Mail::SpamAssassin::Test::Eval?
 
 Just Mail::SpamAssassin::Tests.pm ?

yeah, actually, why not.

- (internally) methods that control how tests are run,
  their ordering etc.
  
  [parts of check]
  [parts of do_head_tests / etc. ]
  
  Definitely move.
  Proposed class: Mail::SpamAssassin::TestRunner?
  RunTests? Runner? Scanner?
 
 Shouldn't this just be part of Scan?

This is the thing -- as Theo said, by moving to a new class,
we can provide the ability to switch out implementations without
having to change the class of the Scan object (ie. what
gets returned to the user).  

Basically the key idea is that we're breaking it up by *what
it does* and what its semantics are:

- Scan: object returned to user
- [this class]: object that contains the algorithm and code
  to run whatever subset of the tests in whatever order

And the idea is that all this logic shouldn't be in the
simple results object we give back to the user.

- (internally) methods that implement the DNS event-driven algorithm
  
  [entire contents of Dns.pm which do this horrible hack of putting
  themselves into the PerMsgStatus namespace]
  
  into Mail::SpamAssassin::TestRunner as above?
 
 I'd say this belongs in the EvalTests module, wherever it ends up.

Hmm. not sure about that.

the EvalTests module can be kept for just the eval tests that
are defined; this is plumbing.   In fact, it's more similar to
the TestRunner chunk imo.

There's about 650 lines of code there, too, which is a lot
(for perl).

- (internally) methods that perform auto-learning
  
  learn
  
  Proposed class: Mail::SpamAssassin::AutoLearn?   (I don't think
  mushing into PerMsgLearner, Bayes, or Mail::SpamAssassin makes
  sense, so a new class would be better.)
 
 I think there's too much breaking up of stuff here.  Bayes would be
 fine.

yeah, OK, Bayes is probably good enough alright.

 Do we really need to do this now?  This is not going to significantly
 help performance, accuracy, or memory usage, is it?
 What's the effect on stability?
 How does this affect our release cycle?

ok, ok.  it's not much use to any of those -- but the "all mushed into one
class"-ness of PerMsgStatus is really driving me nuts ;)  It's far from
good OO design.  And bad code smell is an indicator that there are
inefficiencies there.

I do have an idea for improving performance -- separate mail to follow.

- --j.



speedup for PerMsgStatus

2004-09-28 Thread Justin Mason
OK, here's a trick I was thinking about. Currently we have these massive
hashtable refs:

$pms->{conf}->{rbl_evals}
              {head_tests}
              {body_tests}

              {scoreset}->[0,1,2,3]
              {tflags}

Each of those is keyed by the name of the rule.

Now the thing is, this is really wasteful speed-wise (not really
RAM-wise) -- just performing all those hash lookups!   When a message is
scanned, each of the _evals and _tests hashes is iterated over,
extracting the rule name and rule text for every entry. In reality, we
only need the rule text at this point, *not* the name.

  - We have about 700 rules

  - 99% of the time, any given rule will NOT fire, so we should speed up:

    foreach my $rulepat (@{all_rules_of_given_type}) {
      ...
      if ($whatever =~ /$rulepat/) {
        # hit!
      }
      # otherwise miss!
    }

we should speed up the 'foreach', the rule-text fetch, and the 'miss'.
note that we don't need to know the rule name until the rule gives
us a hit!

so I'm thinking that we should replace parts of this with arrays, using
integer indexes, instead of hashes with string indexes.

Array lookups are quite a bit faster than hash lookups.

Each array would have RAM usage of -- guessing -- (size_of_whats_stored +
9100) bytes, since arrays in perl have an overhead of about 13 bytes per
entry.  (this is about the same as hashes iirc, poss a bit less.  not sure
if there'd be RAM savings there, since perl hash keys are refcounted
shared strings iirc.)

we can optimize for the rules that are loaded from the system-wide config,
because (a) allow_user_rules is almost always off, and (b) even if it's
on, I'd guess that most times 99% of the rules that a scan runs would be
system-wide rules anyway.   (we can deal with user-rules by just pushing
them onto the rules array when they're defined, same as the system rules
are done.)
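
A rough sketch of the array-based layout, assuming a hypothetical
%body_tests hash and scan_body() helper; the real configuration
structures and rule compiler are more involved than this:

```perl
use strict;
use warnings;

# Hypothetical rule set, keyed by rule name as the config hashes are now.
my %body_tests = (
  HAS_VIAGRA   => qr/viagra/i,
  HAS_FREE     => qr/\bfree\b/i,
  HAS_MORTGAGE => qr/mortgage/i,
);

# Build two parallel arrays once, at config-load time.  The same integer
# index refers to a rule's pattern and to its name.
my (@rule_pats, @rule_names);
for my $name (sort keys %body_tests) {
  push @rule_names, $name;
  push @rule_pats,  $body_tests{$name};
}

# Scan using integer indexes only; the rule name is fetched only on a hit.
sub scan_body {
  my ($text) = @_;
  my @hits;
  for my $i (0 .. $#rule_pats) {
    push @hits, $rule_names[$i] if $text =~ $rule_pats[$i];
  }
  return @hits;
}

print join(',', scan_body("Get a FREE mortgage quote")), "\n";
# prints HAS_FREE,HAS_MORTGAGE
```

User rules would just be pushed onto the same two arrays when they are
defined.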

--j.


Re: speedup for PerMsgStatus

2004-09-28 Thread Justin Mason

Loren Wilton writes:
   I have no idea how painful linked lists are in Perl (or if they even
 exist).
 
  Why are you commenting then???
 
 Because they are very useful, as I pointed out.
 
  They don't exist as a native data structure.  Arrays are fast, painless,
  and dynamically sized.
 
 They don't exist as a native data structure in C++ either.  But they get a
 lot of use. Even when template classes exist to do reasonably fast and
 reasonably painless dynamic arrays.  For certain things (like collections of
 objects that can get reordered frequently) they are generally more efficient
 than dynamic arrays.
 
 If there is an SA coding requirement for only using native data structures,
 then forget lists.  If no such requirement exists and there is an interest
 in optimizing performance, then they should be a tool to be considered.

Unfortunately, perl speed optimisation doesn't work like that.
The reason is that perl native data structures (arrays, hashes, strings,
numeric SVs, etc.) can be looked up in one perl OP, but a user-defined
data structure cannot.

The OP is the lowest level command in the perl VM, equivalent to an
assembly opcode, and as such is very very fast -- since the innards of an
OP is pure C.   That's why regexp matching in perl is as fast as it
is in C -- because a regexp match is compiled to a single OP.

(Perl's not like Java in that respect.  Perl's vm has quite high-level
opcodes, whereas java's is more like real assembly and more low-level.
that's why perl is faster than java ;)

Unfortunately when reading fields in a perl data structure like a hash or
array, and traversing reference chains, each variable access, and
ref dereference, is an individual OP.

So the upshot is that using a native perl data type will always be
faster than defining a new non-native data type structure in perl.
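
To make the comparison concrete, here is a small sketch that sums the
same values held in a native array and in a hand-rolled linked list built
from hash refs.  sum_array(), make_list() and sum_list() are hypothetical
names, and the sketch measures nothing; the core Benchmark module could
be used for timing:

```perl
use strict;
use warnings;

# Native array: one loop over a built-in type.
sub sum_array {
  my $total = 0;
  $total += $_ for @_;
  return $total;
}

# Hand-rolled singly linked list built from hash refs.
sub make_list {
  my $head;
  $head = { 'value' => $_, 'next' => $head } for reverse @_;
  return $head;
}

# Every step performs separate hash lookups and reference dereferences
# in perl code, on top of the loop itself.
sub sum_list {
  my ($node) = @_;
  my $total = 0;
  while ($node) {
    $total += $node->{'value'};
    $node = $node->{'next'};
  }
  return $total;
}

my @values = (1 .. 5);
print sum_array(@values), " ", sum_list(make_list(@values)), "\n";
```

Both give the same answer; the difference is only in how much work is
done in perl-level code per element.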

cf. http://www.ccl4.org/~nick/P/Fast_Enough/#ops_are_bad,_m%27kay
for more details...   in fact, I'm even considering looking into some
use of pack() here for the very reasons noted here ;)

(ps.  I'm sure if I got any of that wrong Matt will correct me ;)

- --j.



Re: class renaming

2004-09-29 Thread Justin Mason

   As much as I'd love to have this thing renamed, why didn't we do this
   *before* we released 3.0?  Or to quote you from bug 3668: "there's *no
   way* I'd be happy making any of these changes before 4.0.0" ;) 
   (Actually, the "no way" is exaggerated but I don't like the idea at
   this point).
 
  Well, that's a different kettle of fish -- bug 3668 is changing
  configuration file paths, this is changing a class name, and ensuring
  that backwards compatibility is preserved for that change.
 
 That other bug was also about changing something newly introduced where we 
 wouldn't have to watch out for backwards compatibility :)
 
 Whatever, what I wanted to say is that I'm not opposed to the idea itself 
 and especially if it has any speed and memory advantages I'm all for it.  
 I'm just afraid that such a major change at this early point might break at 
 some unexpected place as much as we try to stay backwards-compatible.

Yeah, I think at this point we have 3 devs saying -1, so I don't
think it's going to happen anyway ;)

- --j.



Re: svn commit: rev 47510 - spamassassin/trunk

2004-09-30 Thread Justin Mason

Michael Parker writes:
 On Wed, Sep 29, 2004 at 10:21:06PM -, [EMAIL PROTECTED] wrote:
  +- MIMEDefang: version 2.42 or later.
 
 FWIW, I completely disagree with doing this.  A) It will give the
 impression that we support these programs (I assume there will
 eventually be more), B) How are we verifying that the version listed
 actually works? C) Is someone going to test every single release
 against each program we have listed to make sure the information is
 still valid? D) What criteria are we using to decide which programs
 get listed?

  (A) well, we *do* to a degree [*]
  (B) what users/devs of those tools report on the list
  (C) no
  (D) the volume of traffic from people asking these questions

[*]: SpamAssassin is NOT just a mail filter.  It's also a suite of perl
modules to perform spam identification inside other mail filters. amavisd,
MIMEDefang et al are therefore supported products into which SpamAssassin
can be plugged.  Therefore we have to consider what documentation will
help people who use those apps in using SpamAssassin.

Having said all that, I'd be +1 on taking that out of UPGRADE, replacing
with a pointer to a wiki page which contains that info.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBWz4wQTcbUG5Y7woRAoq6AKC5uxOr8o6AjxcLZovVxZSPnsUcKgCfcobU
XzC6ZAT0rSshWXef5lIjlow=
=r7+0
-END PGP SIGNATURE-



Re: svn commit: rev 47516 - spamassassin/trunk

2004-09-30 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Malte S. Stretz writes:
 Why I sort that file now and then is because it makes it much easier to see 
 if a file is already in there or remove one which is gone.  Keeping the 
 MANIFEST up-to-date is already a PITA and an unsorted file makes it even 
 worse (ok, there are grep and friends but I think it's faster to scan the 
 file with your eyes instead of calling some command).

make distcheck works for me ;)
make disttest is also useful -- if a file is missing, it should
cause a test to fail anyway.
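
For reference, the MANIFEST-vs-disk comparison can also be run straight from
Perl with ExtUtils::Manifest (as far as I know, MakeMaker's distcheck target
leans on the same module). A rough sketch, assuming the usual
one-path-per-line MANIFEST; the helper name and the demo tree are made up:

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);
use ExtUtils::Manifest qw(manicheck);

# Return the MANIFEST entries that do not exist on disk under $dir.
# (missing_from_disk is a hypothetical helper, not part of any SA tooling.)
sub missing_from_disk {
  my ($dir) = @_;
  my $old = getcwd();
  chdir $dir or die "cannot chdir to $dir: $!";
  local $ExtUtils::Manifest::Quiet = 1;   # suppress the per-file STDERR notes
  my @missing = manicheck();              # reads ./MANIFEST
  chdir $old or die "cannot chdir back to $old: $!";
  return @missing;
}

# Demo tree: MANIFEST lists three files, lib/Foo.pm is not actually there.
my $dir = tempdir(CLEANUP => 1);
for my $f (['MANIFEST', "MANIFEST\nlib/Foo.pm\nREADME\n"], ['README', "hi\n"]) {
  open(my $fh, '>', "$dir/$f->[0]") or die "cannot write $f->[0]: $!";
  print $fh $f->[1];
  close($fh);
}
print "missing: @{[ missing_from_disk($dir) ]}\n";
```

This only covers the "listed but gone" direction; filecheck() from the same
module reports files on disk that are not listed.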

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBXDeIQTcbUG5Y7woRAjWCAJ4oFDL80ZRaNoLEeVUjEOpNCU4CRACfaSCD
NvF+wmqPaQ7UfrSvdIT7Lg8=
=n0Qq
-END PGP SIGNATURE-



Re: svn commit: rev 47516 - spamassassin/trunk

2004-09-30 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Malte S. Stretz writes:
 On Thursday 30 September 2004 18:42 CET Justin Mason wrote:
  Malte S. Stretz writes:
   Why I sort that file now and then is because it makes it much easier to
   see if a file is already in there or remove one which is gone.  Keeping
   the MANIFEST up-to-date is already a PITA and an unsorted file makes it
   even worse (ok, there are grep and friends but I think it's faster to
   scan the file with your eyes instead of calling some command).
 
  make distcheck works for me ;)
  make disttest is also useful -- if a file is missing, it should
  cause a test to fail anyway.
 
 Yeah, they are useful but do you call them after (or better: before) each 
 commit?  I do so before each bigger change but for small things I often 
 simply forget it (or avoid it because it can take ages).

before every commit where you've added or removed files.  no question of
that ;)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBXD9iQTcbUG5Y7woRAu5fAKCtEf030gQrTrtfFXtXui8uxxeXLQCg4Wx8
8e3RNo6Qxmg6U/+K2rcOcpQ=
=ZFBz
-END PGP SIGNATURE-



Re: [Bug 3848] SA 3.0 time outs with amavis+razor

2004-09-30 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


actually, you're right on both; I just checked with perl -e in perl 5.8.4.
I must have been thinking of java instead of perl ;)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBXEEhQTcbUG5Y7woRApMlAJ4ykLqTSFEDQAwqRAlyLO1wP/q2lACgp9zn
OMvd703Ss/p7/n3lSrbgRz8=
=wo5U
-END PGP SIGNATURE-



Sequence analysis/bioinformatics

2004-09-30 Thread Justin Mason
A very interesting paper at Toorcon -- the use of bioinformatics
techniques to perform black-box protocol reverse-engineering.

Again, this is likely to be useful for automated discovery of antispam
regexp rules...  worth a read:

http://www.baselineresearch.net/PI/PI-Toorcon.pdf

--j.


Re: svn commit: rev 51805 - spamassassin/trunk

2004-10-02 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


[EMAIL PROTECTED] writes:
 Author: mss
 Date: Sat Oct  2 08:29:31 2004
 New Revision: 51805
 
 Modified:
spamassassin/trunk/Makefile.PL
 Log:
 Just for fun...

what does this do?   could we get some more descriptive commit messages,
and possibly some discussion before the top-level Makefile.PL is changed
like this?

also, how does make manifest update the manifest?  The whole idea
of that file is that it is *manually* maintained, not automatically,
to avoid accidental inclusion of built files.

- --j.

 Modified: spamassassin/trunk/Makefile.PL
 ==
 --- spamassassin/trunk/Makefile.PL (original)
 +++ spamassassin/trunk/Makefile.PL Sat Oct  2 08:29:31 2004
 @@ -198,7 +198,10 @@
      'dist' => {
          COMPRESS => 'gzip -9f',
          SUFFIX => 'gz',
 -        DIST_DEFAULT => 'tardist'
 +        DIST_DEFAULT => 'tardist',
 +
 +        CI => 'svn commit',
 +        RCS_LABEL => 'true',
      },
  
      'clean' => { FILES => join(' ' =>
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBXvPoQTcbUG5Y7woRAimiAJsFP/LZrYOOUsHTj4Df4tGGnwUu6QCfVQHU
9tbF+n0sjUId/8UkeHxcUcQ=
=Yz6h
-END PGP SIGNATURE-



Re: svn commit: rev 53755 - spamassassin/trunk/lib/Mail/SpamAssassin/Plugin

2004-10-04 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


BTW, note that plugins *should* be able to push their own entries
onto the $conf->{registered_commands} list.   That is, in my opinion,
much cleaner than the current parse_config() API, and may be worthwhile
as a way for future plugins to do configuration.

May need a little work, though ;)
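
To make the idea concrete, here is a toy, standalone sketch; it does not use
the real Conf or Parser classes. The entry layout (setting/default/code) and
both helper names are assumptions for illustration, not the actual API:

```perl
use strict;
use warnings;

# Toy stand-in for the real Conf object: just a hash with a command table.
my $conf = { registered_commands => [] };

# What a plugin might do at load time: push its own settings onto the table.
# The entry layout (setting/default/code) is assumed for illustration only.
sub register_plugin_commands {
  my ($conf) = @_;
  push @{ $conf->{registered_commands} }, {
    setting => 'use_hashcash',
    default => 1,
    code    => sub {
      my ($conf, $key, $value) = @_;
      return 0 unless defined $value && $value =~ /^[01]$/;
      $conf->{$key} = $value + 0;
      return 1;
    },
  };
}

# Toy parser loop: find the entry for a key and let it handle the value.
sub parse_line {
  my ($conf, $line) = @_;
  my ($key, $value) = split ' ', $line, 2;
  for my $cmd (@{ $conf->{registered_commands} }) {
    next unless $cmd->{setting} eq $key;
    return $cmd->{code}->($conf, $key, $value);
  }
  return 0;   # nobody claimed this setting
}

register_plugin_commands($conf);
print parse_line($conf, 'use_hashcash 0') ? "ok\n" : "rejected\n";
```

The point of the table approach is that the plugin declares what it handles
once, and the parser owns the dispatch and the error reporting.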

- --j.

[EMAIL PROTECTED] writes:
 Author: felicity
 Date: Mon Oct  4 15:16:21 2004
 New Revision: 53755
 
 Modified:
spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Hashcash.pm
spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Razor2.pm
 Log:
 the hashcash and razor2 plugins use the standard parser functions to
 set values from the configuration.  however since there's no way to
 deal with the errors in a standard manner right now (see bug 3869),
 set a standard-ish function in the plugin itself to deal with issues.
  basically the same code as the parser itself.
 
 Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Hashcash.pm
 ==
 --- spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Hashcash.pm (original)
 +++ spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Hashcash.pm Mon Oct  4 15:16:21 2004
 @@ -68,6 +68,7 @@
    my $conf = $opts->{conf};
    my $key = $opts->{key};
    my $value = $opts->{value};
 +  my $line = $opts->{line};
  
  =over 4
  
 @@ -78,7 +79,11 @@
  =cut
  
    if ( $key eq 'use_hashcash' ) {
 -    $conf->{use_hashcash} = $value+0; return 1;
 +    $self->handle_parser_error($opts,
 +      Mail::SpamAssassin::Conf::Parser::set_numeric_value($conf, $key, $value, $line)
 +    );
 +    $self->inhibit_further_callbacks();
 +    return 1;
    }
  
  =item hashcash_accept [EMAIL PROTECTED] ...
 @@ -100,7 +105,9 @@
  =cut
  
    if ( $key eq 'hashcash_accept' ) {
 -    $conf->add_to_addrlist ('hashcash_accept', split (/\s+/, $value)); return 1;
 +    $conf->add_to_addrlist ('hashcash_accept', split (/\s+/, $value));
 +    $self->inhibit_further_callbacks();
 +    return 1;
    }
  
  =item hashcash_doublespend_path /path/to/file   (default: ~/.spamassassin/hashcash_seen)
 @@ -116,7 +123,11 @@
  =cut
  
    if ( $key eq 'hashcash_doublespend_path' ) {
 -    $conf->{hashcash_doublespend_path} = $value; return 1;
 +    $self->handle_parser_error($opts,
 +      Mail::SpamAssassin::Conf::Parser::set_string_value($conf, $key, $value, $line)
 +    );
 +    $self->inhibit_further_callbacks();
 +    return 1;
    }
  
  =item hashcash_doublespend_file_mode   (default: 0700)
 @@ -130,11 +141,47 @@
  =cut
  
    if ( $key eq 'hashcash_doublespend_file_mode' ) {
 -    $conf->{hashcash_doublespend_file_mode} = $value+0; return 1;
 +    $self->handle_parser_error($opts,
 +      Mail::SpamAssassin::Conf::Parser::set_numeric_value($conf, $key, $value, $line)
 +    );
 +    $self->inhibit_further_callbacks();
 +    return 1;
    }
  
    return 0;
  }
 +
 +sub handle_parser_error {
 +  my($self, $opts, $ret_value) = @_;
 +
 +  my $conf = $opts->{conf};
 +  my $key = $opts->{key};
 +  my $value = $opts->{value};
 +  my $line = $opts->{line};
 +
 +  my $msg = '';
 +
 +  if ($ret_value && $ret_value eq $Mail::SpamAssassin::Conf::INVALID_VALUE) {
 +    $msg = "config: SpamAssassin failed to parse line, ".
 +           "\"$value\" is not valid for \"$key\", ".
 +           "skipping: $line";
 +  }
 +  elsif ($ret_value && $ret_value eq $Mail::SpamAssassin::Conf::MISSING_REQUIRED_VALUE) {
 +    $msg = "config: SpamAssassin failed to parse line, ".
 +           "no value provided for \"$key\", ".
 +           "skipping: $line";
 +  }
 +
 +  return unless $msg;
 +
 +  if ($conf->{lint_rules}) {
 +    warn $msg."\n";
 +  } else {
 +    dbg($msg);
 +  }
 +  $conf->{errors}++;
 +  return;
 +}
  
  ###
  
 
 Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Razor2.pm
 ==
 --- spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Razor2.pm (original)
 +++ spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Razor2.pm Mon Oct  4 15:16:21 2004
 @@ -87,7 +87,9 @@
  =cut
  
    if ($key eq 'razor_timeout') {
 -    Mail::SpamAssassin::Conf::Parser::set_numeric_value($conf, $key, $value, $line);
 +    $self->handle_parser_error($opts,
 +      Mail::SpamAssassin::Conf::Parser::set_numeric_value($conf, $key, $value, $line)
 +    );
      $self->inhibit_further_callbacks();
      return 1;
    }
 @@ -100,13 +102,48 @@
  =cut
  
    if ($key eq 'razor_config') {
 -    Mail::SpamAssassin::Conf::Parser::set_string_value($conf, $key, $value, $line);
 +    $self->handle_parser_error($opts,
 +      Mail::SpamAssassin::Conf::Parser::set_string_value($conf, $key, $value, $line)
 +    );
      $self->inhibit_further_callbacks();
      return 1;
    }
  
    return 0;
  }
 +
 +sub handle_parser_error {
 +  my($self, $opts, $ret_value) = @_;
 +
 +  my $conf = $opts->{conf};
 +  my 

Re: improving SURBL without the foot-shooting

2004-10-06 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Kelsey Cummings writes:
 On Tue, Oct 05, 2004 at 03:25:55AM -0700, Jeff Chan wrote:
 4. SURBL query traffic
  
mostly good if you subtract the blacklisted ones
  
  But any big, as-yet-undetected spam domains can also generate
  much traffic.
 
 What if you were to have a friendly ISP that would be willing to send you
 an anonymized data feed that looked something like:
 
 <sa score><tab><spam/ham><tab><url><tab><url><tab><url>\n
 
 It wouldn't be very hard to send this information in realtime. 

funnily enough, I have some IPC::DirQueue code to do this in
a low-impact, low-load manner ;)
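
Independent of the transport, writing and reading one record of the
tab-separated layout suggested above could look roughly like this. Only the
field order comes from the thread; the function names and the one-decimal
score formatting are made up:

```perl
use strict;
use warnings;

# Build one feed line: score, spam/ham verdict, then any number of URLs,
# all tab-separated and newline-terminated.
sub format_feed_line {
  my ($score, $is_spam, @urls) = @_;
  return join("\t", sprintf('%.1f', $score), ($is_spam ? 'spam' : 'ham'), @urls) . "\n";
}

# Parse a feed line back into its parts; returns undef on malformed input.
sub parse_feed_line {
  my ($line) = @_;
  chomp $line;
  my ($score, $verdict, @urls) = split /\t/, $line;
  return undef unless defined $verdict && $verdict =~ /^(?:spam|ham)$/;
  return {
    score   => $score + 0,
    is_spam => ($verdict eq 'spam' ? 1 : 0),
    urls    => \@urls,
  };
}

my $line = format_feed_line(12.3, 1, 'http://example.com/a', 'http://example.net/b');
my $rec  = parse_feed_line($line);
print "$rec->{score} ", scalar @{ $rec->{urls} }, " urls\n";
```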

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBYza+QTcbUG5Y7woRAnKDAKDSpmPVgnBeEk12LdKzjxc5I8Z0RACfTCS1
IOQYMqqh1RyTvuCTb0LZnqo=
=pc3H
-END PGP SIGNATURE-



Re: improving SURBL without the foot-shooting

2004-10-06 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Kelsey Cummings writes:
 On Tue, Oct 05, 2004 at 05:05:18PM -0700, Justin Mason wrote:
  Kelsey Cummings writes:
   On Tue, Oct 05, 2004 at 03:25:55AM -0700, Jeff Chan wrote:
   4. SURBL query traffic

  mostly good if you subtract the blacklisted ones

But any big, as-yet-undetected spam domains can also generate
much traffic.
   
   What if you were to have a friendly ISP that would be willing to send you
   an anonymized data feed that looked something like:
   
   <sa score><tab><spam/ham><tab><url><tab><url><tab><url>\n
   
   It wouldn't be very hard to send this information in realtime. 
  
  funnily enough, I have some IPC::DirQueue code to do this in
  a low-impact, low-load manner ;)
 
 I was actually thinking the easiest way to pass this data in realtime would
 be to send it to surbl's colo at sonic via syslog.  SA can already
 generate it and syslogd can write to a named pipe for processing.  Makes
 it easy to get running.

well, that's true!  didn't think of that.

 But, IPC::DirQueue is useful.  Taking a queue from it I rewrote all of my
 spam processing stuff to operate as a Maildir client.  A single thread has
 proven to be fast enough and a lot better than passing to each processing
 bit via procmail.  If I find that it needs more than one processing thread
 to keep up I'll probably go steal lots of your code. :-p

that's the thing -- it's designed so that if you need more threads, just
start more processes.  you don't even need to synchronize them externally,
it does it itself!
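
To illustrate why extra workers need no outside coordination, here is a
plain-Perl sketch of one common approach: claiming a job by atomic rename.
This is NOT IPC::DirQueue's code and not necessarily how it locks
internally; the function name and directory layout are made up:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Try to claim one job from $queue_dir by renaming it into $work_dir.
# rename() is atomic within one filesystem, so only one worker can win a file.
sub claim_job {
  my ($queue_dir, $work_dir) = @_;
  opendir(my $dh, $queue_dir) or die "cannot open $queue_dir: $!";
  my @jobs = sort grep { !/^\./ } readdir($dh);
  closedir($dh);
  for my $name (@jobs) {
    my $from = File::Spec->catfile($queue_dir, $name);
    my $to   = File::Spec->catfile($work_dir, join('.', $$, $name));
    return $to if rename($from, $to);   # we won the race for this job
    # otherwise another worker got there first; try the next one
  }
  return;   # queue is empty
}

# Demo: enqueue three jobs, then drain the queue from a single worker.
my $queue = tempdir(CLEANUP => 1);
my $work  = tempdir(CLEANUP => 1);
for my $n (1 .. 3) {
  open(my $fh, '>', File::Spec->catfile($queue, "job$n")) or die "enqueue failed: $!";
  print $fh "payload $n\n";
  close($fh);
}
my $claimed = 0;
$claimed++ while defined claim_job($queue, $work);
print "claimed $claimed jobs\n";
```

Running the same loop in several processes against the same queue directory
is safe for the same reason: a losing rename simply fails and the worker
moves on to the next file.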

(ps: did you notice I put up a version that does the hashing thing
you suggested?)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBYzs1QTcbUG5Y7woRAodqAJ98/Q7IfuXdGpY2s+GKzzXjr4mmjgCfUa45
QnSi8VGgTtczz4IgubtH+gs=
=xqNK
-END PGP SIGNATURE-



Re: What's up with reviewing tickets?

2004-10-11 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Theo Van Dinter writes:
 There are currently 5 (of 8) tickets in the 3.0.1 queue in the review
 state.  One has been in review state since 9-29 (3831) and needs another
 +1, 3872 is major (needs another +1), 3741 and 3865 had patch added today,
 and 3806.
 
 I'd like to get 3.0.1 out either this week or next, BTW.

agreed -- slow reviewing is not a good thing... all the patches
are quite simple and amenable to visual review.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBafLEMJF5cimLx9ARAiFsAJ9xfhIHMG5klme53i7ppxWgyjJS3gCgunJF
e8nXmjCcJlRohOEIwUK4mA8=
=B7ol
-END PGP SIGNATURE-



Re: limit on number of URIs decoded?

2004-10-13 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Sidney Markowitz writes:
 Justin Mason wrote:
  The first fix is truncation of the text before passing to TextCat.
  Michael, I think you were looking at this?  the results are impressive,
  if the text is truncated to 32k bytes:
 
 It was me.

oops! sorry ;)

 I've been looking at ways to not have to create so much 
 garbage (I'm a lisp hacker -- I'm not using the word in the pejorative 
 sense) in that loop in create_lm, but the simplest way of dealing with 
 this is to truncate $input to perhaps 10,000 bytes in the call to 
 create_lm. Since TextCat is just a heuristic for determining the 
 language and there is no incentive for spammers to, for example, prefix 
 a Spanish language message with 10,000 bytes of English words just to 
 slip through the spam filters of English-only speakers, the first 10,000 
 bytes is plenty as a limit. Language recognition accuracy does not 
 improve noticeably past one or two thousand characters, while going to 
 less than 10,000 does not provide much additional speed or memory 
 benefit. If there is no real language text in the first 10,000 
 characters of rendered body, then it will not be recognized as any 
 language and the rule will not fire, failing safely.
 
 I propose putting in the truncate for 3.0.1 as a quick and safe way 
 around the problem we saw with that malformed MIME message. I'll keep 
 playing with the loop just in case I can speed it up enough for the 3.1 
 time frame to not have to truncate, but we should do the quick fix right 
 away.

+1 on truncation.   I think it's safe for 3.1.0 as well, fwiw.
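
The truncation itself is a one-liner. A minimal sketch, assuming the text is
a plain byte string; the helper name is hypothetical and 10,000 is just the
figure floated in this thread:

```perl
use strict;
use warnings;

# Cap the text handed to the language guesser. The default limit is the
# figure from the thread; truncate_for_langid is a made-up helper name.
sub truncate_for_langid {
  my ($text, $max_bytes) = @_;
  $max_bytes = 10_000 unless defined $max_bytes;
  return length($text) > $max_bytes ? substr($text, 0, $max_bytes) : $text;
}

my $body  = 'palabra ' x 5000;            # 40,000 bytes of filler
my $input = truncate_for_langid($body);
print length($input), "\n";
```

Note that on a decoded character string length() and substr() count
characters rather than bytes, which only matters if the limit has to be
exact.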

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBbMg7MJF5cimLx9ARAtFHAJ9USbLtlALQNyPh2zO8vY7Ij8iK9wCguY/9
AGySenolwH+E8IPoMDPlXN0=
=nsK7
-END PGP SIGNATURE-



Re: svn commit: rev 54716 - in spamassassin/trunk: . t

2004-10-14 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Sidney Markowitz writes:
 Justin Mason wrote:
  the test should be a no-op without that module -- did that not work?
 
 This is extracted from output of make test, running under Cygwin with 
 perl 5.8.5
 
 t/memory_cycles.Can't locate Devel/Cycle.pm in @INC (@INC 
 contains:
 t . ../blib/lib /c/sasvn/trunk/blib/lib /c/sasvn/trunk/blib/arch 
 /usr/lib/perl5/5.8.5/cygwin-thread-multi-64int /usr/lib/perl5/5.8.5 
 /usr/lib/perl5/site_perl/5.8.5/cygwin-thread-multi-64int 
 /usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl 
 /usr/lib/perl5/vendor_perl/5.8.5/cygwin-thread-multi-64int 
 /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl) at 
 t/memory_cycles.t line 66.
 BEGIN failed--compilation aborted at t/memory_cycles.t line 66.

oops. try current svn... r54765 should fix it...
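
For the record, the usual way to make a test a no-op when an optional module
is absent is to probe for it at run time instead of with a bare "use". A
generic sketch, not the actual r54765 change; have_module is a made-up
helper:

```perl
use strict;
use warnings;

# True if the named module can be loaded, false otherwise (never dies).
sub have_module {
  my ($module) = @_;
  (my $file = "$module.pm") =~ s{::}{/}g;   # Devel::Cycle -> Devel/Cycle.pm
  return eval { require $file; 1 } ? 1 : 0;
}

if (have_module('Devel::Cycle')) {
  print "Devel::Cycle found, cycle tests can run\n";
} else {
  print "Devel::Cycle missing, skipping cycle tests\n";
}
```

In a Test::More based script the else branch would typically be a
"plan skip_all => ..." so the harness reports a skip rather than a failure.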

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBbc+mMJF5cimLx9ARAuVnAJ4xz4LDlgaKhwiCwEq86PLmp1xwjwCgjdtZ
y7K4FA/HB4B1emcrhelzBmI=
=d25n
-END PGP SIGNATURE-



Re: 3.0.1 this week?

2004-10-21 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Michael Parker writes:
 On Thu, Oct 21, 2004 at 12:01:20AM -0400, Theo Van Dinter wrote:
  I'd like to get 3.0.1 released in the next few days.  There are 2 tickets 
  left
  in the queue: can we get them done up in the next day or so?
 
 +1 on a release soon.

+1 here too.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBd0lZMJF5cimLx9ARAqm4AJ0cZqTQ/N3CiKHl3+cyQP466DgaiQCgm09x
JoIjMy6GUhnwwgnV2QpDtDw=
=2UMQ
-END PGP SIGNATURE-



Re: VOTE: release 3.0.1

2004-10-22 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 I propose we release SpamAssassin 3.0.1.  All bugs are closed now.

+1
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBeGTKMJF5cimLx9ARAnavAJ9hJN088VrH7LM1eHiPXr9DJ7xeLACght6V
3dqofd78+gOrQqUKyk5FLBs=
=D2/q
-END PGP SIGNATURE-



SpamAssassin 3.0.1 is released!

2004-10-23 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


SpamAssassin 3.0.1 is released!  3.0.1 contains some important
bugfixes, and is recommended.

Highlights:

  - excessive memory-usage fixes
  - bug fixed which stopped DCC, Pyzor working with amavisd
  - deprecate RCVD_IN_RFC_IPWHOIS
  - user_prefs were staying active between different spamd users, fixed
  - user_prefs blacklist entries were not working in spamd, fixed
  - excessive time and memory consumption when ok_languages is used, fixed
  - sa-learn -u switch to specify the username for virtual environments
  - avoid bug in Sys::Hostname::Long that renames the hostname when make
    test is run
  - whitelist the top 125 queried SURBL domains common in nonspam

Pick it up at http://spamassassin.apache.org/ !

md5sum of archive files:
  83f60f97c823d9b8df19309247fe33eb  Mail-SpamAssassin-3.0.1.tar.bz2
  759e0486b07c4a03aa340d4a04e1d849  Mail-SpamAssassin-3.0.1.tar.gz
  e42d4f6b7228f899efdfdce03b8851a0  Mail-SpamAssassin-3.0.1.zip

sha1sum of archive files:
  7ad929efc388ebdf26da052c6fca958c7541bb4f  Mail-SpamAssassin-3.0.1.tar.bz2
  a3aebae1bf3c97830e540c42dc64791787d966c9  Mail-SpamAssassin-3.0.1.tar.gz
  e4f23ad8251914bb240a4e42438310a263ca5056  Mail-SpamAssassin-3.0.1.zip


The release files also have a .asc accompanying them.  The file serves
as an external GPG signature for the given release file.  The signing
key is available via the wwwkeys.pgp.net key server, as well as
http://spamassassin.apache.org/released/GPG-SIGNING-KEY

The key information is:

pub  1024D/265FA05B 2003-06-09 SpamAssassin Signing Key [EMAIL PROTECTED]
 Key fingerprint = 26C9 00A4 6DD4 0CD5 AD24  F6D7 DEE0 1987 265F A05B

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBectWMJF5cimLx9ARAh2DAKCBru7brC0dtjD4G2/QGvAmWntURgCgoKBp
J1C/3vGNxtuJcxuosscN+E4=
=RAAd
-END PGP SIGNATURE-



Re: svn commit: rev 55350 - spamassassin/site

2004-10-23 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


 -font-family: verdana,lucida,helvetica,sans-serif;
 +font-family: arial,helvetica,sans-serif;

just to reiterate -- I'm -1 on this change.  It looks awful
by comparison (where Verdana is available), at least under
Firefox on linux.

Some discussion and agreement is essential before changing
branding elements like this!

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBedkCMJF5cimLx9ARAiCyAKCdbh2AeOaig0yqFM886loey609gACfZq7A
lf5tTovLid57Xy605pAnkRE=
=0Jj8
-END PGP SIGNATURE-



3.0.1 /dist/ area screwups

2004-10-25 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Theo Van Dinter writes:
 On Fri, Oct 22, 2004 at 08:09:10PM -0700, Justin Mason wrote:
  SpamAssassin 3.0.1 is released!  3.0.1 contains some important
  bugfixes, and is recommended.
 
 Another couple of notes about the release.  Apparently the
 dist/spamassassin/source files for 3.0.0 were removed -- so the only version
 available for download now is 3.0.1.  Don't we want to keep the older
 version(s) available for at least some period of time?

This is going by what the ASF guidelines for usage of the mirrored
www.apache.org/dist/ say *must* be done.   see
http://www.apache.org/dev/mirrors.html ,
http://httpd.apache.org/dev/release.html ,
http://cvs.apache.org/~bodewig/mirror.html ,
http://jakarta.apache.org/site/convert-to-mirror.html .
However, I think I agree -- leaving the old versions there for
a short while makes more sense.   Take a read over those and see
what you think.

The fundamental problem this time around was that I miscomputed that
?update parameter.  we should create a simple build script that generates
the correct value for us to cut and paste and cut down on faulty
brain-work. ;)
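
Something as small as this would do for generating the value. It assumes the
parameter is a YYYYMMDDhhmm timestamp, which is the shape of the value used
in this release; whether it should be UTC is a guess on my part, and the
function name is made up:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Format an epoch time as YYYYMMDDhhmm. Using UTC here is an assumption;
# check the mirror docs before relying on it.
sub update_param {
  my ($epoch) = @_;
  $epoch = time unless defined $epoch;
  return strftime('%Y%m%d%H%M', gmtime($epoch));
}

print "downloads.cgi?update=", update_param(), "\n";
```

Run it right after the files are pushed to the dist area and paste the
printed value into the download link.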

There *is* another problem, though -- since the downloads.html/.cgi page
is on the single un-mirrored site, and the downloads are on the mirrors
which may be up to 24 hours out of sync, we would still have to use the
?update=200409211830 parameter on the downloads.cgi URL to ensure that
only up-to-date mirrors are used; otherwise the download link will either

- (a) if it points to Mail-SpamAssassin-3.0.1.tar.gz, return a 404

- (b) if it points to Mail-SpamAssassin-current.tar.gz, return the old
  file which will not match the checksums, and that's not good.

 Also, the dist/spamassassin/source files were removed, but not the symlinks to
 them in dist/spamassassin -- so there were 12 bad symlinks lying around.
 I've already received a complaint note about it, so I removed the bad 
 symlinks.

oops.  my fault!  we need to update build/README to reflect that.

 I really don't understand why we put the source files in the source directory,
 and then have symlinks for them all in the parent directory.  Just put the
 source files in the parent directory!

Again, ASF guidelines.  It might be worth asking infrastructure@ if the
guidelines can be ignored in this case... although I'm not sure there's
a big win.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBfGCmMJF5cimLx9ARAiUjAJ43Mzilp/NpIkAlD/nPSbhm3cGqPACdHzSR
tc6h+C3KAq2K9PCWvbW6M9M=
=9cda
-END PGP SIGNATURE-



Re: debug levels in trunk

2004-10-27 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 was: Re: [Bug 3931] [review] remove the annoying 'inhibited further 
 callbacks' debug message
 
  (a) new debug code in 3.1.0 doesn't have higher debug levels
 
  Really?  That kind of sucks (although we never really used it anyway...)
 
 While we have debug levels in trunk ...
 
  - dbg()  debugging message
  - info() informational message (okay to be logged by spamd always)
  - warn() something went very wrong
  - die()  ouch!
 
 ... I agree that we do not need more verbose debugging levels than
 dbg().  I think more verbose than dbg() means you comment out the
 dbg() statement.  :-)

yeah, I really agree -- I have used higher debugging levels only *once*;
for the RBL code.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBgBvBMJF5cimLx9ARAgtJAJwPD4JkBgSedM2nGJNshD0avFfqRgCgpTrt
uj1va0rqYSVnZ8it5BYX8g0=
=C8qn
-END PGP SIGNATURE-



Re: [Bug 3940] ArchiveIterator uses opt_j for two different things

2004-11-01 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


  I'd strongly prefer (I'm probably -1 on creating two new options for
  this one) to keep opt_j as the number of processes (it parallels make
  -j) and add a new option for the temporary file vs. in-memory option.
  The temporary file thing postdates -j by a long period and can just move
  to a new option.
 
 I think just adding a new option for storage is doable, but FWIW I don't
 really care about parallels ... -j since this is all internal API names.
 The commandline can stay the same, but unless you're used to the module,
 opt_j isn't very descriptive of what the value means.

oh btw, on that point, I'd be very pro adding *new*, meaningful names
for opt_j, opt_n et al as they are used in the
Mail::SpamAssassin::ArchiveIterator class, and leaving opt_j, opt_n et al
as backwards-compat aliases.   I agree, they don't make much sense
for users of that module apart from mass-check.
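
Roughly what I have in mind, as a standalone sketch rather than a patch to
the class. Only opt_j's meaning (number of processes) comes from this
thread; the new names, and what opt_n stands for here, are placeholders:

```perl
use strict;
use warnings;

# Map old terse option names to descriptive ones. The new names and the
# meaning given to opt_n are placeholders for illustration.
my %ALIASES = (
  opt_j => 'opt_jobs',
  opt_n => 'opt_keep_order',
);

# Accept either spelling, store under the new name; new names win on conflict.
sub normalize_opts {
  my (%in) = @_;
  my %out;
  for my $key (keys %in) {
    my $new = exists $ALIASES{$key} ? $ALIASES{$key} : $key;
    next if exists $ALIASES{$key} && exists $in{$new};   # explicit new name wins
    $out{$new} = $in{$key};
  }
  return %out;
}

my %opts = normalize_opts(opt_j => 4, opt_keep_order => 1);
print join(', ', map { "$_=$opts{$_}" } sort keys %opts), "\n";
```

Existing callers such as mass-check keep passing opt_j and nothing breaks,
while new code can use the readable names.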

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBhYJGMJF5cimLx9ARAnZWAKCFHqthUt9p7kCQJTqkLsAjBqXWTACZAcd+
lReIi6mhyf165yWgmgAmtJI=
=BMAd
-END PGP SIGNATURE-



Re: svn commit: rev 56270 - spamassassin/trunk/masses

2004-11-01 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


 Log:
 work on the mass-check output a bit, state when scan has ended and
 run begins (rough approximation since the run has already begun at
 that point), format the lines better, etc.

hey btw are we going to merge Duncan's mods?  we really should, that
code will rot otherwise.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBho24MJF5cimLx9ARAsOuAJ9Gs2Ge8AHpMRA0JB5gtPc4ToAbdwCbBxlb
HabTXF7ghvmZNTFi0ZQu9U0=
=/zhh
-END PGP SIGNATURE-



proposal: an automated rule-qa system

2004-11-19 Thread Justin Mason
So, we were discussing the rules situation -- ie. that we've been pretty
crap at getting rules into the distro. I proposed this, and I think we're
reasonably into the idea as a way to help out.

We add a web-app somewhere that periodically scrapes bugzilla
for bugs on the rules component which contain some token from trusted
users indicating that they contain rules that need testing.

That then extracts rules from attachments/text on that bug, and

- (a) checks out SVN trunk
- (b) adds them to the rules dir of that in a temporary file
- (c) runs a mass-check on those rules
- (d) does simple lint using spamassassin --lint and
  lint-rules-from-freqs
- (e) does some kind of basic S/O testing
- (f) it may be that we can also check in the rules into SVN for a full
  nightly mass-check from all the people doing those, in which case it
  should come up with the results from that, nicely snipped out of the
  full reports.
- (g) if we do (f), we can even get the results, segmented by the age of
  the corpus used!  in other words, give us a picture of the freqs based
  on how old the messages it was hitting on were.
- (h) -- possibly -- do a quick perceptron run to evaluate if the rule
  overlaps with other rules too much.

Finally, it'll display the results at a given URL -- probably based on the
bug and comment numbers, so it's easily hyperlinkable.

Using bugzilla as the backend is useful, btw, as that gives us

  - threaded discussion of rules
  - contributor CLA status tracking
  - good ways to get lists and overviews of what contributions are
    available and their status
  - gatewayed to mailing list, and viewable via www

Sound useful?  That should at least take some legwork out of rule QA,
and stop us committers being a bottleneck in the process.

--j.


Re: Java client to spamd

2004-11-19 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Kurt Humes writes:
 I am beginning to build a Java library to act as a client to spamd, not
 using JNI however.  Has anyone ever done something similar and if so
 what are the roadblocks that you have come across.

Kurt, I'm unaware of anything, but it should be very, very
straightforward.

(only (minor) roadblock: there was a bug in whitespace handling at the end
of the server response to one of the request verbs, can't remember which
one, but it's documented in spamd/PROTOCOL.)
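
To give a feel for how little there is to it, here is the request/response
shape in Perl (the same logic ports directly to Java). It only builds and
parses strings, with no socket code. The function names are made up, and the
parser is deliberately lenient about whitespace; spamd/PROTOCOL remains the
authority on the details:

```perl
use strict;
use warnings;

# Build a CHECK request: verb line, Content-length header, blank line, message.
sub build_check_request {
  my ($message) = @_;
  return "CHECK SPAMC/1.2\r\n"
       . "Content-length: " . length($message) . "\r\n"
       . "\r\n"
       . $message;
}

# Parse the response headers; tolerant of stray whitespace around fields.
sub parse_response {
  my ($response) = @_;
  my %r;
  for my $line (split /\r?\n/, $response) {
    last if $line =~ /^\s*$/;   # blank line ends the headers
    if ($line =~ m{^SPAMD/(\S+)\s+(\d+)\s+(\S+)}) {
      @r{qw(version code status)} = ($1, $2, $3);
    }
    elsif ($line =~ m{^Spam:\s*(\S+)\s*;\s*(-?[\d.]+)\s*/\s*(-?[\d.]+)\s*$}i) {
      $r{is_spam}   = (lc($1) eq 'true') ? 1 : 0;
      $r{score}     = $2 + 0;
      $r{threshold} = $3 + 0;
    }
  }
  return \%r;
}

my $r = parse_response("SPAMD/1.1 0 EX_OK\r\nSpam: True ; 15.0 / 5.0 \r\n\r\n");
print "spam=$r->{is_spam} score=$r->{score}\n";
```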

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBnn5xMJF5cimLx9ARArbIAKCx/cCfhv0813QtyDF6lRC0zY9p+gCfcukJ
1R7sGioj2UFAVNc7PJ1ZkiY=
=hAuU
-END PGP SIGNATURE-



Re: [SpamAssassin Wiki] Updated: FrontPage

2004-11-20 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


btw, I was thinking of keeping this link around, on the wiki at least, in
case the slides became available...  (hint hint ;)

- --j.

[EMAIL PROTECTED] writes:
Date: 2004-11-19T21:55:15
Editor: DanielQuinlan [EMAIL PROTECTED]
Wiki: SpamAssassin Wiki
Page: FrontPage
URL: http://wiki.apache.org/spamassassin/FrontPage
 
remove conference link
 
 Change Log:
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBn8tnMJF5cimLx9ARApoaAJ4h7fl7vzdFrRMu4YVzu5nnwzT/0ACeIQf5
rAj4dJ0o869R0r4CZ+Mv4jM=
=fZH0
-END PGP SIGNATURE-



TIP: very useful '%seen' trick

2004-11-20 Thread Justin Mason
this just came up on perl5-porters...
http://www.nntp.perl.org/group/perl.perl5.porters/96100 :

  Subject: Re: sharing hash-values
  From: btilly[at]gmail.com (Ben Tilly)
  ...
  I forgot who I first saw mention this, possibly gbarr, but the following
  variation on %seen seems to be the fastest in native Perl:

my %seen;
undef @[EMAIL PROTECTED];
for (@things) {
  if (exists $seen{$_}) {
...
  }
}

  This avoids creating the hash values entirely.  (Or at least it did a few
  revs of Perl ago.)
  Cheers,
  Ben

sure enough, using the shared undef SV as the magic value is 7% faster and
doesn't allocate the scalars to reduce RAM usage ;)   definitely the better
idiom.  Benchmark:

: jm 1122...; perl psc
                Rate traditional  undef_keys
traditional 100014/s          --         -6%
undef_keys  106684/s          7%          --


script:

#!/usr/bin/perl -w

use Benchmark qw(:all);
use strict;

my @things = qw(
    foo bar baz foo foo foo bar bar baz baz blarg
);

cmpthese (-2, {
    'traditional' => sub {
        my $res = '';
        my %seen;
        for (@things) {
          next if $seen{$_};
          $seen{$_} = 1;
          $res .= "$_\n";
        }
    },
    'undef_keys' => sub {
        my $res = '';
        my %seen;
        # undef @[EMAIL PROTECTED];
        for (@things) {
          next if exists $seen{$_};
          undef $seen{$_};
          $res .= "$_\n";
        }
    }
  });


(ps: note the 'undef @[EMAIL PROTECTED];' -- can be used to undef a list of
already-seen special values before the loop.)
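
A small runnable illustration of that pre-seeding, using the slice-assignment
spelling, which likewise creates the keys without storing real values. The
@special array and its contents are made up for the example:

```perl
use strict;
use warnings;

my @special = qw(foo);                      # values to treat as already seen
my @things  = qw(foo bar baz bar blarg);

my %seen;
@seen{@special} = ();                       # hash slice: keys exist, values stay undef
my @fresh;
for (@things) {
  next if exists $seen{$_};                 # test key existence, not truth of the value
  undef $seen{$_};                          # mark as seen without storing a value
  push @fresh, $_;
}
print "@fresh\n";
```

The one thing to remember with this idiom is that every lookup has to use
exists, since all the stored values are undef and therefore false.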

--j.


Re: svn commit: r106135 - /spamassassin/trunk/rules/20_head_tests.cf /spamassassin/trunk/rules/50_scores.cf

2004-11-22 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


btw, would a name prefix sound like a good idea for a convention to
indicate rules that exist to catch never-seen-in-the-wild spammer
exploits?  Something like EVIL or similar?

it'd provide a great way to (a) visually see that those rules not
firing is not a problem in hit-frequencies output, and (b) grep
them out for the same purpose.

- --j.

[EMAIL PROTECTED] writes:
 Author: quinlan
 Date: Sun Nov 21 14:49:14 2004
 New Revision: 106135
 
 Modified:
spamassassin/trunk/rules/20_head_tests.cf
spamassassin/trunk/rules/50_scores.cf
 Log:
 promote T_FRAGMENTED_MESSAGE to FRAGMENTED_MESSAGE
 
 Modified: spamassassin/trunk/rules/20_head_tests.cf
 Url:
 http://svn.apache.org/viewcvs/spamassassin/trunk/rules/20_head_tests.cf?view=diff&rev=106135&p1=spamassassin/trunk/rules/20_head_tests.cf&r1=106134&p2=spamassassin/trunk/rules/20_head_tests.cf&r2=106135
 ==
 --- spamassassin/trunk/rules/20_head_tests.cf (original)
 +++ spamassassin/trunk/rules/20_head_tests.cf Sun Nov 21 14:49:14 2004
 @@ -27,6 +27,12 @@
  header HEAD_LONG eval:check_for_long_header()
  describe HEAD_LONG   Message headers are very long
  
 +# partial messages; currently-theoretical attack
 +# unsurprisingly this hits 0/0 right now.  But should we promote it anyway
 +# to protect against the possibility?
 +header FRAGMENTED_MESSAGE    Content-Type =~ /\bmessage\/partial/i
 +describe FRAGMENTED_MESSAGE  Partial message
 +
  header MISSING_HB_SEP        eval:check_for_missing_hb_separator()
  describe MISSING_HB_SEP      Missing blank line between message header and body
  
 
 Modified: spamassassin/trunk/rules/50_scores.cf
 Url:
 http://svn.apache.org/viewcvs/spamassassin/trunk/rules/50_scores.cf?view=diff&rev=106135&p1=spamassassin/trunk/rules/50_scores.cf&r1=106134&p2=spamassassin/trunk/rules/50_scores.cf&r2=106135
 ==
 --- spamassassin/trunk/rules/50_scores.cf (original)
 +++ spamassassin/trunk/rules/50_scores.cf Sun Nov 21 14:49:14 2004
 @@ -619,10 +619,9 @@
  # GTUBE - Generic Test for Unsolicited Bulk Email
  score GTUBE 1000.000
  
 -# long header test
 +# we dare you
  score HEAD_LONG 2.5
 -
 -# missing blank line between header and body
 +score FRAGMENTED_MESSAGE 2.5
  score MISSING_HB_SEP 2.5
  
  # HTML control test
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBoWTCMJF5cimLx9ARAh+3AJ9ecYONAcMjCwbioiqQM70kxBV4KwCgh4+A
TG2qxiUfpF1l0YAMunQ07xY=
=1OZx
-END PGP SIGNATURE-



Re: svn commit: r106170 - /spamassassin/trunk/spamd/spamd.raw

2004-11-22 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 [EMAIL PROTECTED] writes:
 
  [...]
   sub service_unavailable_error {
     my ($err) = @_;
     my $resp = EX_UNAVAILABLE;
  -  print $client "SPAMD/1.0 $resphash{$resp} Service Unavailable: $err\r\n";
  +  syswrite( $client, "SPAMD/1.0 $resphash{$resp} Service Unavailable: $err\r\n" );
     logmsg("service unavailable: $err");
   }
 
 Please try to use the more standard perl formatting:
 
   http://wiki.apache.org/spamassassin/CodingStyle
 
 Thanks!

ah, the foo( bar ) vs. foo(bar) style issue ;)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBojZzMJF5cimLx9ARAnMvAKCBfg2Z0B/LDlRlnOW5eBXqkpJzgACbBiiB
TvayyMOJ8XHjvmhxmKZy1p4=
=NDK9
-END PGP SIGNATURE-



Re: svn commit: r106170 - /spamassassin/trunk/spamd/spamd.raw

2004-11-22 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Sidney Markowitz writes:
 Daniel Quinlan wrote:
  Please try to use the more standard perl formatting:
 
 Do you see anything wrong other than two of the lines being more than 80 
   characters? I'll check in an update to fix that as soon as I finish 
 running a make test on the change.

Sidney -- I think it's the

foo( bar )

vs.

foo(bar)

style thing.  Daniel prefers the latter -- no extra spaces after the
bracket, and we've agreed to go with that. ;)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBojbwMJF5cimLx9ARAq2sAJ9n90D3h0q567L1ZD4GO9Fy9g9LGwCgmOoh
k3AFTFQV8Z7dmCrpEbjDILE=
=HsD8
-END PGP SIGNATURE-



Re: svn commit: r105955 - in spamassassin/trunk: . lib/Mail

2004-11-24 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


reminder: do we have a consensus on what to do about this?  can we reinstate
the functions in the meantime?

- --j.

[EMAIL PROTECTED] writes:
 Author: quinlan
 Date: Sat Nov 20 00:17:49 2004
 New Revision: 105955
 
 Modified:
spamassassin/trunk/lib/Mail/SpamAssassin.pm
spamassassin/trunk/spamassassin.raw
 Log:
 bug 3856: remove debug_diagnostics() from Mail::SpamAssassin
 
 Modified: spamassassin/trunk/lib/Mail/SpamAssassin.pm
 ==
 --- spamassassin/trunk/lib/Mail/SpamAssassin.pm   (original)
 +++ spamassassin/trunk/lib/Mail/SpamAssassin.pm   Sat Nov 20 00:17:49 2004
 @@ -1159,35 +1159,6 @@
  
  ###
  
 -=item $f->debug_diagnostics ()
 -
 -Output some diagnostic information, useful for debugging SpamAssassin
 -problems.
 -
 -=cut
 -
 -sub debug_diagnostics {
 -  my ($self) = @_;
 -
 -  foreach my $module (sort qw(
 -Net::DNS Razor2::Client::Agent MIME::Base64
 -IO::Socket::UNIX DB_File Digest::SHA1
 -DBI URI Net::LDAP Storable
 -))
 -  {
 -my $modver;
 -if (eval ' require '.$module.'; $modver = $'.$module.'::VERSION; 1;')
 -{
 -  $modver ||= '(undef)';
 -      dbg("diag: module installed: $module, version $modver");
 -    } else {
 -      dbg("diag: module not installed: $module ('require' failed)");
 -}
 -  }
 -}
 -
 -###
 -
  =item $failed = $f->lint_rules ()
  
  Syntax-check the current set of rules.  Returns the number of 
 
 Modified: spamassassin/trunk/spamassassin.raw
 ==
 --- spamassassin/trunk/spamassassin.raw   (original)
 +++ spamassassin/trunk/spamassassin.raw   Sat Nov 20 00:17:49 2004
 @@ -240,7 +240,6 @@
  );
  
  if ( $opt{'lint'} ) {
 -  $spamtest->debug_diagnostics();
    my $res = $spamtest->lint_rules();
    warn "lint: $res issues detected.  please rerun with debug enabled for more information.\n" if ($res);
    exit $res ? 1: 0;
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBo/72MJF5cimLx9ARAnxLAJ9dZWJ56pvO49Pf6JlUjzltegzmZwCfcyDk
ZArJ2xKE6qz2EDaKqjDenG4=
=CNr3
-END PGP SIGNATURE-



req: volunteers to run buildbot slaves

2004-11-25 Thread Justin Mason
so we're setting up a distributed build-testing system,
BuildBot (http://buildbot.sourceforge.net/), for now at

  http://bugzilla.spamassassin.org:8010/

(that url may change.) it currently has 4 build slaves, building

- trunk using Red Hat 7.3's perl
- trunk using vanilla perl 5.6.1
- trunk using vanilla perl 5.8.5 with threading
- b3.0 using Red Hat 7.3's perl

If you fancy it, and are running an OS different from the above (!), it
might be worthwhile setting up a build slave to extend this... non-linux
platforms especially would be great.  Any platform where make test
currently passes, or nearly does, would be preferred ;)

Notes:

  - the slave process should be kept up and running as much as possible;
it's got to be a persistent daemon.

  - I'd recommend running as non-root, and not as your own userid. if a
miscreant managed to get hostile code into SVN trunk, it'd pretty
quickly get run on your machine by this code.

  - it's not *too* CPU hungry -- but will kick off a compile and make
test *every time* someone checks something into SpamAssassin svn!  so
if that puts you off, this isn't for you ;)

so pretty much, overall, this requires that you have root on some box
which has a 99%-uptime network connection to set a slave up.



Process to set up a build slave:


[install Twisted 1.3.0.  can be omitted if you already have it,
or just use sudo apt-get install twisted if you're on debian
unstable.]

[note that you also need python 2.2 or later installed.]

wget http://twistedmatrix.com/downloads/Twisted-1.3.0.tar.bz2
bunzip2 -cd Twisted-1.3.0.tar.bz2 | tar xvf -
cd Twisted-1.3.0 ; sudo python setup.py install
cd ..


wget http://internap.dl.sourceforge.net/sourceforge/buildbot/buildbot-0.6.1.tar.gz
tar xvfz buildbot-0.6.1.tar.gz
cd buildbot-0.6.1 ; sudo python setup.py install

sudo useradd -c "SpamAssassin Buildbot" buildbot
sudo su - buildbot
mkdir -p /home/buildbot/slaves

[now, you need the buildbot password.  ask on the IRC channel
and one of the PMC should be able to set you up with one.]

PASSWORD=[password]

[give your slave a good name, like debian-stable or
ubuntu-hoary-perl585]

HOST_OS=hostname-osname
buildbot slave /home/buildbot/slaves/$HOST_OS bugzilla.spamassassin.org:9989 \
$HOST_OS $PASSWORD

[and mail dev/at/SpamAssassin.apache.org the $HOST_OS string you've
chosen.]

[to start the slave process]

buildbot start /home/buildbot/slaves/$HOST_OS

[to monitor slave progress/errors:]

less /home/buildbot/slaves/$HOST_OS/twistd.log

[to start at boot in future: add this line to crontab:]

@reboot buildbot start /home/buildbot/slaves/hostname-osname


--j.


Re: svn commit: r106600 - /spamassassin/trunk/t/SATest.pm

2004-11-27 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 [EMAIL PROTECTED] writes:
 
  What's the probability that I run into an already used port with the
  new probably_unused_spamd_port() code?  Less than 1 per mill?  Ask
  Murphy...
 
 The only chance of a collision is if the port is listed in
 /etc/services.  My system only has 3 TCP ports above 32768 listed.  So
 if my math is right, that's a 0.003% chance of a collision between
 two processes.  The purely random code had a 0.1% chance of a collision
 between two processes (running at the same time which could happen),
 mostly because it only used 1000 ports.  A 32768-port random version
 would have a 0.003% chance of a collision.
  
  The routine now tries to ask netstat if that port is already in use.
  I tested the pattern on Linux, FreeBSD and Windows.  If netstat can't
  be run, no harm is done, the routine will just work as before.  The
  grep is pretty broad, it might also catch a remote port; then it just
  tries the next random one.  (Hey Murphy, it really can't hit a used
  port ten times, can it?)
 
 I'm not a big fan of shell calls, but it looks (untested) like it'll
 work on Windows XP too.

wow guys -- overkill ;)   I think both approaches are wrong.

Firstly, checking services seems pointless, because if you ask me, there's
actually a *low* likelihood that processes listening on high ports will be
listed in /etc/services at all.  Here's why:

1. I've heard of very few official services on ports > 32768 in general.
So I'd surmise that if one is running, the user who started it just picked
a port at random.

2. typically a daemon running on a high port will be something that was
started by a user instead of root, and users don't have write perms on
/etc/services.

Finally, mss' approach is wrong because it's too inefficient, requiring
(another) command to be forked every time a t script starts.  An easier,
portable way to check if a port is in use: use Socket to connect() to it,
and pick a new port if the connect succeeds. No fork overhead, no
portability worries.
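
A minimal sketch of the connect() idea, assuming the core IO::Socket::INET
module is acceptable in the test harness.  The names port_in_use() and
pick_unused_port(), the port range and the retry count are made up for
illustration; this is not the code in t/SATest.pm.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Returns 1 if something accepts TCP connections on 127.0.0.1:$port,
# 0 otherwise.  (Hypothetical helper, not part of SATest.pm.)
sub port_in_use {
  my ($port) = @_;
  my $sock = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $port,
    Proto    => 'tcp',
    Timeout  => 1,
  );
  return 0 unless $sock;    # connect failed: nothing is listening
  close($sock);
  return 1;
}

# Pick a random high port, trying again while connect() succeeds.
# Returns nothing if every attempt hit a port that is in use.
sub pick_unused_port {
  my ($tries) = @_;
  $tries ||= 10;
  for (1 .. $tries) {
    my $port = 32768 + int(rand(32767));    # 32768 .. 65534
    return $port unless port_in_use($port);
  }
  return;
}
```

Note that there is still a small window between the check and the moment
spamd binds the port, so this lowers the odds of a collision rather than
ruling it out.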

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBp9mIMJF5cimLx9ARAn4YAJkB/aTNG9Gm/oGcV+53CVwQnWRiEACgtdkE
c/A9EwOAKmpB+b+vmyscqgA=
=MO4Z
-END PGP SIGNATURE-



MIT spam conference

2004-11-27 Thread Justin Mason
looks like it *is* indeed on this year -- http://www.spamconference.org/
CFP ends in 4 days though.

--j.


Re: Restarting MakeMaker development (fwd)

2004-12-01 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Tony Finch writes:
 perl Build.PL install_base=~
  
   This is different from PREFIX in that its not going to try and guess how
   you want things installed based on your system installation.  It's just
   going to plop things into ~/bin, ~/lib, ~/man, etc...  This is much
   saner and easier to predict than PREFIX.
 
 I install SpamAssassin in a non-standard location in order to permit
 multiple parallel installations. This sounds much closer to what I want -
 it's really painful to get MakeMaker to do the right thing.

hmm, are you using the way documented in the INSTALL file?  as far as I
know that should work reliably --

perl Makefile.PL PREFIX=$HOME

I *think* we got that working eventually.  agreed, it was tricky due
to EU::MM weirdness.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBrgY3MJF5cimLx9ARAsQLAKCUwQ4RQEs4h/BOyux7VBlRb6yvYwCeI1Gm
YtvvtrcF3IPaj1ofRas355A=
=f7/Y
-END PGP SIGNATURE-



Re: Cleaning up the test framework

2004-12-02 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Matt Sergeant writes:
 On 1 Dec 2004, at 15:13, Malte S. Stretz wrote:
 
  So I'd like to keyword some of the tests as basic (or whatever
  keyword) and only those tests are run per default.  All other tests
  would be used by us devs, people who we ask to debug one of their bug
  reports and the BuildBots.
 
  No more options, please.  And there's no reason to speed it up for
  users because users only run make test once in a while.
 
  My idea was that per default all tests are run (except everything which
  requires further set-up or can fail easily like net tests or SQL).
 
 See prove (now shipped with core perl) and Test::Verbose's tv command 
 (which is indispensable to any perl developer IMHO). Also consider 
 adding some tests to CVS but not adding them to the MANIFEST, which 
 will achieve what you require there.

prove is good, and already works with our current t scripts too --
bonus!

Re: adding some tests to CVS but not adding them to the MANIFEST: that
will indeed result in some tests that aren't in the distro but are run
from svn, but that doesn't address the situation where a test needs
extra configuration data (such as LDAP schema to use, LDAP server,
blah blah).

Still, it's a good way to have SVN-only tests.
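
For the tests that do need extra configuration, one common pattern is to
skip them unless the settings are supplied in the environment.  A rough
sketch follows; the helper name missing_test_config() and the SA_TEST_*
variable names are invented here and are not existing SpamAssassin
settings.

```perl
use strict;
use warnings;

# Return the names of the required settings that are unset or empty
# in %ENV.  (Hypothetical helper, for illustration only.)
sub missing_test_config {
  my (@names) = @_;
  return grep { !defined $ENV{$_} || $ENV{$_} eq '' } @names;
}

# In a t script this might be used as follows (variable names made up):
#
#   use Test::More;
#   my @missing = missing_test_config(qw(SA_TEST_LDAP_URL SA_TEST_LDAP_BASE));
#   plan skip_all => "not configured: @missing" if @missing;
#   plan tests => 3;
```

That way the test can ship in the distro and still do nothing on machines
where the LDAP server and schema have not been set up.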

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBr2maMJF5cimLx9ARAnpZAJwNso5hp9fhFfzxulrdO9YG4a/JqACgs3hQ
J/sKng0FFtWQ2AGmJjfG2tw=
=dZZD
-END PGP SIGNATURE-



Re: svn commit: r109552 - /spamassassin/trunk/lib/Mail/SpamAssassin.pm /spamassassin/trunk/spamd/spamd.raw

2004-12-02 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


as a matter of interest -- I guess this is for Daniel -- why is that
debug-area-splitting-and-validation code not part of Mail::SpamAssassin,
anyway?

Looks like it's duplicated in spamd, spamassassin, and sa-learn, which
will result in it getting changed in one and forgotten in the others (as
has just happened here ;)   Code duplication = bad.
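
A sketch of what a single shared helper could look like, following the
logic of the spamd code in the commit quoted below.  The name
parse_debug_areas() and where it would live are assumptions, not an
existing Mail::SpamAssassin API.

```perl
use strict;
use warnings;

# Hypothetical shared helper: turn the value of a --debug option into
# a list of debug areas.
#   undef        -> ()        (option not given)
#   '' or 0      -> ('all')   (option given without any areas)
#   'dns,bayes'  -> ('dns', 'bayes')
sub parse_debug_areas {
  my ($opt) = @_;
  return () unless defined $opt;
  return ('all') unless $opt;
  my @areas = split(/,/, $opt);
  if (grep { !/^\S+$/ } @areas) {
    warn "bad areas in --debug option\n";
  }
  return @areas;
}
```

spamd, spamassassin and sa-learn could then each call the one routine
with their own $opt{'debug'} value instead of carrying a copy of the
split-and-validate code.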

- --j.

[EMAIL PROTECTED] writes:
 Author: mss
 Date: Thu Dec  2 13:36:57 2004
 New Revision: 109552
 
 URL: http://svn.apache.org/viewcvs?view=rev&rev=109552
 Log:
 Made it possible to replace all the warn() kludges in spamd with dbg() or 
 info() calls.
 
 Modified:
spamassassin/trunk/lib/Mail/SpamAssassin.pm
spamassassin/trunk/spamd/spamd.raw
 
 Modified: spamassassin/trunk/lib/Mail/SpamAssassin.pm
 Url:
 http://svn.apache.org/viewcvs/spamassassin/trunk/lib/Mail/SpamAssassin.pm?view=diff&rev=109552&p1=spamassassin/trunk/lib/Mail/SpamAssassin.pm&r1=109551&p2=spamassassin/trunk/lib/Mail/SpamAssassin.pm&r2=109552
 ==
 --- spamassassin/trunk/lib/Mail/SpamAssassin.pm   (original)
 +++ spamassassin/trunk/lib/Mail/SpamAssassin.pm   Thu Dec  2 13:36:57 2004
 @@ -243,16 +243,8 @@
if (!defined $self) { $self = { }; }
bless ($self, $class);
  
 -  # define debugging facilities first
 -  $INFO = 0;
 -  $DEBUG = 0;
 -  if (defined $self->{debug} && ref($self->{debug}) eq "ARRAY") {
 -    $facilities{$_} = 1 for @{ $self->{debug} };
 -    # turn on informational notices
 -    $INFO = 1 if keys %facilities;
 -    # turn on debugging if facilities other than info are enabled
 -    $DEBUG = keys %facilities && !(keys %facilities == 1 && $facilities{info});
 -  }
 +  # enable or disable debugging
 +  Mail::SpamAssassin::_init_debugger(ref $self->{debug} eq 'ARRAY' ? @{ $self->{debug} } : ());
  
    # first debugging information possibly printed should be the version
    info("generic: SpamAssassin version ".Version());
 @@ -280,6 +272,25 @@
  
$self;
  }
 +
 +# Do not use this routine in any 3rd-party scripts, it's not part of the
 +# official public API!  spamd needs it though.
 +#
 +# Enables or disables debugging based on the facilities given.  This will
 +# affect ALL SpamAssassin objects!
 +sub _init_debugger {
 +  # define debugging facilities first
 +  $INFO = 0;
 +  $DEBUG = 0;
 +  if (@_) {
 +$facilities{$_} = 1 for @_;
 +# turn on informational notices
 +$INFO = 1 if keys %facilities;
 +# turn on debugging if facilities other than info are enabled
 +    $DEBUG = keys %facilities && !(keys %facilities == 1 && $facilities{info});
 +  }
 +}
 +
  
  sub create_locker {
my ($self) = @_;
 
 Modified: spamassassin/trunk/spamd/spamd.raw
 Url:
 http://svn.apache.org/viewcvs/spamassassin/trunk/spamd/spamd.raw?view=diff&rev=109552&p1=spamassassin/trunk/spamd/spamd.raw&r1=109551&p2=spamassassin/trunk/spamd/spamd.raw&r2=109552
 ==
 --- spamassassin/trunk/spamd/spamd.raw(original)
 +++ spamassassin/trunk/spamd/spamd.rawThu Dec  2 13:36:57 2004
 @@ -217,7 +217,7 @@
    'auto-whitelist|whitelist|a'  => sub { warn "The -a option has been removed.  Please look at the use_auto_whitelist config option instead.\n"; exit 2; },
  
  ) or print_usage_and_exit();
 - 
 +
  if ($opt{'help'}) {
print_usage_and_exit(qq{For more details, use man spamd.\n}, 'EX_OK');
  }
 @@ -226,6 +226,25 @@
exit($resphash{'EX_OK'});
  }
  
 +
 +# Enable debugging, if any areas were specified.  We do this already here,
 +# accessing some non-public API so we can use the convenient dbg() routine.
 +my @DEBUG;
 +if (defined $opt{'debug'}) {
 +  if ($opt{'debug'}) {
 +    @DEBUG = split(/,/, $opt{'debug'});
 +    if (grep { !/^\S+$/ } @DEBUG) {
 +      warn "bad areas in --debug option\n";
 +    }
 +  }
 +  else {
 +    @DEBUG = ("all");
 +  }
 +}
 +# Don't do this at home (aka any 3rd party tools), kids!
 +Mail::SpamAssassin::_init_debugger(@DEBUG);
 +
 +
  # bug 2228: make the values of (almost) all parameters which accept file paths
  # absolute, so they are still valid after daemonize()
  foreach my $opt (
 @@ -728,19 +747,6 @@
  Mail::SpamAssassin::Util::untaint_file_path( $opt{'pidfile'} );
  }
  
 -# set debug areas, if any specified (only useful for command-line tools)
 -my @debug;
 -if (defined $opt{'debug'}) {
 -  if ($opt{'debug'}) {
 -    @debug = split(/,/, $opt{'debug'});
 -    if (grep { !/^\S+$/ } @debug) {
 -      warn "bad areas in --debug option\n";
 -    }
 -  }
 -  else {
 -    @debug = ("all");
 -  }
 -}
  
  my $spamtest = Mail::SpamAssassin-new(
{
 @@ -748,7 +754,7 @@
        rules_filename       => ( $opt{'configpath'} || 0 ),
        site_rules_filename  => ( $opt{'siteconfigpath'} || 0 ),
        local_tests_only     => ( $opt{'local'} || 0 ),
 -      debug                => \@debug,
 +      debug                => \@DEBUG,

Re: svn commit: r109710 - /spamassassin/branches/3.0/lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm

2004-12-03 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Theo Van Dinter writes:
 This came up on the list, I considered it trivial enough to just go
 ahead and commit it to the 3.0 branch.  If there's an issue, let me know.
 
 On Fri, Dec 03, 2004 at 05:23:15PM -, [EMAIL PROTECTED] wrote:
  -next unless ($scanner->{conf}->is_rule_active('body_evals',$rulename));
  +next unless ($scanner->{conf}->is_rule_active('body_evals',$rulename) ||
  +            $scanner->{conf}->is_rule_active('head_evals',$rulename));

meh, fine by me ;)  one-liner.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBsKlaMJF5cimLx9ARAgOpAKCULdhKu/NIf5F45osEeIUEMIsjVQCfYA/v
n7D/7BN7TPP6TfAtblpUcbM=
=5KEh
-END PGP SIGNATURE-



Re: Cron release@bugzilla $HOME/bin/extract_to_rsync_dir nightly /home/corpus-rsync/corpus/nightly-versions.txt $HOME/extract.log

2004-12-03 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Theo Van Dinter writes:
 On Fri, Dec 03, 2004 at 12:58:24AM -0800, Cron Daemon wrote:
  svn: In directory 'nightly_mass_check/rules'
  svn: Can't copy
  'nightly_mass_check/rules/.svn/tmp/text-base/25_razor2.cf.svn-base'
  to 'nightly_mass_check/rules/25_razor2.cf.tmp': No space left on
  device
 
 Oops!
 
 /dev/sda3  7701432   7306716  3500 100% /
 
 The current largest thing is Justin's home directory at 3GB:
 3454580 jm

oops!   lots of old GA run data; mostly collated logs.  nuked.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBsK36MJF5cimLx9ARAlTvAKCuvznccztgnJUOFPXAUPFwZMPlVQCgpRUY
FQ2CsqvvDpg08XKkGYIP7ek=
=HeBW
-END PGP SIGNATURE-



Re: [Bug 4016] New: excessive use of fds

2004-12-04 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


yes *please* ;)

- --j.

Tony Finch writes:
 I've been doing some DNS-intensive work with ADNS recently, and I was
 reminded how fast it is and how easy it is to run bulk jobs with over
 10,000 concurrent DNS queries. You only need two sockets! Maybe I should
 beat Net::DNS to death with the clue bat.
 
 http://www.livejournal.com/users/fanf/
 http://www.chiark.greenend.org.uk/~ian/adns/
 
 Tony.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBsROJMJF5cimLx9ARAi7TAKC9zf8CcIrTf2ePYfmE3h/HTYqLggCgsANo
ILD5nFNFjE7fhdDMhTMmMNk=
=9ITT
-END PGP SIGNATURE-



Re: [Fwd: Re: Addressing wiki vandalism (fwd)]

2004-12-06 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Upayavira writes:
 Sidney Markowitz wrote:
 
  I have created the account SidneyMarkowitz on wiki.apache.org/general.
 
  Please give it access to edit LocalBadContent, as described in the 
  forwarded message below. I'm a SpamAssassin committer.
 
 Now, I don't know who you are, so I can't really add you myself 
 comfortably (without, e.g. a CC to a PMC). However, anyone else who is 
 already on the list can add you by editing 
 wiki.apache.org/general/LeoSimons/AdminGroup. Once your name is on that 
 page, you too will be able to add people.
 
 I guess this is a fair enough approach. Self organising community. Hmm.

indeed, that works nicely ;)  Sidney, you're added.  anyone else,
please CC pmc /at/ SpamAssassin.apache.org when requesting.

hmm: in fact, it may be ok to just email PMC alone, since I/Daniel
can do it now without infrastructure help.  (Upayavira, does
that make sense or would addressing to infra be better?)

- --j.

 Regards, Upayavira
 
   Original Message 
  Subject: Re: Addressing wiki vandalism (fwd)
  Date: Mon, 06 Dec 2004 12:41:42 -0800
  From: [EMAIL PROTECTED]
  To: dev@spamassassin.apache.org
 
 
  FYI -- if you're a committer, please sign up to gain access
  to edit this page -- URLs listed in LocalBadContent will be
  blocked on our wiki.
 
  mail infrastructure /at/ apache.org with a request (once
  you've created a user account on wiki.apache.org/general.)
 
  --j.
 
  --- Forwarded Message
 
  Date:Mon, 06 Dec 2004 20:02:16 +
  From:Upayavira [EMAIL PROTECTED]
  To:  Apache Infrastructure [EMAIL PROTECTED]
  Subject: Re: Addressing wiki vandalism
 
  Justin Mason wrote:
 
  -BEGIN PGP SIGNED MESSAGE-
  Hash: SHA1
 
  Leo Simons writes:
 
 
  http://wiki.apache.org/general/LocalBadContent
  
  I've tried to set up some access control so that only users listed on
  
  http://wiki.apache.org/general/LeoSimons/AdminGroup
  
  can edit it (and only users on that page can edit that page).
  
  I'm not sure if that works. Could DavidCrossley or UpayaVira try and
  confirm they can edit, and someone else try and confirm they cannot?
  
  
  My new account there is DanielQuinlan.
  
  
  I'm afraid you'll need to get an account on wiki.apache.org/general as
  well so I can add you to that admingroup page.
  
  
  I've created an account there -- JustinMason.
 
  However, is it wise to restrict who can police the wiki-spam?  I
  don't want to be the lone guy among all the SpamAssassin wiki
  editors who can block a spammer.
 
 
  It completely is necessary to restrict the page. Otherwise a spammer can
  remove his own site! As for a policy, I would have us add any committer
  who asks.
 
  I've added you two, anyway.
 
  Regards, Upayavira
 
  --- End of Forwarded Message
 
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD4DBQFBtOUpMJF5cimLx9ARAtisAJ9R3E90lALqzyGgxTU+/4EvPR0jgQCXXRnt
V6Uu5hklmkTbalNaQ9u4EA==
=Y8Sg
-END PGP SIGNATURE-



Re: [SpamAssassin Wiki] Updated: CommercialNetworkAppliances

2004-12-08 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


[EMAIL PROTECTED] writes:
Date: 2004-12-08T11:22:11
Editor: MrElvey [EMAIL PROTECTED]
Wiki: SpamAssassin Wiki
Page: CommercialNetworkAppliances
URL: http://wiki.apache.org/spamassassin/CommercialNetworkAppliances
 
Justin Mason told me he is an IronPort employee at the FTC Summit last 
 month. 

I suspect that may have been Daniel Quinlan ;)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBt1cZMJF5cimLx9ARAsS3AJ9O7kSBDlARkSUOoKDbRxlzcMuMbgCgndZm
dzVqAAlT2cIGqQad7ftugow=
=oiHf
-END PGP SIGNATURE-



Re: svn commit: r111767 - /spamassassin/trunk/rules/70_testing.cf

2004-12-14 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


[EMAIL PROTECTED] writes:
 Author: quinlan
 Date: Mon Dec 13 16:39:58 2004
 New Revision: 111767
 
 URL: http://svn.apache.org/viewcvs?view=rev&rev=111767
 Log:
 remove Flex Hex rules due to low accuracy

what were the results?

- --j.

 Modified:
spamassassin/trunk/rules/70_testing.cf
 
 Modified: spamassassin/trunk/rules/70_testing.cf
 Url:
 http://svn.apache.org/viewcvs/spamassassin/trunk/rules/70_testing.cf?view=diff&rev=111767&p1=spamassassin/trunk/rules/70_testing.cf&r1=111766&p2=spamassassin/trunk/rules/70_testing.cf&r2=111767
 ==
 --- spamassassin/trunk/rules/70_testing.cf(original)
 +++ spamassassin/trunk/rules/70_testing.cfMon Dec 13 16:39:58 2004
 @@ -444,10 +444,6 @@
  
  
  
 -body T_HTML_COLOR_FLEX_HEX_1 eval:html_test('flex_hex1')
 -body T_HTML_COLOR_FLEX_HEX_2 eval:html_test('flex_hex2')
 -body T_HTML_COLOR_FLEX_HEX_3 eval:html_test('flex_hex3')
 -
  body T_HTML_TAG_EXIST_BGSOUND eval:html_tag_exists('bgsound')
  
  body T_HTML_IMAGE_SIZE_ZERO  eval:html_test('image_size_zero')
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBvjfoMJF5cimLx9ARAqmeAJoC1LsFiyy7oAgPh2cn5hzSIwKuPQCbBBqg
GPRmVGi65Qmr255n9XfEjQc=
=hWvP
-END PGP SIGNATURE-



Storable and hyperthreading

2004-12-15 Thread Justin Mason
OK, so on the spamd hang bugs, we have:

- a set of people reporting hangs predominantly (all?) when running spamd
  on hyperthreaded CPUs
- not all HT CPUs are acting up
- a hang traced into Storable::dclone() (thanks Dallas!)

so I think we may have run into a perl thread-safety bug, possibly
in Storable, possibly at a lower level, and running on HT CPUs causes
this bug to manifest itself.

Another reason to get rid of our use of Storable, in my opinion.

--j.


Re: YOU ARE ON THE WAY TO DESTRUCTION

2004-12-16 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Michael Parker writes:
 On Wed, Dec 15, 2004 at 04:25:29PM -0800, Daniel Quinlan wrote:
  Bugzilla says we can release 3.0.2 so I therefore propose we release 3.0.2.
  
 
 +1 for release, all tests pass on several of my machines.

+1, if we're all clear, let's go for it; I'm not going to hold for 3828
in that case.
(btw I get:

  Like they said at NASA - Better, faster, cheaper - you get to pick two.

appropriate!)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBwNwaMJF5cimLx9ARAizdAJ4mS7zwqf2x977B0HZ1P+bM7uRkwwCgkGxl
b7c17YUNX8XcaUovroTQT4U=
=etJM
-END PGP SIGNATURE-



Re: YOU ARE ON THE WAY TO DESTRUCTION

2004-12-16 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Theo Van Dinter writes:
 It's moot at this point, and I meant to vote -0.5 and not -1 BTW.
 I wanted more of an "I'd rather not yet" versus "NFW".  It was more of
 a shock of nothing at all for weeks about a release, then suddenly,
 the one night I'm not sitting online, a bunch of stuff happens and a
 release occurs.

I suspect something must have happened in IRC.  I wasn't there
either ;)

 Anyway, as for reasoning -- I have been having conversations with the
 Habeas folks to get this code/support into 3.0.2.  Per my last message
 I've already explained how since there wasn't any discussion for weeks
 now wrt a release, there wasn't extreme urgency in doing a code review.
 Had I known there was going to be a release tonight/this week/this
 month, I would have made an effort to free up enough time to do the
 review beforehand.

ouch.  agreed, that's not too hot, but it's really more the fault
of the 3.0.2-vs-Future slip-up...

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBwd0MMJF5cimLx9ARAof2AKCDONL0rxTpUp1PAz235m3+yrG8oQCcC2cT
CUrT/jr91f0fXtjkqqC3Lw4=
=ObJV
-END PGP SIGNATURE-



Re: svn commit: r122529 - /spamassassin/trunk/lib/Mail/SpamAssassin/Reporter.pm

2004-12-16 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


[EMAIL PROTECTED] writes:
 Author: felicity
 Date: Wed Dec 15 22:25:05 2004
 New Revision: 122529
 
 URL: http://svn.apache.org/viewcvs?view=rev&rev=122529
 Log:
 got a syntax error doing reporting.  also, no point in doing regexp since 
 we're looking for explicit strings, just use eq.

what about the newline?
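
The concern, presumably, is that the two forms are not equivalent when
the error string ends in a newline: in a Perl regexp "$" also matches
just before a trailing "\n", while eq compares the whole string.  A
small demonstration, assuming (this is not checked against Reporter.pm)
that the handler dies with "__alarm__\n":

```perl
use strict;
use warnings;

# die() appends " at FILE line N.\n" unless the message already ends in
# a newline, so handlers are often written as  die "__alarm__\n"  and
# $@ then carries the trailing "\n".  (Assumed here, see above.)
my $err = "__alarm__\n";

# $ matches before a final newline, so the old test still matches.
my $regex_matches = ($err =~ /^__alarm__$/) ? 1 : 0;

# eq compares the whole string, newline included, so this does not.
my $eq_matches = ($err eq '__alarm__') ? 1 : 0;

# One way to keep eq: strip a single trailing newline first.
(my $clean = $err) =~ s/\n\z//;
my $eq_after_strip = ($clean eq '__alarm__') ? 1 : 0;

print "regex=$regex_matches eq=$eq_matches eq_after_strip=$eq_after_strip\n";
# prints: regex=1 eq=0 eq_after_strip=1
```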

- --j.

 Modified:
spamassassin/trunk/lib/Mail/SpamAssassin/Reporter.pm
 
 Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Reporter.pm
 Url:
 http://svn.apache.org/viewcvs/spamassassin/trunk/lib/Mail/SpamAssassin/Reporter.pm?view=diff&rev=122529&p1=spamassassin/trunk/lib/Mail/SpamAssassin/Reporter.pm&r1=122528&p2=spamassassin/trunk/lib/Mail/SpamAssassin/Reporter.pm&r2=122529
 ==
 --- spamassassin/trunk/lib/Mail/SpamAssassin/Reporter.pm  (original)
 +++ spamassassin/trunk/lib/Mail/SpamAssassin/Reporter.pm  Wed Dec 15 22:25:05 2004
 @@ -239,9 +239,9 @@
  
if ($err) {
  alarm $oldalarm;
 -if ($err =~ /^__alarm__$/) {
 +if ($err eq '__alarm__') {
        dbg("reporter: pyzor report timed out after $timeout seconds");
 -} elsif ($err /^__brokenpipe__$/) {
 +} elsif ($err eq '__brokenpipe__') {
        dbg("reporter: pyzor report failed: broken pipe");
      } else {
        warn("reporter: pyzor report failed: $err\n");
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBwds7MJF5cimLx9ARAii3AJ9P6ssd2Qbh47kImDy0Ns0w01wxpACeP374
DEDBV1jX/5zg4+qO3fgxCgI=
=uN5K
-END PGP SIGNATURE-



Re: buildbot failure in [...]

2004-12-17 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Malte S. Stretz writes:
 On Friday 17 December 2004 11:36 CET [EMAIL PROTECTED] 
 wrote:
  The Buildbot has detected a new failure of trunk-debian-stable.
 
  Buildbot URL: http://bugzilla.spamassassin.org:8010/
 
  Build Reason: changes
  Build Source Stamp: 112
  Blamelist: quinlan
 
  BUILD FAILED: failed svn
 
 Those messages are getting a bit annoying, is there any way to filter any 
 buildbot message which contains "BUILD FAILED: failed svn" on the server?

no!

(a) they really are failures.  in this case the svn server seems to have
died, which is good to know ;)   The whole point of this is to get
notification of failures.

(b) however the -parker- and -sidney- ones *are* getting annoying. ;)  I
suggest we turn off those slaves until we can figure out how to get
buildbot to work with dynamic-IP slaves...

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBwyH4MJF5cimLx9ARAuMbAJ9SSnez7MSgQtUsq9JlKFnP6t8EEACfYqZo
Cnd/J6zOu6Gqe6h+HHXvKQE=
=X3Rp
-END PGP SIGNATURE-



Re: buildbot failure in [...]

2004-12-17 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Sidney Markowitz writes:
 Justin Mason wrote:
  Sidney, have you tried setting  --keepalive=300
 
 I'll try that. What Michael says does make sense. I'm behind a NAT.
 
 Is there a way of setting a port that the slave listens on? I can 
 configure my NAT to let the slaves be designated servers on some port if 
 I can make it a fixed port and assign a different port number to each of 
 them. I'm sure if it is possible I could find it by RTFM, but I have not 
 had a lot of time to learn about buildbot and twistd.

hmm -- I don't think the slaves *ever* listen on a port -- instead they
open a conn _out_ to the master.

 By the way I have to call twistd directly instead of buildbot in order 
 to get everything to work in Cygwin and Win32. They need the -n option 
 in order to run, and in Win32 I have to give it the -r win32, which I 
 would have expected to be automatic when running a win32 buildbot.
 
 Cygwin command: twistd -l - -n -f ../buildbot.tap
 Win32 command:  twistd -l - -n -r win32 -f ..\buildbot.tap

might be worth signing up to buildbot-devel (it's very low traffic)
and mention that...

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBw0hJMJF5cimLx9ARAtKfAKCCBDuRXE15qvY/xtcCaH5j0IYdDgCdHKAq
CXAnBD9iVkyT8uuiNhIKzDs=
=1otV
-END PGP SIGNATURE-



Re: buildbot failure in [...]

2004-12-17 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Sidney Markowitz writes:
 Malte S. Stretz wrote:
  Does anybody know what exactly goes wrong?  Maybe it could work if we use 
  port forwarding or stunnel or something to route the traffic to the dynamic 
  clients over some server with a static IP?
 
 Here's my last svn failed log. It was on the native machine, and I just 
 discovered that the two slaves on the VMWare virtual machine have been 
 not responding for a couple of days, so it cannot be a matter of 
 simultaneous access on the same machine. I'm going to try to restart them.
 
 Could the svn server be sensitive to too many clients hitting the same 
 repository at the same time? Perhaps it would help to introduce a delay 
 between triggering one slave and the next, or if that is not possible 
 adding a sleep of a random time on the slaves before the svn update.

I doubt that's it.  First off, the svn failed logs were the same on
all slaves as of the last svn checkin -- see

http://bugzilla.spamassassin.org:8010/trunk-red-hat-7.3/builds/89/svn/0
http://bugzilla.spamassassin.org:8010/reqd-modules-only-5.8.1/builds/76/svn/0

both are running on the buildbot master machine as well.  that's just
because the SVN server was borked.

Secondly, I have 4 slaves (a) started simultaneously and (b) hitting
the repo simultaneously, on the buildbot machine.  And if you look at
15:31:38 on Thu Dec 16, you can see 7 slaves hitting svn simultaneously,
and all passing.  So that's not it.

Basically we have:

- buildbot master host, localhost, no NAT: 5 slaves, always pass
- jm: 1 slave, static IP, no NAT: debian-stable, always passes
- parker: 3 slaves, behind NAT: frequent failures
- sidney: 3 slaves, NAT?: frequent failures

I think it's the NAT that causes the issue, and therefore the keepalive
idea is the best bet...
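
To make the keepalive idea concrete, here is a minimal sketch. It assumes that plain TCP keepalive on the slave's outbound socket is enough to stop a NAT box from dropping the idle mapping; buildbot may well need an application-level ping instead. `enable_keepalive` and the timing values are hypothetical, not anything buildbot provides.

```python
import socket

def enable_keepalive(sock, idle=60, interval=30, count=4):
    """Turn on TCP keepalive probes for an already-created socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning knobs are platform-specific (these are the
    # Linux names), so only set the ones this platform actually exposes.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

The idle time would need to be shorter than the NAT device's idle timeout, which we don't know for parker's or sidney's setups.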

BTW bear in mind that the slaves are never connected *to*.  Instead, they
operate by opening a TCP connection to the master at startup, and
receiving commands pushed to them via that.  if that TCP conn dies, they
disappear, and retry connections very slowly, like once every 10 mins with
exponential backoff.
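
The reconnect behaviour described above can be sketched roughly as follows. This is not buildbot's code; `backoff_delays`, `connect_with_retry` and the default numbers are assumptions for illustration, with the 600-second cap loosely modelled on the "once every 10 mins" figure.

```python
import time

def backoff_delays(initial=1.0, factor=2.0, maximum=600.0):
    """Yield retry delays that grow geometrically up to a fixed cap."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)

def connect_with_retry(connect, max_attempts=5, sleep=time.sleep, **backoff):
    """Call connect() until it succeeds, sleeping longer after each failure."""
    delays = backoff_delays(**backoff)
    for _ in range(max_attempts):
        try:
            return connect()
        except OSError:
            # The master is unreachable (or the NAT mapping died): wait, retry.
            sleep(next(delays))
    raise ConnectionError("gave up after %d attempts" % max_attempts)
```

The point is that once the TCP connection dies, nothing on the master side can bring the slave back; it has to wait for the slave's next retry.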

- --j.

   -- sidney
 
 The log:
 
 starting svn operation
 command '['svn', 'update', '--revision', '122631']' in dir 
 /b/home/buildbot/slaves/sidney-fedora3/trunk-sidney-fedora3/build 
 (timeout 1200 secs)
 svn: PROPFIND request failed on '/repos/asf/spamassassin/trunk'
 svn: PROPFIND of '/repos/asf/spamassassin/trunk': Could not read status 
 line: connection was closed by server. (http://svn.apache.org)
 update failed, clobbering and trying again
 command '['rm', '-rf', 
 '/b/home/buildbot/slaves/sidney-fedora3/trunk-sidney-fedora3/build']' in 
 dir /b/home/buildbot/slaves/sidney-fedora3/trunk-sidney-fedora3 (timeout 
 1200 secs)
 now retrying VC operation
 command '['svn', 'checkout', '--revision', '122631', 
 'http://svn.apache.org/repos/asf/spamassassin/trunk', 'build']' in dir 
 /b/home/buildbot/slaves/sidney-fedora3/trunk-sidney-fedora3 (timeout 
 1200 secs)
 svn: PROPFIND request failed on '/repos/asf/spamassassin/trunk'
 svn: PROPFIND of '/repos/asf/spamassassin/trunk': Could not read status 
 line: connection was closed by server. (http://svn.apache.org)
 program finished with exit code 1
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBw1XoMJF5cimLx9ARAmbsAJ0QFRYByCiQ4WY6K47wN/E7wxru0ACeOHNj
JTOK7lD2BWBdKwyF7DPs0sM=
=xqJz
-END PGP SIGNATURE-



Re: buildbot failure in [...]

2004-12-17 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 [EMAIL PROTECTED] (Justin Mason) writes:
 
  Well, I think failed svn is something that all build failures produce.
  Even if the problem is a bug, rather than an svn timeout...
 
 I think we should remove (a) all of the NATed slaves and (b) any build
 server that can't reliably connect to the server.  I'm already ignoring
 all failures, so the purpose of the build system has been completely
 lost.  It's more important to have reliable build hosts than maintain
 the excessive build host diversity that we have right now.

agreed, to be honest...  Sidney, Michael, what do you think?

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBw2HuMJF5cimLx9ARAgvTAKCAR41dM/Ch8Ug0FG0acfWeHOpRHQCfZPQY
RmmeWBs/GxxEnow3wJ6NhJo=
=0BQo
-END PGP SIGNATURE-



Re: RFC: New Plugin Hook

2004-12-18 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


makes sense to me.   I'd (a) expand the doco, and (b) use a better
name than verify_user for the method, as it took a while for me
to grok it.

rather than verify_user, how's about service_acl_allows_username or
similar?

- --j.

Michael Parker writes:
 Howdy,
 
 I was looking at a possible solution to:
 http://bugzilla.spamassassin.org/show_bug.cgi?id=3215
 
 and decided that it could be done pretty easily if I had a new plugin
 hook.  So I created one, and I wanted to get y'all's opinion on it before I
 went forward.
 
 The new plugin is verify_user which takes a services hash and a
 username as input.  The plugin is responsible for a) making sure it is
 supposed to handle one of the passed in services and b) that the
 username is allowed/whatever to use the service.
 
 Please see the enclosed diff and sample plugin that implements the
 feature.  Obviously, this is a bayessql specific case, but I could see
 it being used in other areas of the code.  You could have multiple
 plugins, that handled different authorization methods.
 
 Comments?
 
 Michael
 
 Here is the diff:
 Index: lib/Mail/SpamAssassin/BayesStore/SQL.pm
 ===
 --- lib/Mail/SpamAssassin/BayesStore/SQL.pm   (revision 122598)
 +++ lib/Mail/SpamAssassin/BayesStore/SQL.pm   (working copy)
 @@ -140,7 +140,7 @@
}
  
unless ($self->_initialize_db()) {
 -dbg("bayes: database entry for ".$self->{_username}." not found");
 +dbg("bayes: unable to initialize database for ".$self->{_username}." user, aborting!");
  $self->untie_db();
  return 0;
}
 @@ -1733,6 +1733,20 @@
  
return 0 if (!$self->{_username});
  
 +  # Check to see if we should call the verify_user plugin hook to see if this
 +  # user is allowed/able to use bayes.  If not, do nothing and return 0.
 +  if ($self->{bayes}->{conf}->{bayes_sql_verify_user}) {
 +my $services = { 'bayessql' => 0 };
 +$self->{bayes}->{main}->call_plugins("verify_user", { services => $services,
 +   username => $self->{_username},
 + });
 +
 +unless ($services->{bayessql}) {
 +  dbg("bayes: username not verified by verify_user plugin");
 +  return 0;
 +}
 +  }
 +
my $sqlselect = "SELECT id FROM bayes_vars WHERE username = ?";
  
my $sthselect = $self->{_dbh}->prepare_cached($sqlselect);
 Index: lib/Mail/SpamAssassin/Plugin.pm
 ===
 --- lib/Mail/SpamAssassin/Plugin.pm   (revision 122598)
 +++ lib/Mail/SpamAssassin/Plugin.pm   (working copy)
 @@ -219,6 +219,34 @@
  
  =back
  
 +=item $plugin->verify_user ( { options ... } )
 +
 +=over 4
 +
 +=item services
 +
 +Reference to a hash containing the services you want to check.
 +
 +In order to verify a user, the plugin should first check that the
 +service it is handling exists in the hash and then set the value
 +of the service to a positive value if the username is verified/validated
 +for that service.
 +
 +The current supported services are:
 +
 +=over 4
 +
 +=item bayessql
 +
 +=back
 +
 +
 +=item username
 +
 +A username
 +
 +=back
 +
  =item $plugin->check_start ( { options ... } )
  
  Signals that a message check operation is starting.
 Index: lib/Mail/SpamAssassin/Conf.pm
 ===
 --- lib/Mail/SpamAssassin/Conf.pm (revision 122598)
 +++ lib/Mail/SpamAssassin/Conf.pm (working copy)
 @@ -2719,6 +2719,28 @@
  type => $CONF_TYPE_STRING
});
  
 +=item bayes_sql_verify_user (0 | 1)  (default: 0)
 +
 +Whether to call the verify_user plugin hook in BayesSQL.  If the hook
 +does not determine that the user is allowed to use bayes or is invalid
 +then the database will not be initialized.
 +
 +NOTE: By default the user is considered invalid until a plugin returns
 +a true value.  If you enable this, but do not have a proper plugin
 +loaded, all users will turn up as invalid.
 +
 +The username passed into the plugin can be affected by the
 +bayes_sql_override_username config option.
 +
 +=cut
 +
 +  push (@cmds, {
 +setting => 'bayes_sql_verify_user',
 +is_admin => 1,
 +default => 0,
 +type => $CONF_TYPE_BOOL
 +  });
 +
  =item user_scores_dsn DBI:databasetype:databasename:hostname:port
  
  If you load user scores from an SQL database, this will set the DSN
 
 Here is the sample plugin:
 package Mail::SpamAssassin::Plugin::VerifyUser;
 
 =pod
 
 This is a sample plugin, it may not work at all, so buyer beware.
 
 It also uses an experimental plugin hook, that may or may not be
 supported.
 
 The groupfile for this feature looks something like:
 
 bayessql: parker foobar1 foobar2
 
 =cut
 
 use Mail::SpamAssassin::Plugin;
 use strict;
 use bytes;
 
 use Apache::Htgroup;
 
 use vars qw(@ISA);
 @ISA = qw(Mail::SpamAssassin::Plugin);
 
 use constant GROUPFILE = 

Re: RFC: New Plugin Hook

2004-12-18 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Michael Parker writes:
 On Fri, Dec 17, 2004 at 05:38:26PM -0800, Justin Mason wrote:
  
  ok -- service_allowed_for_username -- there's only one service for
  each call. ;)
  
 
 Why put that sort of restriction?
 
 what if I wanted something like:
 
 $services = { 'bayessql' => 0,
   'awl' => 0,
   'awlsql' => 0,
   'allow_user_rules' => 0,
   'etc' => 1 }
 
 I've implemented it as a single service in BayesSQL but there is no
 reason why you couldn't move the plugin call to a higher level and
 pass in ALL of the services you are interested in.

ah, missed that.  ok, makes sense. that should probably be called
out specifically in the doco...
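
To spell out the multi-service contract for the doco, here is a small model of it. This is a Python sketch of the idea only, not the SpamAssassin plugin API: `GroupFilePlugin`, `call_plugins` and `allowed_services` are hypothetical names, and the group data stands in for the sample plugin's groupfile.

```python
def call_plugins(plugins, hook, options):
    """Invoke the named hook on every plugin that implements it."""
    for plugin in plugins:
        handler = getattr(plugin, hook, None)
        if handler is not None:
            handler(options)

class GroupFilePlugin:
    """Hypothetical plugin: a user is allowed a service if listed for it."""
    def __init__(self, groups):
        self.groups = groups  # e.g. {"bayessql": {"parker", "foobar1"}}

    def verify_user(self, options):
        services = options["services"]
        username = options["username"]
        for service, members in self.groups.items():
            # Only touch services the caller asked about and we handle.
            if service in services and username in members:
                services[service] = 1

def allowed_services(plugins, username, wanted):
    """Start every service at 0 (denied) and let plugins flip it to allowed."""
    services = {name: 0 for name in wanted}
    call_plugins(plugins, "verify_user",
                 {"services": services, "username": username})
    return {name for name, ok in services.items() if ok}
```

Note the default-deny behaviour: with no plugin loaded, every service stays at 0, which matches the NOTE in the proposed Conf.pm text.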

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBw40DMJF5cimLx9ARAg+FAJwMj5den+U4I/bZTNvAklNewwDaOwCeJM9t
mMa/L9IbllFxsnP4Ykx3fKE=
=/Tnl
-END PGP SIGNATURE-



Re: RFC: New Plugin Hook

2004-12-18 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Michael Parker writes:
 On Fri, Dec 17, 2004 at 04:41:51PM -0800, Justin Mason wrote:
  
  makes sense to me.   I'd (a) expand the doco, and (b) use a better
  name than verify_user for the method, as it took a while for me
  to grok it.
  
  rather than verify_user, how's about service_acl_allows_username or
  similar?
  
 
 Oops, missed the whole "what this thing does" blurb in the POD.
 
 I'm horrible at naming things, how about
 services_allowed_for_username?

ok -- service_allowed_for_username -- there's only one service for
each call. ;)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBw4oSMJF5cimLx9ARAhMtAJwN4GSgjguoknqE5xN7N+pzh1CpUgCfR3od
9gPZvN1mY7cG9TnmawXKVWc=
=QVqw
-END PGP SIGNATURE-



Re: svn commit: r124477 - /spamassassin/trunk/lib/Mail/SpamAssassin/EvalTests.pm /spamassassin/trunk/rules/20_body_tests.cf /spamassassin/trunk/rules/70_testing.cf

2005-01-07 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


[EMAIL PROTECTED] writes:
 Author: quinlan
 Date: Fri Jan  7 00:06:07 2005
 New Revision: 124477
 
 URL: http://svn.apache.org/viewcvs?view=rev&rev=124477
 Log:
 promote T_BAD_ISO_CHARSET to MIME_BAD_ISO_CHARSET, but convert it to an
 eval function to avoid using a full test

we should really figure out some way to expose those in-body MIME headers
in a new rule type...
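
For anyone skimming the diff below, the check it adds boils down to the following. This is a Python restatement of the two regexps in the commit, not the Perl itself; `is_bad_iso_charset` is a made-up name, and it assumes the charset string has already been lowercased, since the second pattern in the commit has no /i flag.

```python
import re

# Anything that looks like "iso-<something>-<something>".
ISO_CHARSET = re.compile(r"iso-\S+-\S+\b", re.IGNORECASE)
# The ISO charsets the commit treats as legitimate (case-sensitive, as committed).
KNOWN_ISO = re.compile(r"iso-(?:8859-\d{1,2}|2022-(?:jp|kr))\b")

def is_bad_iso_charset(charset):
    """True if the charset claims to be ISO but is not a known ISO charset."""
    return bool(ISO_CHARSET.search(charset)) and not KNOWN_ISO.search(charset)
```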

- --j.

 Modified:
spamassassin/trunk/lib/Mail/SpamAssassin/EvalTests.pm
spamassassin/trunk/rules/20_body_tests.cf
spamassassin/trunk/rules/70_testing.cf
 
 Modified: spamassassin/trunk/lib/Mail/SpamAssassin/EvalTests.pm
 Url:
 http://svn.apache.org/viewcvs/spamassassin/trunk/lib/Mail/SpamAssassin/EvalTests.pm?view=diff&rev=124477&p1=spamassassin/trunk/lib/Mail/SpamAssassin/EvalTests.pm&r1=124476&p2=spamassassin/trunk/lib/Mail/SpamAssassin/EvalTests.pm&r2=124477
 ===
 --- spamassassin/trunk/lib/Mail/SpamAssassin/EvalTests.pm (original)
 +++ spamassassin/trunk/lib/Mail/SpamAssassin/EvalTests.pm Fri Jan  7 00:06:07 2005
 @@ -2353,6 +2353,12 @@
  $self->{mime_base64_no_name} = 1;
}
  
 +  if ($charset =~ /iso-\S+-\S+\b/i &&
 +  $charset !~ /iso-(?:8859-\d{1,2}|2022-(?:jp|kr))\b/)
 +  {
 +$self->{mime_bad_iso_charset} = 1;
 +  }
 +
# MIME_BASE64_LATIN: now a zero-hitter
# if (!$name &&
# $cte =~ /base64/ &&
 @@ -2414,7 +2420,7 @@
   || ($name eq xls  $ctype !~ [EMAIL PROTECTED]/.*excel$@)
 )
  {
 -   $self->{mime_suspect_name} = 1;
 +  $self->{mime_suspect_name} = 1;
  }
}
  }
 
 Modified: spamassassin/trunk/rules/20_body_tests.cf
 Url:
 http://svn.apache.org/viewcvs/spamassassin/trunk/rules/20_body_tests.cf?view=diff&rev=124477&p1=spamassassin/trunk/rules/20_body_tests.cf&r1=124476&p2=spamassassin/trunk/rules/20_body_tests.cf&r2=124477
 ===
 --- spamassassin/trunk/rules/20_body_tests.cf (original)
 +++ spamassassin/trunk/rules/20_body_tests.cf Fri Jan  7 00:06:07 2005
 @@ -123,6 +123,9 @@
  body MPART_ALT_DIFF_COUNT eval:multipart_alternative_difference_count('3', '1')
  describe MPART_ALT_DIFF_COUNT HTML and text parts are different
  
 +body MIME_BAD_ISO_CHARSET eval:check_for_mime('mime_bad_iso_charset')
 +describe MIME_BAD_ISO_CHARSET MIME character set is an unknown ISO charset
 +
  ###
  
  body CHARSET_FARAWAY eval:check_for_faraway_charset()
 
 Modified: spamassassin/trunk/rules/70_testing.cf
 Url:
 http://svn.apache.org/viewcvs/spamassassin/trunk/rules/70_testing.cf?view=diff&rev=124477&p1=spamassassin/trunk/rules/70_testing.cf&r1=124476&p2=spamassassin/trunk/rules/70_testing.cf&r2=124477
 ===
 --- spamassassin/trunk/rules/70_testing.cf (original)
 +++ spamassassin/trunk/rules/70_testing.cf Fri Jan  7 00:06:07 2005
 @@ -354,11 +354,4 @@
  
  
  
 -# bug 4054: contributions from Maxime Ritter (airmax.cf)
 -
 -# only works on full, may be better to check in Message object for this
 -full __ISO_VALID /charset=\?iso-(?:8859-\d{1,2}|2022-(?:jp|kr))\b/i
 -full __ISO_CHARSET   /charset=\?iso-\S+-\S+\b/i
 -meta T_BAD_ISO_CHARSET   (__ISO_CHARSET && !__ISO_VALID)
 -
  body T_NORMAL_HTTP_TO_IP eval:check_numeric_http()
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB3t6EMJF5cimLx9ARAh0CAJ9UL1xcUI/yBjRzgE63oAXdyflc8gCcD0NC
FtfNG2YkwDEO6I7zMNzoygY=
=01eO
-END PGP SIGNATURE-



Re: svn commit: r124477 - /spamassassin/trunk/lib/Mail/SpamAssassin/EvalTests.pm /spamassassin/trunk/rules/20_body_tests.cf /spamassassin/trunk/rules/70_testing.cf

2005-01-07 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 [EMAIL PROTECTED] (Justin Mason) writes:
 
  we should really figure out some way to expose those in-body MIME headers
  in a new rule type...
 
 I was thinking the same thing.

oh good, so you've changed your mind since
http://bugzilla.spamassassin.org/show_bug.cgi?id=3781#c3  then ;)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB3vX1MJF5cimLx9ARAqTyAJ0ZndmkmF/cHzTpWZ3FESQKr/wydgCfZfpa
zZ+TYtYtFoXTZW27fS2Rfms=
=yNd+
-END PGP SIGNATURE-



Re: rules needing work

2005-01-10 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 Rules with the largest RANK drops from 3.0 to now.  The body and
 Subject: ones probably need the most work.  The rest are probably lost
 causes.

I suggest we drop them, if their RANK is sufficiently low and nobody steps
up to fix them; it's about time some rules (at least body rules that is)
finally got deleted from the default ruleset ;)

I don't have strong feelings about any apart from ALL_TRUSTED.

- --j.

 broken!
 
 -0.17 ALL_TRUSTED
 
 work needed:
 
 -0.32 UNIVERSITY_DIPLOMAS
 -0.26 STOCK_PICK
 -0.18 STOCK_ALERT
 -0.15 SUBJECT_DRUG_GAP_S
 -0.14 STRONG_BUY
 -0.14 DEEP_DISC_MEDS
 -0.13 DRUGS_PAIN
 -0.11 REVERSE_AGING
 -0.11 BANG_OPRAH
 -0.1  NO_CREDIT_CHECK
 -0.1  WE_HONOR_ALL
 
 not sure:
 
 -0.34 HTML_NONELEMENT_50_60
 -0.23 HELO_DYNAMIC_OOL
 -0.21 HTML_BADTAG_20_30
 -0.2  MIME_HTML_ONLY_MULTI
 -0.2  HTML_NONELEMENT_40_50
 -0.19 HTML_FONT_SIZE_NONE
 -0.19 HTML_FONT_SIZE_TINY
 -0.19 HTML_BADTAG_30_40
 -0.18 HTML_BADTAG_90_100
 -0.16 HDR_ORDER_TRIMRS
 -0.16 X_ORIG_IP_NOT_IPV4
 -0.11 HTML_NONELEMENT_90_100
 -0.11 HELO_DYNAMIC_ATTBI
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB4gH+MJF5cimLx9ARAqnmAJ4k187nq9W0n0BJW5+rD5ig69FUDgCgjD/z
s1PNnWjFZWB1Q8+oD2rnKUk=
=3jnH
-END PGP SIGNATURE-



Re: initial analysis of SPF_PASS results

2005-01-10 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 First, large ISPs seem to be the origination point for a *lot* of spam.

Large ISPs' outbound relays, or direct from their dynamic pools?
e.g. blueyonder.co.uk list their dyn pools in their SPF record,
which is unfortunate but legal.

 Second, here's my list of the domains we could potentially whitelist for
 SPF_PASS results (high count, good ratio, not biased towards open source
 folks).
 
 0.  90  health.webmd.com
 0.  27  foolsubs.com
 0.  23  ms3.lga2.nytimes.com (list *.nytimes.com ?)
 0.  17  match.com
 0.  9   paypal.com

+1 -- I can go for that.

(Worth noting that I *don't* think we should also apply the converse,
treating mails from those doms that don't fix the SPF record as forged;
we'd need to do separate analysis on that.)

 For a different and even less biased approach, I took the listings with
 0.01 or lower S/O ratio and ranked them by SenderBase volume (entries
 above 6.0 on the volume scale).  Note that I just extracted
 registrar-level domain names from the SPF domain lists, so some of these
 are definitely not completely clean or are not immediately
 whitelistable.
 
 domain             volume  whitelist?
 -----------------  ------  ----------
 ebay.com           7.5     yeah
 amazon.com         6.7     yeah
 speakeasy.net      6.6
 paypal.com         6.6     yeah
 msn.com            6.6
 roving.com         6.5
 nytimes.com        6.5     yeah
 m0.net             6.5
 classmates.com     6.5
 exacttarget.com    6.4
 sparklist.com      6.2
 sourceforge.net    6.1
 securityfocus.com  6.1
 spamarrest.com     6.0
 rm04.net           6.0
 redhat.com         6.0
 foolsubs.com       6.0     yeah
 bluehornet.com     6.0
 
 So, based on all that, I'm thinking we could experimentally add SPF_PASS
 whitelists for:
 
   ebay.com
   amazon.com
   paypal.com
   nytimes.com
   foolsubs.com
   webmd.com
   match.com
 
 I checked NANAE and the above domains seem to be pretty clean and this
 jives with my recollection.

+1.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB4gLNMJF5cimLx9ARAn3CAKC7V80ycFkJrP+8bE3oP2T85VQ4NwCgi5t6
GdGMdM89ze4fvC/9l/uDdJ0=
=jXd3
-END PGP SIGNATURE-



Re: Target Milestone of Future is harmful

2005-01-10 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Thomas Schulz writes:
 I would like to suggest that having a Target Milestone of Future for a
 bug is harmful.  It was probably necessary when you were trying to get
 3.0.0 out and you were not sure what the next version number would be,
 but now it seems to be a way for a bug to fall into a black hole.  It
 seems that if a bug is not grabbed by someone within a few hours of
 being submitted, it is lost.

It's a manageability thing.  We don't have someone who can sit
there continually reprioritising bugs :(

I suggest that if you have bugs with TM set to Future, and you think
they're implementable sooner ;) -- feel free to post a comment and pipe
up. In particular, getting a patch that implements the feature is a *lot*
more likely to get a bug a solid milestone.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB4sycMJF5cimLx9ARAkTQAJ9rgwfZb2/vfyt9fjkNc5McdUdRCwCgifvP
0o8X6l0A6wBmqck+mU2Hh/E=
=b1up
-END PGP SIGNATURE-



Re: [Bug 4072] SPF_PASS false match

2005-01-10 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
  Do we want to fix this for 3.0.3, or just leaving it for 3.1?
 
 I'm okay with backporting, but I think we're nearing the point at which
 we buckle down on 3.1 and focus on it.  The tree is remarkably stable
 right now and there are a number of significant improvements, so I'm
 starting to think about actually sticking to the aggressive schedule
 that was proposed a few months ago.  :-)

+1.   I think we should only do a 3.0.3 if something serious (security,
data loss, etc.) comes up.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB4v/QMJF5cimLx9ARAgkkAJ0SuiBX6eOw2mWHKXZw6K3FR603ugCdE+kW
ekq0zXOQGaGEGuYaMZo7adk=
=YWNa
-END PGP SIGNATURE-



Re: IP_IN_RESERVED_RANGE = IP_PRIVATE

2005-01-11 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 I'd like to make IP_IN_RESERVED_RANGE go away.  In an ideal world, but I
 know Justin will object so I won't propose it, I would nuke it.  Since
 it's possible some poor unsuspecting third-party plugin is using it in
 the same broken way as our code was just yesterday, I propose we merely
 set it equal to the new IP_PRIVATE constant.
 
 If you read the comment:
 
 # Initialize a regexp for reserved IPs, i.e. ones that could be
 # used inside a company and be the first or second relay hit by
 # a message. Some companies use these internally and translate
 # them using a NAT firewall. These are listed in the RBL as invalid
 # originators -- which is true, if you receive the mail directly
 # from them; however we do not, so we should ignore them.
 
 That's how it's defined anyway -- an internal address.
 
 Does that sound okay?

+1
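
For illustration, the "internal address" notion in that comment might look like the sketch below. The ranges are an assumption (RFC 1918 plus loopback); the actual contents of IP_PRIVATE are not shown in this thread, and `is_private_ip` / `first_public_relay` are hypothetical helpers.

```python
import ipaddress

# Assumed ranges: RFC 1918 private space plus loopback. The real
# IP_PRIVATE constant may cover more (or fewer) networks than this.
PRIVATE_NETS = [ipaddress.ip_network(n) for n in
                ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8")]

def is_private_ip(addr):
    """True if addr falls inside one of the assumed internal ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PRIVATE_NETS)

def first_public_relay(relays):
    """Skip internal hops and return the first relay worth looking up."""
    for addr in relays:
        if not is_private_ip(addr):
            return addr
    return None
```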
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB4yRIMJF5cimLx9ARAkCAAJ9vNTJlev+8pMy3GzxFlQP8lntCDQCfQxVN
L9k1HDxGShMVwoitylFZ5jY=
=xmiE
-END PGP SIGNATURE-



Re: svn commit: r124477 - /spamassassin/trunk/lib/Mail/SpamAssassin/EvalTests.pm /spamassassin/trunk/rules/20_body_tests.cf /spamassassin/trunk/rules/70_testing.cf

2005-01-11 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Loren Wilton writes:
  oh good, so you've changed your mind since
  http://bugzilla.spamassassin.org/show_bug.cgi?id=3781#c3  then ;)
 
 Somewhat.  I still think it should be a plugin.
 
 There's a problem with plugins I hadn't realized when they were
 originally being advertized as the universal solution to oddball
 rules.  The problem is that they aren't.
 
 Anyone can write a jive rule, if allow_user_rules is set.  But nobody
 but the system administrator can install a plugin.  And it seems that
 even invocations of a plugin aren't supposed to show up in the
 user_prefs file, even with allow_user_rules.  So to be useful here
 (for the general case, which is what interests me) this would have to
 be a plugin that effectively exported a new rule base name, and the
 plugin would then take a general re against that base type.
 
 Which is the same as inventing a new rule base type, except that not as many 
 people will be able to use it.

I'm not sure what you mean here.  could you add some examples?

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB4yycMJF5cimLx9ARAhEuAJ4klZ5AO0iSpMPZ2UtESkN26xX+iACgtyyG
6uGqWL2y8o0ozYhB5hnrjQE=
=J/yK
-END PGP SIGNATURE-



Re: BZ box being hammered

2005-01-17 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Theo Van Dinter writes:
 I was noticing that the BZ box gets hammered, at the moment due to buildbot:
 If we're going to run buildbot on there, it should at least be done in serial
 and not parallel.

not sure how easy that is :(   we could add some hackery to the
buildbot-slave configuration...

 I'm also not sure that it's the best box for a mass-check either, but they're
 having issues at the moment it looks like:

ick.   I need a new box ;)

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB6wyIMJF5cimLx9ARApfEAJ9zPp6o1U2nZWLICmXM5dDgbcKavgCaAlaJ
Oy12Tk0TVAb0kRfjgaMxZqQ=
=/eCF
-END PGP SIGNATURE-



Re: svn commit: r125369 - /spamassassin/trunk/rules/70_scraped.cf

2005-01-17 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 Actually, can I suggest a different naming convention?
 
   T_<bug number>_<test name>
 
 For example,
 
   T_4081_FOO_BAR
 
 Short and easy to look up.  Also, does your code handle naming for
 new/overlapping/existing predicates for meta rules?

it's a hornet's nest ;)

- - 1. T_ prefix will overlap with our own T_ prefix.  I'd prefer a new
  prefix to keep them separate. T_MC_ might work, but I used MC_ here.
  some votes on what people would prefer are welcome ;)

- - 2. there may be multiple revisions of rules and predicates with the same
  name inside one bug number, as in bug 2243, so it'd need scoping by both
  bug number and comment number; scoping by bug number alone will fail in
  this situation.

- - 3. there may be new rules using *existing* rules as meta predicates, so
  it can't rename all rulenames found; just the ones where the rules are
  defined in that comment's ruleset.

Current code does all those 3.  However it does have a failing:

- - It cannot deal with the case (as in bug 2243 comment 14) where a set of
  new rules are posted that use rules from a previous comment (cmt 13) as
  predicates. However I can't see a good way to deal with that case,
  without breaking the case where a new comment revises a ruleset from an
  earlier comment, without suffering rule name clashes. So I think that's
  an acceptable limitation.

Finally, naming:

Due to our absurd rule-name-length limitation policy ;), we cannot do the
sensible thing, which would indeed be:

  MC_{bugnum}_{cmtnum}_{rulename}

but current naming scheme is:

  MC_{rulename}_{rnd}

where {rnd} is a 3-digit random number.   (the idea is that hopefully
the rulename will be short enough to scrape past the length limit,
since --lint is used to ensure rules are valid before they're checked
into 70_scraped.cf.)
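
To make the two schemes and the "only rename what this comment defines" rule concrete, here is a rough sketch. It is not the automc code; the function names are made up, and the identifier regexp is a simplifying assumption about what a rule name looks like.

```python
import random
import re

def scoped_name(rulename, bugnum, cmtnum):
    """The sensible scheme: scope by bug number and comment number."""
    return "MC_%s_%s_%s" % (bugnum, cmtnum, rulename)

def current_name(rulename, rng=random):
    """The current scheme: short name plus a 3-digit random suffix."""
    return "MC_%s_%03d" % (rulename, rng.randrange(1000))

def rename_in_expression(expr, mapping):
    """Rewrite a meta expression, renaming only rules present in mapping.

    Rules defined elsewhere (existing rules used as predicates) are
    left untouched, as in point 3 above.
    """
    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b",
                  lambda m: mapping.get(m.group(0), m.group(0)),
                  expr)
```

For example, with only `__FOO` defined in the comment's ruleset, `(__FOO && !ALL_TRUSTED)` has `__FOO` renamed and `ALL_TRUSTED` left alone.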

My current plans are to fix this by removing that stupid limit on rulename
and description lengths.   They've caused WAY, *WAAAY* more problems than
they solved and I'm sick to death of them! :(  Some sensible wrapping
code would be simpler, and save EVERYONE a lot of trouble.

Once I do that I'll fix automc to use the sensible naming scheme.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB6w7mMJF5cimLx9ARAvGyAKCa1YzsmAiNZUvYWpz37TxRpaxuFQCghjok
jsnMuww1KTH67V5PWOlXF3c=
=8MAI
-END PGP SIGNATURE-



another bz error

2005-01-17 Thread Justin Mason
on committing comment 11 to bug 4058:

Internal Error

Bugzilla has suffered an internal error. Please save this page and send it to 
dev@spamassassin.apache.org with details of what you were doing at the time 
this message appeared.

URL: http://bugzilla.spamassassin.org/process_bug.cgi
undef error - Can't find param named messageid at Bugzilla/Config.pm line 150. 

--j.


Re: svn commit: r125477 - /spamassassin/trunk/rules/70_scraped.cf

2005-01-19 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 Justin, can you change a few things in addition to the names?
 
   - check-in via some role account

asking infrastructure about this...

   - don't rename the tests every time - this is maddening

ok, let's see what I can do there.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB7cdKMJF5cimLx9ARAoUaAKClwGO5Kjp9VWebu5z9wm+xtpJfyQCfXT8O
CI9yFfaaKm1TG335tTM/UkY=
=vLHh
-END PGP SIGNATURE-



Re: real-time network results

2005-01-19 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 Theo Van Dinter [EMAIL PROTECTED] writes:
 
  Yeah, that mostly sums up my feelings.  The current RBL information tells
  us when a positive lookup occurs, but not when a negative lookup occurs.
 
 True, you have to assume a negative lookup if it doesn't show and the
 reuse mapping indicates it was present.  I'll provide a way for people
 to disable reuse for rules that they normally don't run with.

something in the mass-check user_prefs file, maybe.

 Note that even if some of those non-hits are due to downtime or timeouts
 or whatever, those *should* be considered as the realtime result since
 they affect accuracy.

yes, that's very true.

  I'd really like to have RBL record all queries made and the results
  thereof, then all the issues above go away -- name changes and logic
  changes just look at the cached result, rule additions w/out cached
  result cause lookups at run-time as they are now.
 
 Maybe, but that is still off in the future.  Huge delay to get that
 throughout all mail.  We get 97% with names and 99% with names and
 dates.

a case of the best being the enemy of the good, I think.

- --j.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB7gLOMJF5cimLx9ARArgpAJ9tIaSUzsmPSj0TTno1Q2y+25uvvgCeO2tE
VYbl8oz9+dSwE2ysI8ulgDw=
=cQZO
-END PGP SIGNATURE-



results from some DK testing

2005-01-20 Thread Justin Mason
So here's a quick look at some DomainKeys rule freqs, from a quick
mass-check of the last ~10k ham and ~10k spam in my corpus (mass-check
--tail  -j=8 --net --rules '^DK'):

OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
  19991      9998     9993    0.500   0.00    0.00  (all messages)
100.000   50.0125  49.9875    0.500   0.00    0.00  (all messages as %)
  5.783    0.0500  11.5181    0.004   1.00    0.00  DK_SIGNED
  0.375    0.0100   0.7405    0.013   0.33   -0.10  DK_VERIFIED
  0.000    0.0000   0.0000    0.500   0.33    0.00  DK_POLICY_SIGNALL
  5.613    6.8714   4.3530    0.612   0.00    0.00  DK_POLICY_SIGNSOME
  4.972    6.3013   3.6425    0.634   0.00    0.00  DK_POLICY_TESTING
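
As a sanity check on the S/O column: assuming S/O is simply spam% / (spam% + ham%), with 0.5 when a rule hits nothing, the figures above can be reproduced as below. That formula is my reading of the numbers, not something taken from the hit-frequencies code, and `s_o_ratio` is a made-up helper.

```python
def s_o_ratio(spam_pct, ham_pct):
    """Spam hits as a fraction of all hits; 0.5 when the rule never hits."""
    total = spam_pct + ham_pct
    if total == 0:
        return 0.5
    return spam_pct / total
```

A value near 0 means the rule hits almost only ham, and a value near 1 means almost only spam.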


Some notes:

- DK_SIGNED means the message had a DK signature.  DK_VERIFIED
  means that it passed.  most of the failures are due to the
  various crud added to all messages in my corpus, such as:

  - SpamAssassin markup.  we have a bug open to move this to the start of
the headers, instead of the end, which will fix this.  However we may
have to hack a way to ignore those hdrs in the DK plugin, in existing
corpora, otherwise mass-check figures will be really crappy (as
above).

  - other crud added: 'Status', 'X-UID', 'X-Keywords' (all added by my
IMAP server), and 'X-MH-Thread-Markup' (added by my mhthread
script).

  Problem is, the approach in most DK records (and the recommended style
  of signature in the draft, iirc) is to sign everything *below* the
  signature point, on the assumption that further transitions from the
  sender to the receiver will only ever *prepend* headers to the existing
  set, and that the verification will take place inside the recipient's
  external-MX MTA.   My mail has already been through a variety of
  MTAs and both ends of an MDA.

  FWIW, GMail's DK record takes a more IIM-ish approach of signing a
  specific set of important headers like From, Subject, To et al., so
  virtually all of the DK_VERIFIED hits are from GMail.

- so far DK_SIGNED's a great ham sign on its own (not that I'm
  suggesting we should use that, of course).  the 4 spam mails look like
  they'd pass verification -- they're 419 spams sent by hand through
  yahoo and gmail's webmail interfaces.  (yes, they do these by hand.)

- obviously, a rule for "DK verification failed", ie. (DK_SIGNED &&
  !DK_VERIFIED), would make a lousy anti-spam rule -- it's hitting
  almost all ham here.   that may clear up a bit if we can figure
  out a way to deal with the "headers appended in passage" issue,
  but possibly not a whole lot, given the fact that DK sigs are
  broken by mailing lists appending footers to the body etc.

- in terms of rules, (DK_SIGNED && DK_VERIFIED &&
  DOMAIN_IN_SOME_WHITELIST_OR_ANOTHER) seems like the most likely
  aim.  but we'll need to figure out how to fix those header-manglings
  to get the hitrate anywhere useful.  (0.74% isn't really worth
  a DNS lookup.)

- the DK_POLICY ones are to get an idea of what people are publishing
  in their DK records.   looks like nobody's yet saying "we sign
  all outbound mail" ;)

- speeds of scans using just the DK rules, in spam:

   4693 0
   3722 1
    728 2
    472 3
    190 4
    110 5
     56 6
      9 7
      8 8
     10 9

and in ham:

   6382 0
   2338 1
    799 2
    349 3
    107 4
     11 5
      7 6

(generated with perl -pe 's/^.*scantime=//; s/,.*$//;' ham.log  | sort |uniq -c)
so it's reasonably fast.  (a single DNS lookup takes place on every message.)
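
For anyone who'd rather not use the shell pipeline, a rough Python equivalent is sketched below. It assumes each log line carries a `scantime=N` field followed by a comma or end of line; unlike the perl one-liner, lines without that field are skipped rather than passed through. `scantime_histogram` is a made-up name.

```python
import re
from collections import Counter

# Greedy ".*" mirrors the perl substitution: the last scantime= on a line wins.
SCANTIME = re.compile(r".*scantime=([^,\s]*)")

def scantime_histogram(lines):
    """Count how many messages took each scantime value."""
    counts = Counter()
    for line in lines:
        m = SCANTIME.match(line)
        if m:
            counts[m.group(1)] += 1
    # Values are kept as strings and sorted as text, like sort | uniq -c.
    return sorted(counts.items())
```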

--j.


Re: removing the rule-name-length limit (was Re: svn commit: r125722)

2005-01-20 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Daniel Quinlan writes:
 [EMAIL PROTECTED] (Justin Mason) writes:
 
  ah, I didn't post examples of what the new formatting looks like --
  here it is in report_safe 1:
 
 The example I posted *was* report_safe 1, so the new formatting does
 not look like that at all.

hmm.  I didn't spot your example -- mistook it for a sig!
I need to figure out why yours looks different...
let me take a look.

  Now, I can't find any agreement in bugzilla that those limits
  should have been imposed. ;)
 
 Nice try, but today's code changes are the things that get votes, not
 code changes from 2 years ago...

sure. just pointing it out.

  our current translations have the following description lengths:
  
  language  under 50 chars  too long  % too long
  German    201             380       65%
  French    236             158       40%
  Dutch     476             113       19%
  Polish    275             107       28%
  
  in other words *none* of our translations yet implement the 50 character
  limit (bug 4007, bug 4040).   In bug 4040, Klaus notes that he doubts it's
  *possible* to bring German descriptions under 50 characters anyway.
 
 I'm more willing to discuss increases to description lengths (given the
 expansion factor of many languages over English) than rule name lengths
 as I think carrying over to two lines does not render reports that
 unreadable.  However, I think increasing the rule name length from 22 is
 too much.
 
 If I look at all of the rule name lengths from the custom rule sets
 (including a French one) on the Wiki, 9992 rules have a length of 22 or
 lower and only 16 rules have a length of 23 or 24 (none are longer than
 24).
  
  1. allows German-language 70-character descriptions ;)
 
 German typically requires approximately 25-35% more length than English, so
 changing the limit to 65 or 70 characters would be fine with me.
  
  2. allows long enough rule names to support the additional 13 characters
  that should be added to each rule name in automc (bug ID, comment number,
  T_MC_ prefix, and underscores between them, ie.
  T_MC_rulename_2243_13).   right now, we avoid this more-or-less by just
  adding 7 chars, MC_rulename_9Ac.  But still, make test and buildbot
  will fail, if a bug with a rule name of longer than 15 characters in it is
  mass-checked.
 
 Just ignore the limits for T_ rules.  That's fine with me.

OK -- I'm happy to go for:

- relax the description limit to allow 2-line descs
- keep 22-char limit on rulenames
- except for T_-prefix names

That still leaves the problem of a few unreadable rule names, like
FROM_WEBMAIL_END_NUMS6, though.  I'd like to relax the rulename
limit a *little* -- 28 chars maybe?
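
A rough sketch of the check proposed above, in Python rather than the
Perl the test suite actually uses.  The function names are invented for
illustration; the 22-character limit, the T_ exemption and the
T_MC_rulename_2243_13 form all come from this thread.

```python
NAME_LIMIT = 22  # rule-name limit discussed in this thread

def rule_name_ok(name, limit=NAME_LIMIT):
    """T_-prefixed (test) rules are exempt; others must fit the limit."""
    return name.startswith("T_") or len(name) <= limit

def automc_name(rule, bug_id, comment):
    """Full automc form quoted above: T_MC_<rule>_<bug>_<comment>."""
    return f"T_MC_{rule}_{bug_id}_{comment}"

if __name__ == "__main__":
    # FROM_WEBMAIL_END_NUMS6 is exactly 22 characters, so it just passes.
    print(rule_name_ok("FROM_WEBMAIL_END_NUMS6"))
    full = automc_name("rulename", 2243, 13)
    # The automc decoration adds 13 characters to the base rule name.
    print(full, len(full) - len("rulename"))
```

With the exemption in place, the 13 extra characters automc wants no
longer push a mass-checked rule over the limit.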

--j.



Re: svn commit: r125877 - /spamassassin/trunk/t/desc_wrap.t

2005-01-21 Thread Justin Mason


Malte S. Stretz writes:
 On Friday 21 January 2005 04:29 CET [EMAIL PROTECTED] wrote:
  Author: jm
  Date: Thu Jan 20 19:29:20 2005
  New Revision: 125877
 
  URL: http://svn.apache.org/viewcvs?view=rev&rev=125877
  Log:
  fix desc_wrap.t to deal with different Text::Wrap behaviour on older
 
 Maybe we should just require the newer version of Text::Wrap?  Or implement 
 our own wrapping algorithm as Daniel suggested though I prefer to reuse the 
 existing module.

no need -- it's now fixed anyway, so no longer an issue.

--j.



Re: making spamassassin a meta document

2005-01-25 Thread Justin Mason


I'm happy with the sa-check idea, as long as we keep a spamassassin
wrapper that just does an exec().  easy enough, and very sensible.
+1

I think the POD docs from spamassassin should be split into the
sa-check POD and whatever other sa-blah scripts we come up with
from that.

also, +1 on Michael's sa-history script idea.

however the cvs/svn style, I'm not fond of.  reasons:

- without require-ish hacking, it'll mean all the commands
  would get use'd -- increasing RAM usage.  I'd prefer to
  avoid that.

- having multiple commands as prefix-command, e.g. sa-learn,
  sa-check etc. is good as a UNIX UI -- sa-<tab> to get the list
  of possible commands.

- the POD file for that one wrapper would be gigantic and unusable.
  we could go for an svn-style "spamassassin help", but then we'd have
  to write our own documentation-reading subsystem, which seems like
  wasted effort when POD is already there and already working nicely
  on all platforms.
  
  also, TBH I find that kind of subsystem to be an annoying UI -- do I
  read the man page?  do I type "blah help"?  "blah help commands"?
  etc.

--j.

Malte S. Stretz writes:
 On Sunday 23 January 2005 00:22 CET Daniel Quinlan wrote:
  I've been thinking about bug 3635.
 
  One idea:
 
 rename spamassassin to sa-check
 make spamassassin a meta document that execs sa-check for backwards
compatibility
 
  Another idea:
 
 make spamassassin a meta document that execs sa-check for backwards
compatibility
 move spamassassin pod to spamassassinrun document
 
 Yet another idea:
 
 make spamassassin a caller for all tools, a bit like the cvs commands.  
 Like this:
   old              | new                  | calls
   -----------------+----------------------+-----------
   spamassassin     | spamassassin check   | sa-check
   sa-learn         | spamassassin learn   | sa-learn
   spamassassin -r  | spamassassin report  | sa-report
   spamassassin -d  | spamassassin clean   | ...
 
 All sub-commands could be moved to /usr/lib/spamassassin (and out of $PATH 
 when some compatibility flag is disabled) at some point.



new DK results

2005-01-27 Thread Justin Mason
from a mass-check run I did last night.  these are more promising;
12% of ham whitelistable:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  19992     9999     9993    0.500   0.00    0.00  (all messages)
100.000  50.0150  49.9850    0.500   0.00    0.00  (all messages as %)
  6.338   0.0500  12.6288    0.004   1.00    0.00  DK_SIGNED
  0.005   0.0000   0.0100    0.000   0.53    0.00  DK_POLICY_SIGNALL
  0.485   0.0400   0.9307    0.041   0.47   -0.00  DK_VERIFIED
  4.627   5.5506   3.7026    0.600   0.06    0.00  DK_POLICY_TESTING
  5.162   6.1206   4.2029    0.593   0.00    0.00  DK_POLICY_SIGNSOME

this was achieved by adding code which strips off known appended
headers from the message, such as X-Spam-*, Status, IMAPBase etc.
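
The stripping step could look roughly like the Python sketch below.
This is not the code that was actually added: the header names are the
ones listed above, while the exact matching rules (case-insensitive, an
optional X- prefix on IMAPBase, dropping continuation lines along with
their header) are assumptions.

```python
import re

# Names from the message above; the matching details are assumptions.
APPENDED = re.compile(r"^(X-Spam-[^:\s]*|Status|(X-)?IMAPBase)\s*:",
                      re.IGNORECASE)

def strip_appended_headers(header_lines):
    """Drop headers that were added after signing, with their
    folded continuation lines."""
    kept = []
    skipping = False
    for line in header_lines:
        if line[:1] in (" ", "\t"):
            # Continuation line: shares the fate of the header it extends.
            if not skipping:
                kept.append(line)
            continue
        skipping = bool(APPENDED.match(line))
        if not skipping:
            kept.append(line)
    return kept

if __name__ == "__main__":
    headers = [
        "From: someone@example.com",
        "X-Spam-Status: No, hits=0.1",
        "\ttests=NONE",
        "Subject: hello",
        "Status: RO",
    ]
    print(strip_appended_headers(headers))
```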

Records that passed verification were:

954 gmail.com
270 yahoo.com
 10 crynwr.com
  9 earthlink.net
  6 space.net
  5 yahoo-inc.com
  5 omniti.com
  1 sendmail.com
  1 altn.com

and that's it.  AFAICS, most of those domains have only one selector,
so that's a puny 9 DNS lookups?  looking quite promising. ;)

--j.


Re: [SURBL-Discuss] Re: Revisiting high-level 3.1 goals

2005-01-31 Thread Justin Mason


Daniel Quinlan writes:
 Raymond Dijkxhoorn [EMAIL PROTECTED] writes:
 
  Please let us know what we should do, cutting out we should announce, the 
  actual removal is just altering one export script...
 
 Considering that SA hasn't shipped with JP yet and that those hosts are
 already caught in WS (which predates JP), I'd announce that you're
 making the change in a week and then make the change.

btw, I think requiring people to upgrade ASAP isn't necessarily a great
idea; we can avoid it by setting up a new BL for WS minus JP.  then
3.1.0 can look up

- JP
- WS_minus_JP

and existing clients can look up

- WS (which includes JP as before)

and upgrade at their leisure...

--j.



Re: Revisiting high-level 3.1 goals

2005-01-31 Thread Justin Mason


Robert Menschel writes:
 Hello Daniel,
 
 Saturday, January 29, 2005, 9:46:05 PM, you wrote:
 
   - higher accuracy: lower FPs and lower FNs (rules, rules, rules... this
 also includes some notion of speeding up the mass-check process)
 
 DQ I've been banging away on this.  We're closer to fixing the autolearn
 DQ thing and Henry has expressed some interest in coordinating a test of
 DQ perfect (train on everything) and perfect-sample (train on sample)
 DQ learning.
 
 DQ bin-doph's ReplaceTags plugin will also really help with rule writing, I
 DQ think, so I hope we get that into the tree soon.
 
 DQ I also now have a working prototype of network-test reuse code and boy
 DQ does it speed up network mass-checks.
 
 Look forward to all of those.  I'm also trying to develop a
 mass-check installation/setup script of my own, based on what you
 were able to give me last year, which will enable people to simply run
 a script and build a mass-check system. It will enable people to do
 their own mass-checks the way we do in SARE, and it will also enable
 them to participate in the primary nightly mass-check run.
 
 My install/setup is still very rough, and has a long way to go, so I
 don't want to try to put a time table on it, but I have hopes it will
 be a help to people.

I'd really like to get mass-check a *lot* more usable -- not sure exactly
what would be involved, though. :(

That was the aim of Duncan's patch in bz, but unfortunately we didn't get
that into 3.0.0 and I think it's a little unlikely to be quite usable by
now.

--j.



Re: svn commit: r149224 - in spamassassin/trunk: lib/Mail/SpamAssassin/PerMsgStatus.pm lib/Mail/SpamAssassin/Plugin.pm lib/Mail/SpamAssassin/Plugin/DefaultAutoLearnDiscriminator.pm rules/10_misc.cf rules/init.pre

2005-01-31 Thread Justin Mason


Michael Parker writes:
 Few things:
 
 1) I thought plugin calls couldn't return values?

actually -- they can.  (I think in the config file case, it was the type
of the return value changing frequently that was problematic).

 2) I like the Plugin API, but why not keep the default in the code and
    allow added plugins to override?  Doesn't need its own default
    plugin.

well, effectively doing this as a default plugin *does* this, without
adding extra code for plugins to indicate "do not run the default
code", which is why I did it this way.  (however perhaps it doesn't
need to be in the Mail::SpamAssassin::Plugin hierarchy, it could
be named something else.)

 3) MANIFEST, assuming the plugin stays.

oops!

Anyway, I'm fine with changing the details if necessary.

--j.



Re: svn commit: r149224 - in spamassassin/trunk: lib/Mail/SpamAssassin/PerMsgStatus.pm lib/Mail/SpamAssassin/Plugin.pm lib/Mail/SpamAssassin/Plugin/DefaultAutoLearnDiscriminator.pm rules/10_misc.cf rules/init.pre

2005-01-31 Thread Justin Mason


Michael Parker writes:
 On Sun, Jan 30, 2005 at 10:49:04PM -0800, Justin Mason wrote:
  Michael Parker writes:
   Few things:
   
   1) I thought plugin calls couldn't return values?
  
  actually -- they can.  (I think in the config file case, it was the type
  of the return value changing frequently that was problematic).
 
 So, in the case where more than one plugin handles a call, which value
 is returned?  last one run wins?

if any return a defined value, that is used.
actually, it's ||= -- so in this case it's a little more complex because
one return value supported is 0 as well as undef.

hmm.  it may be better to have the last plugin get the return value.

   2) I like the Plugin API, but why not keep the default in the code and
      allow added plugins to override?  Doesn't need its own default
      plugin.
  
  well, effectively doing this as a default plugin *does* this, without
  adding extra code for plugins to indicate "do not run the default
  code", which is why I did it this way.  (however perhaps it doesn't
  need to be in the Mail::SpamAssassin::Plugin hierarchy, it could
  be named something else.)
 
 Something like:
 
 $foo = call_plugin
 
 if !defined($foo)
   the default code goes here to set $foo

There'd have to be an additional boolean indicating a plugin handled
this, rather than just returning undef -- since the API is tri-state
(undef/0/1 as return values).
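
To make the tri-state problem concrete, here is a small Python model of
it, with None standing in for undef.  The function names are made up
and this is not the call_plugins code itself; it only shows why a
||=-style merge, a "first defined" merge and an explicit "handled" flag
behave differently.

```python
def merge_or(results):
    """Mimics Perl's `$ret ||= $value`: ends up with the first true value."""
    ret = None
    for value in results:
        ret = ret or value
    return ret

def merge_first_defined(results):
    """Keeps the first value that is not None (Perl: first defined)."""
    for value in results:
        if value is not None:
            return value
    return None

def call_plugins(plugins):
    """Each hypothetical plugin returns (handled, value), so a plugin
    can answer with None and still count as having handled the call."""
    for plugin in plugins:
        handled, value = plugin()
        if handled:
            return True, value
    return False, None

if __name__ == "__main__":
    returns = [None, 0, 1]
    print(merge_or(returns))             # the 0 is silently skipped
    print(merge_first_defined(returns))  # the 0 wins
    print(call_plugins([lambda: (False, None), lambda: (True, None)]))
```

Only the last form lets the caller tell "no plugin answered" apart from
"a plugin answered undef", which is what the default-code fallback
would need.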

 If someone disabled the plugin in init.pre and did not install their
 own plugin, what would happen?

no autolearning occurs, everything else works as expected. ;)

--j.



Re: svn commit: r149224 - in spamassassin/trunk: lib/Mail/SpamAssassin/PerMsgStatus.pm lib/Mail/SpamAssassin/Plugin.pm lib/Mail/SpamAssassin/Plugin/DefaultAutoLearnDiscriminator.pm rules/10_misc.cf rules/init.pre

2005-01-31 Thread Justin Mason


Michael Parker writes:
 On Mon, Jan 31, 2005 at 05:25:25PM -0500, Theo Van Dinter wrote:
  On Mon, Jan 31, 2005 at 04:16:01PM -0600, Michael Parker wrote:
   Which might be ok, but I can promise you that someone is going to go
   through and either rm init.pre or comment out every loadplugin line
   and then start asking questions about why their system isn't
   autolearning.
  
  Yeah, but they'll do the exact same thing with SURBL, Razor, etc.
 
 I'm not so worried about those, those are all pretty much self
 contained, so if they get shutoff no harm done.  It's turning off
 pieces of a system in the core that bothers me.

ok, you've convinced me... feel free to refactor that back into core, I
think.  It seems this *is* a little more core than Razor, Pyzor et al.

(probably easiest to just rename the module back into the
Mail::SpamAssassin::* namespace, then add the if defined() glue after
the call_plugins call, rather than pushing the subs back into
PerMsgStatus entirely.)

--j.



Re: optional vs. standard plugins

2005-01-31 Thread Justin Mason


Daniel Quinlan writes:
 Michael Parker wrote:
  Which might be ok, but I can promise you that someone is going to go
  through and either rm init.pre or comment out every loadplugin line
  and then start asking questions about why their system isn't
  autolearning.
 
 Losing autolearning if someone deletes init.pre is completely
 acceptable.  Autolearning *should* be optional and pluggable.  Making it
 pluggable allows people to experiment and try out other autolearning
 mechanism and I suspect we'll see some usage of the API soon.  ;-)
 
 We could also add a new autolearn state like "notloaded".

That may indeed be a good idea, since this is now a new way
for people to screw up their configs ;)

 Theo Van Dinter [EMAIL PROTECTED] writes:
 
  I think we should shoot for a goal of when all plugins are disabled
  the system should still do the right thing.  If that means that we at
  least provide a default inline that can be overridden by a plugin,
  then that is how we should do it.
 
 Not autolearning if it has been disabled *is* the right thing.  Things
 work fine if autolearning is off.
 
 Also, our current autolearning code does not improve results by that
 much in practice (which is why it needs to be revisited and other ways
 to autolearn to be explored).  See Gordon C.'s paper for those results.
 
  I'll provide a slightly different version: for code that people are
  likely not to override (such as autolearning), we should probably just
  have it be in the code by default and let plugins override as
  necessary..
 
 I disagree in this case, although I think there are probably some cases
 where things are likely to not be overridden.  Users are going to
 encounter plugins and they're now a major part of basic SpamAssassin
 functionality (much like Apache httpd, incidentally)

not a coincidence...

, we should just
 document things well enough.  If people comment out stuff without
 thinking, then there's not too much we can do about it.

That's true.   init.pre is exactly analogous to httpd.conf; an Apache
install can be rendered thoroughly useless by turning off the wrong
plugins.

 For plugins that are likely to not be overridden, I'd be fine with
 splitting init.pre into two or more files, like:
 
   standard.pre
   optional.pre
   experimental.pre
 
 or whatever.  That would go a long way to guiding people as to how
 seriously they need to think before commenting stuff out.

And, FWIW, I think I wrote the pre code to load all files that
end in .pre, so this should work if we want that.

 Of course, I agree ** 100% ** that everything should "work" as in "not
 fail" if all plugins are commented out.  There might be a few cases
 where plugins have cross-dependencies, but we should make sure our code
 deals with those and acts appropriately (warn, die, dbg, or whatever,
 but *no* straight Perl interpreter errors!).
 
 Also, putting a line next to the AutoLearnThreshold load line such as:
 
 # at least one AutoLearn plugin needs to be loaded for autolearning to work
 
 is more than enough to prevent a stupid commenting out.  If people just
 comment stuff out without thinking or delete init.pre, we can't save
 them.

OK, I agree with everything in this message ;)

--j.



Re: optional vs. standard plugins

2005-01-31 Thread Justin Mason


Michael raised a good point on IRC:

<Herk> Theo makes a good point on the existing init.pre file, upgrades
       aren't going to get the new loadplugin lines added to their init.pre

<jmason> well, it's exactly analogous to Apache's httpd.conf file

<Herk> true, but can you disable some small piece of core functionality by
       not updating your conf file?  I'm having trouble thinking of a
       concrete example in other types of servers
<Herk> that match this case
<Herk> FYI, I'm not -1 on the plugin, just stating an opinion that I don't
       believe having a plugin for the default case is needed/wise

<jmason> hmm.  you know, that's a point alright, this may cause upgrade
         issues.  I hadn't considered that
<jmason> specifically that 3.0.0 already has an init.pre, and we haven't
         currently got code that'll overwrite one of those

so in other words, if users upgrade 3.0.x -> 3.1.0, unless we add code
to our installer to deal with this, it'll mean they'll have to manually
edit init.pre to add loadplugin lines for the new default discriminator.

--j.



Re: svn commit: r151753 - spamassassin/trunk/lib/Mail/SpamAssassin/Plugin.pm

2005-02-07 Thread Justin Mason


[EMAIL PROTECTED] writes:
 +Note: there are no guarantees that the internal data structures of
 +SpamAssassin will not change from release to release.  In particular to
 +this plugin hook, if you modify the rules data structures in a
 +third-party plugin, all bets are off until such time that an API is
 +present for modifying that configuration data.

... that makes the new plugin API sound quite a bit less useful ;)
what's it being added for?

--j.



Re: svn commit: r151753 - spamassassin/trunk/lib/Mail/SpamAssassin/Plugin.pm

2005-02-07 Thread Justin Mason


Daniel Quinlan writes:
 [EMAIL PROTECTED] (Justin Mason) writes:
 
  ... that makes the new plugin API sound quite a bit less useful ;)
  what's it being added for?
 
 ReplaceTags.
 
 I hope to eventually clean up the internals and make it okay after 3.1,
 but it's just a bit too hairy right now to feel okay about making the
 API usable for random third-party plugins (it's fine as long as the
 integrator checks compatibility).

yeah, I was hoping for the "pass rule code for rules with certain tflags
into the plugin for substitution" approach as I mentioned before -- that
doesn't require the plugin to delve into the Conf structure to do it.

 The API is still generally useful if you have *anything* to do at end of
 parsing (for example, tying a DB as in accessdb), stuff that doesn't
 involve internal APIs.

ah, that's a good point.  that hadn't occurred to me...

--j.



Re: RFC: Plan for faster updates

2005-02-08 Thread Justin Mason


Theo Van Dinter writes:
 Ok, here are my thoughts about how to do faster updates.  ie: how
 to release rules + scores faster, potentially multiple times a day.
 I currently only think rules + scores ought to be released this way -- people
 aren't going to be comfortable with automated code updates IMO.  Code/plugins
 are best left to full releases.  (plugin support could be easily added later
 on, btw.)
 
 Pseudo-code is below, but here's some background details:
 
 Updates occur from channels.  The default channel is
 updates.spamassassin.org, but the user can specify any number of
 channels on the commandline to use additionally.  These can either be
 provided by us (think of updates being stable vs experimental vs ...),
 or some third party (as long as they provide the same infrastructure...)

cool.

 Updates have version numbers.  The value format of which is irrelevant,
 as long as it's monotonically increasing.  For our updates I was thinking
 SVN revision, but could also do MMDDVV ala DNS SOA, etc.
 
 Versions are tracked per channel and SpamAssassin version.  To check
 for updates, do a DNS TXT query ala z.y.x.updates.spamassassin.org,
 where z.y.x refers to the version of SpamAssassin being used, aka:
 x.y.z for 3.0.2, etc.  For simplicity, wildcards can be used on the
 DNS server to match a whole set of releases.  An example:
 
 *.0.3.updates.spamassassin.org TXT 154203
 *.1.3.updates.spamassassin.org TXT 158203
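
As a sketch of the naming scheme (Python, no real DNS traffic): the
helper names are hypothetical, and treating the TXT payload as a bare
integer is an assumption based on the examples above.

```python
def update_query_name(sa_version, channel="updates.spamassassin.org"):
    """Reverse the x.y.z version and prepend it to the channel name."""
    reversed_version = ".".join(reversed(sa_version.split(".")))
    return f"{reversed_version}.{channel}"

def needs_update(installed_version, txt_answer):
    """Update versions only need to increase monotonically, so a plain
    integer comparison is enough under the assumption above."""
    return int(txt_answer) > int(installed_version)

if __name__ == "__main__":
    print(update_query_name("3.0.2"))
    print(needs_update(154000, "154203"))
```

The actual lookup (a TXT query for the generated name) is left out on
purpose; any resolver library would slot in between the two helpers.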
 
 I haven't decided if that needs to be more machine parsable for future
 expansion.  ie: v=1 ver=154023    I can't think of anything off hand
 that would need to go in there so just a version number is probably ok.
 
 For the initial request, mirrors.channel is a TXT record with an URL for
 the MIRRORED.BY (ie: http://spamassassin.apache.org/updates/MIRRORED.BY),
 which contains a list of parent URLs, and an optional list of options
 per mirror.  ie:
 
 http://spamassassin.apache.org/updates weight=20
 http://spamassassin.kluge.net/updates
 http://somemirror.example.com/spamassassin/updates weight=4
 
 Means there are 3 mirrors, weighted so the apache.org one will be used the
 most (80% of the time), followed by the example.com one (16% of the time),
 followed by the kluge.net one (4% of the time).  Weights are default
 '1', btw.
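
The weighting above can be checked with a short Python sketch.  The
parser assumes exactly the line format shown (URL, then optional
key=value options) and nothing more; the function names are invented
for illustration.

```python
import random

def parse_mirrors(text):
    """Parse MIRRORED.BY-style lines into (url, weight) pairs."""
    mirrors = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        url, weight = fields[0], 1        # weight defaults to 1
        for option in fields[1:]:
            key, _, value = option.partition("=")
            if key == "weight":
                weight = int(value)
        mirrors.append((url, weight))
    return mirrors

def shares(mirrors):
    """Fraction of requests each mirror should receive."""
    total = sum(weight for _, weight in mirrors)
    return {url: weight / total for url, weight in mirrors}

def pick_mirror(mirrors, rng=random):
    """Choose one mirror at random, proportionally to its weight."""
    urls = [url for url, _ in mirrors]
    weights = [weight for _, weight in mirrors]
    return rng.choices(urls, weights=weights, k=1)[0]

MIRRORED_BY = """\
http://spamassassin.apache.org/updates weight=20
http://spamassassin.kluge.net/updates
http://somemirror.example.com/spamassassin/updates weight=4
"""

if __name__ == "__main__":
    mirrors = parse_mirrors(MIRRORED_BY)
    for url, share in shares(mirrors).items():
        print(f"{share:.0%} {url}")
    print(pick_mirror(mirrors))
```

The weights sum to 25, which is where the 80% / 16% / 4% split comes
from.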
 
 The directory that is to be mirrored out appropriately looks like:
 
 dir/
   MIRRORED.BY
   version.ext
   version.ext.sha1
   ...
   versionn.ext
   versionn.ext.sha1
 
 with version.ext.gpg .. versionnn.ext.gpg available optionally.
 I don't think GPG needs to be required, but for the paranoid amongst us,
 it needs to be available as an option.
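
The .sha1 companion files suggest a check along these lines.  A Python
sketch only: the proposal does not say what the .sha1 file contains, so
the assumption here is a hex digest as the first whitespace-separated
token (the usual sha1sum layout).

```python
import hashlib

def sha1_matches(payload: bytes, sha1_file_text: str) -> bool:
    """Compare a downloaded update against its .sha1 companion file."""
    fields = sha1_file_text.split()
    if not fields:
        return False                      # empty or unreadable digest file
    return hashlib.sha1(payload).hexdigest() == fields[0].lower()

if __name__ == "__main__":
    digest_file = "a9993e364706816aba3e25717850c26c9cd0d89d  154203.ext\n"
    print(sha1_matches(b"abc", digest_file))
```

This only guards against corrupted or truncated downloads; someone who
controls a mirror can replace both files, which is the case the
optional GPG signature is meant to cover.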
 
 At the end, the script outputs a number of channel.cf files, which by
 default will just be read by SpamAssassin at startup (leaving restarting
 spamd up to the admin outside the script, based on exit code...)  If a
 different directory is used, admin can simply include the channel.cf
 file in their local.cf.
 
 There are a few things I haven't fully fleshed out yet:
 
 1) How to archive the update files together?  I envisioned a similar
 naming convention to our normal rules directory (ie: a bunch of files
 named ##_type.cf), but the script should just expect to download a single
 file which will then be expanded.  I don't want to rely on system calls to
 run an expansion, nor do I want to expect tar or zip to be installed, etc.
 
 2) How to validate with GPG?  Similar to the archive issue.  Perhaps using
 GnuPG::Interface?  It's really just a wrapper to running gpg from the
 commandline, but at least abstracts the issue for platforms where gpg isn't
 what I think it is.
 
 3) Using channel.cf means that it may or may not come after local.cf.
 We should probably use some form of prefix to get it to load beforehand,
 but what?  People should be able to override the channel config if
 they want to.  I don't know if I want AA_updates_spamassassin_org.cf
 as a file.
 
 Pseudo code:
 
 - Script has a list of GPG keys which are allowed to sign update releases.
   The default is 265FA05B, which is the SA signing key.
 - load Mail::SpamAssassin
 - load Digest::SHA1
 - load LWP
 - Accept commandline options for GPG keys to allow for signing in addition
   to default (for third-party updates).
 - Accept commandline option for whether or not to use GPG for verification.
 - Accept commandline options for additional channels to use beyond
   updates.spamassassin.org
 - Accept commandline option for parent directory for updates.  Default is
   whatever the first site_rules_path value is, ie: /etc/mail/spamassassin.
   ala: $msa->first_existing_path (@M::SA::site_rules_path);
 - Accept other options such as debug, version, etc.
 - exit code = 255
 - foreach ( @channels ):
   - Convert channel name to platform friendly version?  Is
 foo.bar.baz.etc.example.com ok for all platforms?  I was thinking
 s/\./_/g

+1 on that.

   - read 

Re: RFC: Plan for faster updates

2005-02-09 Thread Justin Mason


Robert Menschel writes:
 TVD Versions are tracked per channel and SpamAssassin version.  To check
 TVD for updates, do a DNS TXT query ala
 TVD z.y.x.updates.spamassassin.org,
 TVD where z.y.x refers to the version of SpamAssassin being used, aka:
 TVD x.y.z for 3.0.2, etc.  For simplicity, wildcards can be used on the
 TVD DNS server to match a whole set of releases.  An example:
 
 TVD *.0.3.updates.spamassassin.org TXT 154203
 TVD *.1.3.updates.spamassassin.org TXT 158203
 
 And I assume that *.*.3 would also be viable to accept rules for all
 3.x.x versions, or more to the point, *.*.2 could be used within SARE
 to flag rules that apply to all 2.xx versions that predate 3.0.0.

Hold on -- something's just occurred to me from the SPF development;
this won't be possible, because BIND doesn't support bar.*.foo wildcards
(ie. wildcards in a non-lowest-level record.)

We may have to have a way to explicitly mark wildcards.  in other
words, do lookups like

1.0.3.updates.spamassassin.org
star.0.3.updates.spamassassin.org
star.3.updates.spamassassin.org
star.updates.spamassassin.org   (possibly N/A)
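
The lookup order above can be generated mechanically.  A Python sketch,
assuming the literal label "star" as the explicit wildcard marker, as
in the list; the function name is made up.

```python
def wildcard_lookup_names(sa_version, channel="updates.spamassassin.org",
                          marker="star"):
    """Return the names to try, most specific first."""
    parts = list(reversed(sa_version.split(".")))   # "3.0.1" -> 1, 0, 3
    names = [".".join(parts + [channel])]
    # Replace one more leading label with the marker on each step.
    for i in range(1, len(parts) + 1):
        names.append(".".join([marker] + parts[i:] + [channel]))
    return names

if __name__ == "__main__":
    for name in wildcard_lookup_names("3.0.1"):
        print(name)
```

A client would query these in order and stop at the first name that
returns a TXT record, so the common case is still a single lookup.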

--j.

 TVD The directory that is to be mirrored out appropriately looks like:
 TVD dir/
 TVD  MIRRORED.BY
 TVD  version.ext
 TVD  version.ext.sha1
 TVD  ...
 TVD  versionn.ext
 TVD  versionn.ext.sha1
 
 TVD with version.ext.gpg .. versionnn.ext.gpg available optionally.
 TVD I don't think GPG needs to be required, but for the paranoid
 TVD amongst us, it needs to be available as an option.
 
 Where do these updates come from?  When would the GPG signature be
 applied, and by whom/what?  Within SARE we have multiple working
 files, and I can see our scripts combining all files that match a
 given criteria into a single channel file. The original files are
 sometimes signed to validate them, but I don't see any value to having
 an automated script sign the compilation. I suppose it might be a YMMV
 situation.

yep.  at the least, this serves to avoid someone subverting a mirror and
putting up their own files without at least stealing the signing key too.
It's definitely a good idea.

 TVD At the end, the script outputs a number of channel.cf files,
 TVD which by default will just be read by SpamAssassin at startup
 TVD (leaving restarting spamd up to the admin outside the script,
 TVD based on exit code...)  If a different directory is used, admin
 TVD can simply include the channel.cf file in their local.cf.
 
 Good.
 
 TVD There are a few things I haven't fully fleshed out yet:
 
 TVD 1) How to archive the update files together?  I envisioned a
 TVD similar naming convention to our normal rules directory (ie: a
 TVD bunch of files named ##_type.cf), but the script should just
 TVD expect to download a single file which will then be expanded.  I
 TVD don't want to rely on system calls to run an expansion, nor do I
 TVD want to expect tar or zip to be installed, etc.
 
 I would think that the compilation script could simply cat the
 component files together.  eg [I often use shell as my meta language]:
version=$mmddhhss # simple version calc
# loop through compilation definition files.
# For each definition, grab output file name from line 1.
# Remainder of lines name files fed into compilation.
for compilefile in "$compiledir"/*.compile ; do
   outfile=$( sed 1q "$compilefile" )
   newer=no  # assume this compilation not updated
   # For each file in the compilation, check to see if it is newer
   # than the last compilation built.
   for infile in $( sed -n '2,$p' "$compilefile" ) ; do
      if [[ $infile -nt $outfile ]]
      then newer=yes
      fi
   done
   # If any input file is newer than the last compilation built,
   # then build a new compilation.
   if [[ $newer = yes ]]
   then echo "$version" > "$outfile"
        cat $( sed -n '2,$p' "$compilefile" ) >> "$outfile"
   fi
done
 
 TVD 3) Using channel.cf means that it may or may not come after
 TVD local.cf. We should probably use some form of prefix to get it to
 TVD load beforehand, but what?  People should be able to override the
 TVD channel config if they want to.  I don't know if I want   
 TVD AA_updates_spamassassin_org.cf
 TVD as a file.
 
 I would agree that we want all channel files to come before local.cf
 alphabetically, and also want them to have reasonably short names.
 
 What about a name like CH.$channel.$abbr.cf where $channel is the
 channel file name (eg: updates, scores, hispamnoham, etc), and $abbr
 is an abbreviation for the source of that channel (perhaps fed through
 a second field on line 1, or through the second line of the channel
 file).  That would give us files like:
 CH.updates.SA.cf
 CH.scores.SA.cf
 CH.hispamnoham.SARE.cf
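
A quick Python check of that naming idea.  The helper name is
hypothetical, and plain byte-wise sorting is an assumption about how
the config files would be ordered.

```python
def channel_cf_name(channel, abbr):
    """Build CH.<channel>.<abbr>.cf as suggested above."""
    return f"CH.{channel}.{abbr}.cf"

if __name__ == "__main__":
    files = [
        "local.cf",
        channel_cf_name("updates", "SA"),
        channel_cf_name("hispamnoham", "SARE"),
    ]
    # Upper-case "CH." sorts before lower-case "local.cf" byte-wise.
    print(sorted(files))
```

Under a case-insensitive ordering the result is the same here, since
"ch" still sorts before "lo".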
 
 This leaves open the question of how do we prioritize the occasional
 override?
 
 Let's say SARE includes an english channel, containing our rules
 

Re: [Bug 4124] New: New spamassassin script doesn't work due to tainting

2005-02-10 Thread Justin Mason


Daniel Quinlan writes:
 Malte S. Stretz [EMAIL PROTECTED] writes:
 
  I'll fix this (it needs to be done via the B_FOO (build) and I_FOO
  (install) hacks).
 
 Thanks, I sent a few comments in my last message.  ;-)
 
  Just to be sure: spamassassin is always in the same dir as sa-filter?
  So the symlink can be spamassassin -> sa-filter and doesn't have to
  contain the absolute path (which is impossible)?
 
 Yes, it will always be in the same directory.
 
 Maybe we should require a new separate file under build for the MY stuff
 to remove some of the base code (as opposed to the build instructions).

+1 -- I think that's a very good idea.

--j.



Re: Broken .htaccess in spamassassin.apache.org

2005-02-15 Thread Justin Mason


Sander Striker writes:
 Subject says it all.  Please fix ASAP.

did it get fixed?  appears to be working now.  the contents are:

  Redirect /doc http://spamassassin.apache.org/full/3.0.x/dist/doc
  Redirect /downloads.html http://spamassassin.apache.org/downloads.cgi?update=200409211830
  Redirect /favicon.ico http://spamassassin.apache.org/images/favicon.ico

Sander, what URL did you see failures on?

--j.



Re: Re[2]: RFC: Plan for faster updates

2005-02-15 Thread Justin Mason


Robert Menschel writes:
 Hello Theo,
 
 Saturday, February 12, 2005, 11:21:16 PM, you wrote:
 
  3) Using channel.cf means that it may or may not come after local.cf.
  We should probably use some form of prefix to get it to load beforehand,
  but what?  People should be able to override the channel config if
  they want to.  I don't know if I want
  AA_updates_spamassassin_org.cf
  as a file.
 
 TVD I haven't come up with anything for this yet.
 
 Since hit-frequencies requires numeric prefixes to give us stats
 concerning hit ratios, and since 60_whitelist.cf is the highest
 numbered file in distribution, I'd suggest maybe 65.$channel.cf for
 all channel files?
 
 Or use 65.update.cf for the distribution channel, and let other
 channels supply the numeric prefix as part of their channel name?

Actually, just last week I fixed the bug that required the numeric
prefixes, so that's no longer a problem ;)

--j.



Re: Trie optimisation of simple alternations for blead perl.

2005-02-15 Thread Justin Mason


demerphq writes:
 As you can see, except for the _construct benchmarks, B wins by a
 large margin. The _construct tests are designed to see what the
 overhead is of constructing the trie for nothing (ie, the match is at
 ^), and shows that the construct time is half as fast (this is
 unsurprising as the entire cost of A must be carried by B as the
 optimisation doesn't occur until study_chunk()). OTOH the parse times
 are much better. perl_keywds searches for a list of words like perl
 keywords in the bench script that comes as part of perlbench, and
 shows that for this type of matching the trie is much much faster than
 the current mechanism.

FWIW, this looks like it'd be excellent for SpamAssassin ;)

I haven't had much time to look over the implementation, and I'm not
really any use for reviewing it from a p5p POV due to lack of familiarity
with perl internals, but the benchmark figures look fantastic and the
implementation details sound good.

I'd love to see this get into perl, even if just as an option enabled
through a "use" pragma.  (In my opinion, if your regexp will benefit
from a trie, you will know that in advance.)
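
For anyone unfamiliar with the idea, here is a minimal sketch of why a trie helps with a long list of literal alternatives. It is written in Python purely for illustration and is not the patch under discussion; `build_trie` and `match_at` are hypothetical names. Note that it returns the longest word found at a position, whereas Perl's alternation picks the first alternative that matches, so the sketch shows only the data structure, not the regex engine's semantics.

```python
END = None  # marker key: "a complete word ends at this node"


def build_trie(words):
    """Build a nested-dict trie in which each key is one character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def match_at(trie, text, start=0):
    """Return the longest word from the trie that begins at text[start].

    The text is walked once, one character per step, no matter how many
    words were loaded; trying each alternative in turn would rescan the
    same prefix repeatedly. Returns None if no word matches there.
    """
    node = trie
    longest = None
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node:
            longest = text[start:i + 1]
    return longest


if __name__ == "__main__":
    keywords = build_trie(["for", "foreach", "if", "else", "elsif"])
    print(match_at(keywords, "foreach my $x (@list)"))
```

The cost of building the trie is paid once up front, which fits the benchmark description above: construction is slower, matching is faster.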

--j.


