I found the check_whitelist tool very convenient for examining who had sent
me mail and what scores they had earned.  But I often wanted to look at
just one address or one domain, and sometimes to delete exactly those
entries with --clean.  So I extended the script to understand two new
options, --addr and --domain.
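
To make the selection rule concrete, here is a minimal Perl sketch of how a
key could be tested against the two options.  It is not the patch itself:
key_selected is a hypothetical helper, the key layout address|ip=BASE is
taken from the sample output, and the domain pattern is only an assumed
reading of the --domain documentation (the exact domain or any sub-domain).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of check_whitelist: decide whether an AWL
# key of the assumed form "address|ip=BASE" is selected by --addr or --domain.
sub key_selected {
  my ($key, $addr, $domain) = @_;

  # Split off the address part; everything after the first '|' is the IP base.
  my ($address) = split /\|/, $key, 2;

  if ($addr ne '') {
    # \Q...\E quotes regex metacharacters such as '.' and '+'.
    return $address =~ /^\Q$addr\E/ ? 1 : 0;
  }
  if ($domain ne '') {
    # Assumed rule: the domain must follow an '@' (exact domain) or a '.'
    # (sub-domain) and run to the end of the address.
    return $address =~ /[\@.]\Q$domain\E\z/i ? 1 : 0;
  }
  return 1;   # no selector given: every key is selected
}

# Placeholder keys, for illustration only.
for my $key ('alice@example.com|ip=208.192',
             'bob@mail.example.com|ip=200.106',
             'carol@notexample.com|ip=10.1') {
  printf "%-34s addr=%d domain=%d\n", $key,
    key_selected($key, 'alice@', ''),
    key_selected($key, '', 'example.com');
}
```

Quoting the user-supplied string keeps a '.' or '+' in an address from being
read as a regex operator.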

I also found the output format a little too rigid for my taste: I like
to pipe the output to sort with -k1n, -k3n or -k5n as an argument, but the
'(' and ')' sometimes got in the way, and typing a long sed expression every
time was a pain.  So I changed the white-space so each number is its own field.
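
For what it is worth, the numbers in those sort keys are whitespace-separated
field positions: with the new layout, field 1 is the average, field 3 the
total score and field 5 the count, whereas the old layout glued total and
count into a single "(43.7/2)" field.  A small Perl sketch of the same idea
follows; the field helper is hypothetical and the addresses are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines in the new layout; the addresses are placeholders.
my @lines = (
  ' 21.8 (    43.7 /   2 ) -- b@example.org|ip=200.106',
  '  0.0 (     0.0 /   7 ) -- a@example.com|ip=208.192',
);

# Return the Nth (1-based) whitespace-separated field of a line.
sub field {
  my ($line, $n) = @_;
  my @f = split ' ', $line;   # split ' ' ignores leading whitespace
  return $f[$n - 1];
}

# Field 1 is AVG, field 3 is TOTSCORE, field 5 is COUNT.
my @by_count = sort { field($a, 5) <=> field($b, 5) } @lines;
print "$_\n" for @by_count;
```

Sorting on field 1 or 3 instead orders the same lines by average or by total
score.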

I will list the diffs (in -u format) in-line below and attach the entire
updated file; hopefully others will find these changes worthwhile as well.

--- check_whitelist~    Thu Jul 15 03:47:38 2004
+++ check_whitelist     Tue Jan 11 14:58:42 2005
@@ -4,7 +4,7 @@
 
 sub usage {
   die "
-usage: check_whitelist [--clean] [--min n] [dbfile]
+usage: check_whitelist [--clean] [--min n] [--addr addr | --domain domain] [dbfile]
 ";
 }
 
@@ -13,18 +13,26 @@
 use Getopt::Long;
 
 use vars qw(
-               $opt_clean $opt_min $opt_help
+               $opt_clean $opt_min $opt_addr $opt_domain $opt_help
        );
 
 GetOptions(
   'clean'              => \$opt_clean,
   'min:i'              => \$opt_min,
+  'addr:s'             => \$opt_addr,
+  'domain:s'           => \$opt_domain,
   'help'               => \$opt_help
 ) or usage();
 $opt_help and usage();
 
 $opt_min ||= 2;
+$opt_addr ||= '';
+$opt_domain ||= '';
 
+if ($opt_addr ne '' && $opt_domain ne '') {
+       die "addr and domain options are mutually exclusive\n";
+}
+
 BEGIN { @AnyDBM_File::ISA = qw(DB_File GDBM_File NDBM_File SDBM_File); }
 use AnyDBM_File ;
 
@@ -51,14 +59,28 @@
   my $count = $h{$key};
   next unless defined($totscore);
 
+  # There are 3 reasons to skip a given key:
+  # 1. clean was given (but no addr or domain) and the count is at least min.
+  if ($opt_clean && $count >= $opt_min && $opt_addr eq '' && $opt_domain eq '') {
+    #printf "skipping (count) %s\n", $key;
+    next;
+  }
+  # 2. An addr was specified but the key does not match.
+  if ($opt_addr ne '' && !($key =~ /^\Q$opt_addr\E/)) {
+    #printf "skipping (addr) %s\n", $key;
+    next;
+  }
+  # 3. A domain was specified but the key does not match.
+  if ($opt_domain ne '' && !($key =~ /[EMAIL PROTECTED]|/)) {
+    #printf "skipping (domain) '%s'\n", $key;
+    next;
+  }
   if ($opt_clean) {
-    if ($count >= $opt_min) { next; }
     print "cleaning: ";
   }
 
-  printf "% 8.1f %15s  --  %s\n",
-                 $totscore/$count, (sprintf "(%.1f/%d)",$totscore,$count),
-                 $key;
+  printf "% 6.1f %15s -- %s\n", $totscore/$count,
+    (sprintf "( % 7.1f / %3d )",$totscore,$count), $key;
 
   if ($opt_clean) {
     delete $h{"$key|totscore"};
@@ -73,7 +95,7 @@
 
 =head1 SYNOPSIS
 
-B<check_whitelist> [--clean] [--min n] [dbfile]
+B<check_whitelist> [--clean] [--min n] [--addr s | --domain s] [dbfile]
 
 =head1 DESCRIPTION
 
@@ -97,6 +119,15 @@
 used.  The default is C<2>, so entries that have only been seen once are
 deleted.
 
+=item --addr s
+
+Select an individual address to list, or to delete when C<--clean> is used.
+
+=item --domain s
+
+Select a domain to list, or to delete when C<--clean> is used: all addresses
+@ that domain or @ any sub-domain of that domain are selected.
+
 =back
 
 =head1 OUTPUT
@@ -107,8 +138,8 @@
 
 For example:
 
-     0.0         (0.0/7)  --  [EMAIL PROTECTED]|ip=208.192
-    21.8        (43.7/2)  --  [EMAIL PROTECTED]|ip=200.106
+  0.0 (     0.0 /   7 ) -- [EMAIL PROTECTED]|ip=208.192
+ 21.8 (    43.7 /   2 ) -- [EMAIL PROTECTED]|ip=200.106
 
 C<AVG> is the average score;  C<TOTSCORE> is the total score of all mails seen
 so far;  C<COUNT> is the number of messages seen from that sender;  C<EMAIL> is

-- John
#!/usr/bin/perl
#
# TODO: should this be made a top-level script, called "sa-awl"?

sub usage {
  die "
usage: check_whitelist [--clean] [--min n] [--addr addr | --domain domain] [dbfile]
";
}

use strict;
use Fcntl;
use Getopt::Long;

use vars qw(
                $opt_clean $opt_min $opt_addr $opt_domain $opt_help
        );

GetOptions(
  'clean'               => \$opt_clean,
  'min:i'               => \$opt_min,
  'addr:s'              => \$opt_addr,
  'domain:s'            => \$opt_domain,
  'help'                => \$opt_help
) or usage();
$opt_help and usage();

$opt_min ||= 2;
$opt_addr ||= '';
$opt_domain ||= '';

if ($opt_addr ne '' && $opt_domain ne '') {
        die "addr and domain options are mutually exclusive\n";
}

BEGIN { @AnyDBM_File::ISA = qw(DB_File GDBM_File NDBM_File SDBM_File); }
use AnyDBM_File ;

my $db;
if ($#ARGV == -1) {
  $db = $ENV{HOME}."/.spamassassin/auto-whitelist";
} else {
  $db = $ARGV[0];
}

my %h;
if ($opt_clean) {
  tie %h, "AnyDBM_File",$db, O_RDWR,0600
      or die "Cannot open r/w file $db: $!\n";
} else {
  tie %h, "AnyDBM_File",$db, O_RDONLY,0600
      or die "Cannot open file $db: $!\n";
}

my @k = grep(!/totscore$/,keys(%h));
for my $key (@k)
{
  my $totscore = $h{"$key|totscore"};
  my $count = $h{$key};
  next unless defined($totscore);

  # There are 3 reasons to skip a given key:
  # 1. clean was given (but no addr or domain) and the count is at least min.
  if ($opt_clean && $count >= $opt_min && $opt_addr eq '' && $opt_domain eq '') {
    #printf "skipping (count) %s\n", $key;
    next;
  }
  # 2. An addr was specified but the key does not match.
  if ($opt_addr ne '' && !($key =~ /^\Q$opt_addr\E/)) {
    #printf "skipping (addr) %s\n", $key;
    next;
  }
  # 3. A domain was specified but the key does not match.
  if ($opt_domain ne '' && !($key =~ /[EMAIL PROTECTED]|/)) {
    #printf "skipping (domain) '%s'\n", $key;
    next;
  }
  if ($opt_clean) {
    print "cleaning: ";
  }

  printf "% 6.1f %15s -- %s\n", $totscore/$count,
    (sprintf "( % 7.1f / %3d )",$totscore,$count), $key;

  if ($opt_clean) {
    delete $h{"$key|totscore"};
    delete $h{$key};
  }
}
untie %h;

=head1 NAME

check_whitelist - examine and manipulate SpamAssassin's auto-whitelist db

=head1 SYNOPSIS

B<check_whitelist> [--clean] [--min n] [--addr s | --domain s] [dbfile]

=head1 DESCRIPTION

Check or clean a SpamAssassin auto-whitelist (AWL) database file.

The name of the file is specified after any options, as C<dbfile>.
The default is C<$HOME/.spamassassin/auto-whitelist>.

=head1 OPTIONS

=over 4

=item --clean

Clean out infrequently-used AWL entries.  The C<--min> switch can be
used to select the threshold at which entries are kept or deleted.

=item --min n

Select the threshold at which entries are kept or deleted when C<--clean> is
used.  The default is C<2>, so entries that have only been seen once are
deleted.

=item --addr s

Select an individual address to list, or to delete when C<--clean> is used.

=item --domain s

Select a domain to list, or to delete when C<--clean> is used: all addresses
@ that domain or @ any sub-domain of that domain are selected.

=back

=head1 OUTPUT

The output looks like this:

     AVG  (TOTSCORE/COUNT)  --  EMAIL|ip=IPBASE

For example:

  0.0 (     0.0 /   7 ) -- [EMAIL PROTECTED]|ip=208.192
 21.8 (    43.7 /   2 ) -- [EMAIL PROTECTED]|ip=200.106

C<AVG> is the average score;  C<TOTSCORE> is the total score of all mails seen
so far;  C<COUNT> is the number of messages seen from that sender;  C<EMAIL> is
the sender's email address, and C<IPBASE> is the B<AWL base IP address>.

B<AWL base IP address> is an approximate way to identify the IP address a
sender frequently sends from, while remaining hard for spammers to spoof.
The algorithm is as follows:

  - take the last Received header that contains a public IP address -- namely
    one which is not in private, unrouted IP space.

  - chop off the last two octets, assuming that the user may be in an ISP's
    dynamic address pool.

=cut
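
As an aside, the AWL base IP rule described in the POD above can be sketched
roughly as follows.  This is only an illustration of the two steps, not
SpamAssassin's code: is_private and awl_base_ip are hypothetical helpers, the
list of private ranges is a simplified assumption, and the input is assumed to
be the IPv4 addresses already extracted from the Received headers, in header
order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified test for private or otherwise unrouted IPv4 space (assumed list:
# 10/8, 127/8, 169.254/16, 172.16/12, 192.168/16).
sub is_private {
  my ($ip) = @_;
  return $ip =~ /^(?:10|127)\./
      || $ip =~ /^169\.254\./
      || $ip =~ /^172\.(?:1[6-9]|2[0-9]|3[01])\./
      || $ip =~ /^192\.168\./;
}

# Hypothetical helper: given the IPv4 addresses found in the Received headers,
# in header order, return the first two octets of the last public one.
sub awl_base_ip {
  my @ips = @_;
  for my $ip (reverse @ips) {
    next if is_private($ip);
    return $1 if $ip =~ /^(\d+\.\d+)\.\d+\.\d+$/;
  }
  return;   # no public address found
}

print awl_base_ip('192.168.1.5', '208.192.0.1', '10.0.0.3'), "\n";
```

Dropping the last two octets means a sender on a dynamic address pool still
maps to the same AWL entry.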
