I removed IO::SigGuard dependency from my diff and I will work on deleting Email::MIME as well. Giovanni
On 10/17/21 16:58, Henrik K wrote: > > Atleast these seem completely unneeded module dependencies. > > IO::SigGuard (not even found in Ubuntu packages) > Email::MIME > > So the code should be refactored to use SA methods as necessary. > > > On Sat, Oct 16, 2021 at 11:06:07PM -0400, Kevin A. McGrail wrote: >> No worries there that I know of. >> >> cPanel has the paperwork for CCLA on file and several people with ICLA's as >> well. They've given us permission to commit the code too. >> >> I think it will be better than any dependency on external binaries. >> >> Regards, >> >> KAM >> >> On 10/14/2021 10:37 AM, Henrik K wrote: >>> If that's the case, I probably wouldn't have any objections. Not sure if it >>> requires some Contributor License Agreement from cPanels part (maybe they >>> already have one), and I guess atleast a bug to make it official.. Sidney >>> or KAM can probably chime in on the admin side.. >>> >>> >>> On Thu, Oct 14, 2021 at 04:32:53PM +0200, Giovanni Bechis wrote: >>>> Once committed, code will be no more developed by cPanel on CPAN >>>> and original code will be removed. >>>> >>>> I can work to integrate old and new Pyzor versions. >>>> >>>> Giovanni >>>> >>>> On Thu, Oct 14, 2021 at 05:27:16PM +0300, Henrik K wrote: >>>>> If it's developed by cPanel in CPAN, then it should not be committed to >>>>> SA, >>>>> unless it's clearly donated to SpamAssassin and removed from CPAN. >>>>> Assuming >>>>> we have developer resources and will to take it aboard. >>>>> >>>>> As it is, Plugin/Pyzor.pm should have an option to choose which one to >>>>> use, >>>>> as it makes no sense to ditch support for the widely installed original >>>>> Pyzor. >>>>> >>>>> >>>>> On Thu, Oct 14, 2021 at 04:15:13PM +0200, Giovanni Bechis wrote: >>>>>> Hi, >>>>>> cPanel has developed a native Perl Pyzor implementation for SpamAssassin >>>>>> and a diff against SpamAssassin 4.0 follows. >>>>>> Atm I am using it in production on a small server, more tests and >>>>>> opinions are welcome. >>>>>> >>>>>> Original cPanel code is at https://metacpan.org/pod/Mail::Pyzor. >>>>>> >>>>>> Cheers >>>>>> Giovanni >>>>>> >>>>>> diff --git a/MANIFEST b/MANIFEST >>>>>> index 25d0192..2d9588c 100644 >>>>>> --- a/MANIFEST >>>>>> +++ b/MANIFEST >>>>>> @@ -126,6 +126,11 @@ lib/Mail/SpamAssassin/Plugin/WLBLEval.pm >>>>>> lib/Mail/SpamAssassin/Plugin/WhiteListSubject.pm >>>>>> lib/Mail/SpamAssassin/PluginHandler.pm >>>>>> lib/Mail/SpamAssassin/Plugin/URILocalBL.pm >>>>>> +lib/Mail/SpamAssassin/Pyzor/Client.pm >>>>>> +lib/Mail/SpamAssassin/Pyzor/Digest/Pieces.pm >>>>>> +lib/Mail/SpamAssassin/Pyzor/Digest/StripHtml.pm >>>>>> +lib/Mail/SpamAssassin/Pyzor/Digest.pm >>>>>> +lib/Mail/SpamAssassin/Pyzor.pm >>>>>> lib/Mail/SpamAssassin/RegistryBoundaries.pm >>>>>> lib/Mail/SpamAssassin/Reporter.pm >>>>>> lib/Mail/SpamAssassin/SQLBasedAddrList.pm >>>>>> diff --git a/lib/Mail/SpamAssassin/Plugin/Pyzor.pm >>>>>> b/lib/Mail/SpamAssassin/Plugin/Pyzor.pm >>>>>> index 3efd4b4..e4c9c05 100644 >>>>>> --- a/lib/Mail/SpamAssassin/Plugin/Pyzor.pm >>>>>> +++ b/lib/Mail/SpamAssassin/Plugin/Pyzor.pm >>>>>> @@ -36,17 +36,13 @@ package Mail::SpamAssassin::Plugin::Pyzor; >>>>>> use Mail::SpamAssassin::Plugin; >>>>>> use Mail::SpamAssassin::Logger; >>>>>> -use Mail::SpamAssassin::Timeout; >>>>>> -use Mail::SpamAssassin::Util qw(untaint_var untaint_file_path >>>>>> - proc_status_ok exit_status_str); >>>>>> +use Mail::SpamAssassin::Util qw(untaint_var); >>>>>> + >>>>>> use strict; >>>>>> use warnings; >>>>>> # use bytes; >>>>>> use re 'taint'; >>>>>> -use Storable; >>>>>> -use POSIX qw(PIPE_BUF WNOHANG _exit); >>>>>> - >>>>>> our @ISA = qw(Mail::SpamAssassin::Plugin); >>>>>> sub new { >>>>>> @@ -78,7 +74,7 @@ sub set_config { >>>>>> my ($self, $conf) = @_; >>>>>> my @cmds; >>>>>> -=head1 USER OPTIONS >>>>>> +=head1 ADMINISTRATOR OPTIONS >>>>>> =over 4 >>>>>> @@ -95,22 +91,7 @@ Whether to use Pyzor, if it is available. >>>>>> type => $Mail::SpamAssassin::Conf::CONF_TYPE_BOOL >>>>>> }); >>>>>> -=item pyzor_fork (0|1) (default: 0) >>>>>> - >>>>>> -Instead of running Pyzor synchronously, fork separate process for it and >>>>>> -read the results in later (similar to async DNS lookups). Increases >>>>>> -throughput. Experimental. >>>>>> - >>>>>> -=cut >>>>>> - >>>>>> - push(@cmds, { >>>>>> - setting => 'pyzor_fork', >>>>>> - is_admin => 1, >>>>>> - default => 0, >>>>>> - type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC, >>>>>> - }); >>>>>> - >>>>>> -=item pyzor_count_min NUMBER (default: 5) >>>>>> +=item pyzor_count_min NUMBER (default: 5) >>>>>> This option sets how often a message's body checksum must have been >>>>>> reported to the Pyzor server before SpamAssassin will consider the >>>>>> Pyzor >>>>>> @@ -128,54 +109,8 @@ set this to a relatively low value, e.g. C<5>. >>>>>> type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC >>>>>> }); >>>>>> - # Deprecated setting, the name makes no sense! >>>>>> - push (@cmds, { >>>>>> - setting => 'pyzor_max', >>>>>> - is_admin => 1, >>>>>> - type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC, >>>>>> - code => sub { >>>>>> - my ($self, $key, $value, $line) = @_; >>>>>> - warn("deprecated setting used, change pyzor_max to >>>>>> pyzor_count_min\n"); >>>>>> - if ($value !~ /^\d+$/) { >>>>>> - return $Mail::SpamAssassin::Conf::INVALID_VALUE; >>>>>> - } >>>>>> - $self->{pyzor_count_min} = $value; >>>>>> - } >>>>>> - }); >>>>>> - >>>>>> -=item pyzor_whitelist_min NUMBER (default: 10) >>>>>> - >>>>>> -This option sets how often a message's body checksum must have been >>>>>> -whitelisted to the Pyzor server for SpamAssassin to consider ignoring >>>>>> the >>>>>> -result. Final decision is made by pyzor_whitelist_factor. >>>>>> - >>>>>> -=cut >>>>>> - >>>>>> - push (@cmds, { >>>>>> - setting => 'pyzor_whitelist_min', >>>>>> - is_admin => 1, >>>>>> - default => 10, >>>>>> - type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC >>>>>> - }); >>>>>> - >>>>>> -=item pyzor_whitelist_factor NUMBER (default: 0.2) >>>>>> - >>>>>> -Ignore Pyzor result if REPORTCOUNT x NUMBER >= pyzor_whitelist_min. >>>>>> -For default setting this means: 50 reports requires 10 whitelistings. >>>>>> - >>>>>> -=cut >>>>>> - >>>>>> - push (@cmds, { >>>>>> - setting => 'pyzor_whitelist_factor', >>>>>> - is_admin => 1, >>>>>> - default => 0.2, >>>>>> - type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC >>>>>> - }); >>>>>> - >>>>>> =back >>>>>> -=head1 ADMINISTRATOR OPTIONS >>>>>> - >>>>>> =over 4 >>>>>> =item pyzor_timeout n (default: 5) >>>>>> @@ -210,478 +145,182 @@ removing one of them. >>>>>> type => $Mail::SpamAssassin::Conf::CONF_TYPE_DURATION >>>>>> }); >>>>>> -=item pyzor_options options >>>>>> +=item pyzor_whitelist_min NUMBER (default: 10) >>>>>> -Specify additional options to the pyzor(1) command. Please note that >>>>>> only >>>>>> -characters in the range [0-9A-Za-z =,._/-] are allowed for security >>>>>> reasons. >>>>>> +This option sets how often a message's body checksum must have been >>>>>> +whitelisted to the Pyzor server for SpamAssassin to consider ignoring >>>>>> the >>>>>> +result. Final decision is made by pyzor_whitelist_factor. >>>>>> =cut >>>>>> push (@cmds, { >>>>>> - setting => 'pyzor_options', >>>>>> + setting => 'pyzor_whitelist_min', >>>>>> is_admin => 1, >>>>>> - default => '', >>>>>> - type => $Mail::SpamAssassin::Conf::CONF_TYPE_STRING, >>>>>> - code => sub { >>>>>> - my ($self, $key, $value, $line) = @_; >>>>>> - if ($value !~ m{^([0-9A-Za-z =,._/-]+)$}) { >>>>>> - return $Mail::SpamAssassin::Conf::INVALID_VALUE; >>>>>> - } >>>>>> - $self->{pyzor_options} = $1; >>>>>> - } >>>>>> + default => 10, >>>>>> + type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC >>>>>> }); >>>>>> -=item pyzor_path STRING >>>>>> +=item pyzor_whitelist_factor NUMBER (default: 0.2) >>>>>> -This option tells SpamAssassin specifically where to find the C<pyzor> >>>>>> -client instead of relying on SpamAssassin to find it in the current >>>>>> -PATH. Note that if I<taint mode> is enabled in the Perl interpreter, >>>>>> -you should use this, as the current PATH will have been cleared. >>>>>> +Ignore Pyzor result if REPORTCOUNT x NUMBER >= pyzor_whitelist_min. >>>>>> +For default setting this means: 50 reports requires 10 whitelistings. >>>>>> =cut >>>>>> push (@cmds, { >>>>>> - setting => 'pyzor_path', >>>>>> + setting => 'pyzor_whitelist_factor', >>>>>> is_admin => 1, >>>>>> - default => undef, >>>>>> - type => $Mail::SpamAssassin::Conf::CONF_TYPE_STRING, >>>>>> - code => sub { >>>>>> - my ($self, $key, $value, $line) = @_; >>>>>> - if (!defined $value || !length $value) { >>>>>> - return $Mail::SpamAssassin::Conf::MISSING_REQUIRED_VALUE; >>>>>> - } >>>>>> - $value = untaint_file_path($value); >>>>>> - if (!-x $value) { >>>>>> - info("config: pyzor_path \"$value\" isn't an executable"); >>>>>> - return $Mail::SpamAssassin::Conf::INVALID_VALUE; >>>>>> - } >>>>>> - >>>>>> - $self->{pyzor_path} = $value; >>>>>> - } >>>>>> + default => 0.2, >>>>>> + type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC >>>>>> }); >>>>>> $conf->{parser}->register_commands(\@cmds); >>>>>> } >>>>>> sub is_pyzor_available { >>>>>> - my ($self) = @_; >>>>>> + my ($self) = @_; >>>>>> - my $pyzor = $self->{main}->{conf}->{pyzor_path} || >>>>>> - Mail::SpamAssassin::Util::find_executable_in_env_path('pyzor'); >>>>>> - >>>>>> - unless ($pyzor && -x $pyzor) { >>>>>> - dbg("pyzor: no pyzor executable found"); >>>>>> - $self->{pyzor_available} = 0; >>>>>> - return 0; >>>>>> - } >>>>>> - >>>>>> - # remember any found pyzor >>>>>> - $self->{main}->{conf}->{pyzor_path} = $pyzor; >>>>>> - >>>>>> - dbg("pyzor: pyzor is available: $pyzor"); >>>>>> - return 1; >>>>>> + local $@; >>>>>> + eval { >>>>>> + require Mail::SpamAssassin::Pyzor::Digest; >>>>>> + require Mail::SpamAssassin::Pyzor::Client; >>>>>> + }; >>>>>> + return $@ ? 0 : 1; >>>>>> } >>>>>> -sub finish_parsing_start { >>>>>> - my ($self, $opts) = @_; >>>>>> +sub get_pyzor_interface { >>>>>> + my ($self) = @_; >>>>>> - # If forking, hard adjust priority -100 to launch early >>>>>> - # Find rulenames from eval_to_rule mappings >>>>>> - if ($opts->{conf}->{pyzor_fork}) { >>>>>> - foreach (@{$opts->{conf}->{eval_to_rule}->{check_pyzor}}) { >>>>>> - dbg("pyzor: adjusting rule $_ priority to -100"); >>>>>> - $opts->{conf}->{priority}->{$_} = -100; >>>>>> - } >>>>>> + if (!$self->{main}->{conf}->{use_pyzor}) { >>>>>> + dbg("pyzor: use_pyzor option not enabled, disabling Pyzor"); >>>>>> + $self->{pyzor_interface} = "disabled"; >>>>>> + $self->{pyzor_available} = 0; >>>>>> + } >>>>>> + elsif ($self->is_pyzor_available()) { >>>>>> + $self->{pyzor_interface} = "pyzor"; >>>>>> + $self->{pyzor_available} = 1; >>>>>> + } >>>>>> + else { >>>>>> + dbg("pyzor: no pyzor found, disabling Pyzor"); >>>>>> + $self->{pyzor_available} = 0; >>>>>> } >>>>>> } >>>>>> sub check_pyzor { >>>>>> - my ($self, $pms, $full) = @_; >>>>>> - >>>>>> - return 0 if !$self->{pyzor_available}; >>>>>> - return 0 if !$self->{main}->{conf}->{use_pyzor}; >>>>>> - >>>>>> - return 0 if $pms->{pyzor_running}; >>>>>> - $pms->{pyzor_running} = 1; >>>>>> - >>>>>> - return 0 if !$self->is_pyzor_available(); >>>>>> - >>>>>> - my $timer = $self->{main}->time_method("check_pyzor"); >>>>>> + my ($self, $permsgstatus, $full) = @_; >>>>>> # initialize valid tags >>>>>> - $pms->{tag_data}->{PYZOR} = ''; >>>>>> - >>>>>> - # create fulltext tmpfile now (before possible forking) >>>>>> - $pms->{pyzor_tmpfile} = $pms->create_fulltext_tmpfile(); >>>>>> - >>>>>> - ## non-forking method >>>>>> - >>>>>> - if (!$self->{main}->{conf}->{pyzor_fork}) { >>>>>> - my @results = $self->pyzor_lookup($pms); >>>>>> - return $self->_check_result($pms, \@results); >>>>>> - } >>>>>> - >>>>>> - ## forking method >>>>>> - >>>>>> - $pms->{pyzor_rulename} = $pms->get_current_eval_rule_name(); >>>>>> - $pms->rule_pending($pms->{pyzor_rulename}); # mark async >>>>>> - >>>>>> - # create socketpair for communication >>>>>> - $pms->{pyzor_backchannel} = >>>>>> Mail::SpamAssassin::SubProcBackChannel->new(); >>>>>> - my $back_selector = ''; >>>>>> - $pms->{pyzor_backchannel}->set_selector(\$back_selector); >>>>>> - eval { >>>>>> - $pms->{pyzor_backchannel}->setup_backchannel_parent_pre_fork(); >>>>>> - } or do { >>>>>> - dbg("pyzor: backchannel pre-setup failed: $@"); >>>>>> - delete $pms->{pyzor_backchannel}; >>>>>> - return 0; >>>>>> - }; >>>>>> + $permsgstatus->{tag_data}->{PYZOR} = ""; >>>>>> - my $pid = fork(); >>>>>> - if (!defined $pid) { >>>>>> - info("pyzor: child fork failed: $!"); >>>>>> - delete $pms->{pyzor_backchannel}; >>>>>> - return 0; >>>>>> - } >>>>>> - if (!$pid) { >>>>>> - $0 = "$0 (pyzor)"; >>>>>> - $SIG{CHLD} = 'DEFAULT'; >>>>>> - $SIG{PIPE} = 'IGNORE'; >>>>>> - $SIG{$_} = sub { >>>>>> - eval { dbg("pyzor: child process $$ caught signal $_[0]"); }; >>>>>> - _exit(6); # avoid END and destructor processing >>>>>> - kill('KILL',$$); # still kicking? die! >>>>>> - } foreach qw(INT HUP TERM TSTP QUIT USR1 USR2); >>>>>> - dbg("pyzor: child process $$ forked"); >>>>>> - $pms->{pyzor_backchannel}->setup_backchannel_child_post_fork(); >>>>>> - my @results = $self->pyzor_lookup($pms); >>>>>> - my $backmsg; >>>>>> - eval { >>>>>> - $backmsg = Storable::freeze(\@results); >>>>>> - }; >>>>>> - if ($@) { >>>>>> - dbg("pyzor: child return value freeze failed: $@"); >>>>>> - _exit(0); # avoid END and destructor processing >>>>>> - } >>>>>> - if (!syswrite($pms->{pyzor_backchannel}->{parent}, $backmsg)) { >>>>>> - dbg("pyzor: child backchannel write failed: $!"); >>>>>> - } >>>>>> - _exit(0); # avoid END and destructor processing >>>>>> - } >>>>>> - >>>>>> - $pms->{pyzor_pid} = $pid; >>>>>> + my $timer = $self->{main}->time_method("check_pyzor"); >>>>>> - eval { >>>>>> - $pms->{pyzor_backchannel}->setup_backchannel_parent_post_fork($pid); >>>>>> - } or do { >>>>>> - dbg("pyzor: backchannel post-setup failed: $@"); >>>>>> - delete $pms->{pyzor_backchannel}; >>>>>> - return 0; >>>>>> - }; >>>>>> + $self->get_pyzor_interface(); >>>>>> + return 0 unless $self->{pyzor_available}; >>>>>> - return 0; >>>>>> + return $self->pyzor_lookup($permsgstatus, $full); >>>>>> } >>>>>> sub pyzor_lookup { >>>>>> - my ($self, $pms) = @_; >>>>>> - >>>>>> - my $conf = $self->{main}->{conf}; >>>>>> - my $timeout = $conf->{pyzor_timeout}; >>>>>> - >>>>>> - # note: not really tainted, this came from system configuration file >>>>>> - my $path = untaint_file_path($conf->{pyzor_path}); >>>>>> - my $opts = untaint_var($conf->{pyzor_options}) || ''; >>>>>> - >>>>>> - $pms->enter_helper_run_mode(); >>>>>> - >>>>>> - my $pid; >>>>>> - my @resp; >>>>>> - my $timer = Mail::SpamAssassin::Timeout->new( >>>>>> - { secs => $timeout, deadline => $pms->{master_deadline} }); >>>>>> - my $err = $timer->run_and_catch(sub { >>>>>> - local $SIG{PIPE} = sub { die "__brokenpipe__ignore__\n" }; >>>>>> - >>>>>> - dbg("pyzor: opening pipe: ". >>>>>> - join(' ', $path, $opts, "check", "<".$pms->{pyzor_tmpfile})); >>>>>> - >>>>>> - $pid = Mail::SpamAssassin::Util::helper_app_pipe_open(*PYZOR, >>>>>> - $pms->{pyzor_tmpfile}, 1, $path, split(' ', $opts), "check"); >>>>>> - $pid or die "$!\n"; >>>>>> - >>>>>> - # read+split avoids a Perl I/O bug (Bug 5985) >>>>>> - my($inbuf, $nread); >>>>>> - my $resp = ''; >>>>>> - while ($nread = read(PYZOR, $inbuf, 8192)) { $resp .= $inbuf } >>>>>> - defined $nread or die "error reading from pipe: $!"; >>>>>> - @resp = split(/^/m, $resp, -1); >>>>>> - >>>>>> - my $errno = 0; >>>>>> - close PYZOR or $errno = $!; >>>>>> - if (proc_status_ok($?, $errno)) { >>>>>> - dbg("pyzor: [%s] finished successfully", $pid); >>>>>> - } elsif (proc_status_ok($?, $errno, 0, 1)) { # sometimes it exits >>>>>> with 1 >>>>>> - dbg("pyzor: [%s] finished: %s", $pid, exit_status_str($?, >>>>>> $errno)); >>>>>> - } else { >>>>>> - info("pyzor: [%s] error: %s", $pid, exit_status_str($?, $errno)); >>>>>> - } >>>>>> - >>>>>> - }); >>>>>> - >>>>>> - if (defined(fileno(*PYZOR))) { # still open >>>>>> - if ($pid) { >>>>>> - if (kill('TERM', $pid)) { >>>>>> - dbg("pyzor: killed stale helper [$pid]"); >>>>>> - } else { >>>>>> - dbg("pyzor: killing helper application [$pid] failed: $!"); >>>>>> - } >>>>>> - } >>>>>> - my $errno = 0; >>>>>> - close PYZOR or $errno = $!; >>>>>> - proc_status_ok($?, $errno) >>>>>> - or info("pyzor: [%s] error: %s", $pid, exit_status_str($?, >>>>>> $errno)); >>>>>> - } >>>>>> - >>>>>> - $pms->leave_helper_run_mode(); >>>>>> - >>>>>> - if ($timer->timed_out()) { >>>>>> - dbg("pyzor: check timed out after $timeout seconds"); >>>>>> - return (); >>>>>> - } elsif ($err) { >>>>>> - chomp $err; >>>>>> - info("pyzor: check failed: $err"); >>>>>> - return (); >>>>>> - } >>>>>> - >>>>>> - return @resp; >>>>>> -} >>>>>> - >>>>>> -sub check_tick { >>>>>> - my ($self, $opts) = @_; >>>>>> - $self->_check_forked_result($opts->{permsgstatus}, 0); >>>>>> -} >>>>>> - >>>>>> -sub check_cleanup { >>>>>> - my ($self, $opts) = @_; >>>>>> - $self->_check_forked_result($opts->{permsgstatus}, 1); >>>>>> -} >>>>>> - >>>>>> -sub _check_forked_result { >>>>>> - my ($self, $pms, $finish) = @_; >>>>>> - >>>>>> - return 0 if !$pms->{pyzor_backchannel}; >>>>>> - return 0 if !$pms->{pyzor_pid}; >>>>>> + my ( $self, $permsgstatus, $fulltext ) = @_; >>>>>> + my $conf = $self->{main}->{conf}; >>>>>> + my $timeout = $conf->{pyzor_timeout}; >>>>>> + >>>>>> + my $client = ( $self->{'_pyzor_client'} ||= >>>>>> Mail::SpamAssassin::Pyzor::Client->new( 'timeout' => $timeout ) ); >>>>>> + my $digest = Mail::SpamAssassin::Pyzor::Digest::get( $fulltext ); >>>>>> + >>>>>> + local $@; >>>>>> + my $ref = eval { $client->check($digest); }; >>>>>> + dbg("pyzor: got response: $client->{'_server_host'}"); >>>>>> + # $client reply must be an hash >>>>>> + return 0 if (not (ref $ref eq ref {})); >>>>>> + if ($@) { >>>>>> + my $err = $@; >>>>>> - my $timer = $self->{main}->time_method("check_pyzor"); >>>>>> + $err = eval { $err->get_message() } || $err; >>>>>> - $pms->{pyzor_abort} = $pms->{deadline_exceeded} || >>>>>> $pms->{shortcircuited}; >>>>>> - >>>>>> - my $kid_pid = $pms->{pyzor_pid}; >>>>>> - # if $finish, force waiting for the child >>>>>> - my $pid = waitpid($kid_pid, $finish && !$pms->{pyzor_abort} ? 0 : >>>>>> WNOHANG); >>>>>> - if ($pid == 0) { >>>>>> - #dbg("pyzor: child process $kid_pid not finished yet, trying >>>>>> later"); >>>>>> - if ($pms->{pyzor_abort}) { >>>>>> - dbg("pyzor: bailing out due to deadline/shortcircuit"); >>>>>> - kill('TERM', $kid_pid); >>>>>> - if (waitpid($kid_pid, WNOHANG) == 0) { >>>>>> - sleep(1); >>>>>> - if (waitpid($kid_pid, WNOHANG) == 0) { >>>>>> - dbg("pyzor: child process $kid_pid still alive, KILL"); >>>>>> - kill('KILL', $kid_pid); >>>>>> - waitpid($kid_pid, 0); >>>>>> + warn("pyzor: check failed: $err\n"); >>>>>> + return 0; >>>>>> + } elsif ( defined $ref->{'Code'} and $ref->{'Code'} ne 200 ) { >>>>>> + if(defined $ref->{'Code'} and defined $ref->{'Diag'}) { >>>>>> + dbg("pyzor: check failed with invalid code: $ref->{'Code'}: >>>>>> $ref->{'Diag'}"); >>>>>> + } else { >>>>>> + dbg("pyzor: check failed with undefined code"); >>>>>> } >>>>>> - } >>>>>> - delete $pms->{pyzor_pid}; >>>>>> - delete $pms->{pyzor_backchannel}; >>>>>> + return 0; >>>>>> } >>>>>> - return 0; >>>>>> - } elsif ($pid == -1) { >>>>>> - # child does not exist? >>>>>> - dbg("pyzor: child process $kid_pid already handled?"); >>>>>> - delete $pms->{pyzor_backchannel}; >>>>>> - return 0; >>>>>> - } >>>>>> - $pms->rule_ready($pms->{pyzor_rulename}); # mark rule ready for metas >>>>>> + my $pyzor_count = untaint_var($ref->{'Count'}) + 0; >>>>>> + my $pyzor_whitelisted = untaint_var($ref->{'WL-Count'}) + 0; >>>>>> + my $count_min = $conf->{pyzor_count_min}; >>>>>> + my $wl_min = $conf->{pyzor_whitelist_min}; >>>>>> - dbg("pyzor: child process $kid_pid finished, reading results"); >>>>>> + my $wl_limit = $pyzor_whitelisted >= $wl_min ? >>>>>> + $pyzor_count * $conf->{pyzor_whitelist_factor} : 0; >>>>>> - my $backmsg; >>>>>> - my $ret = sysread($pms->{pyzor_backchannel}->{latest_kid_fh}, >>>>>> $backmsg, PIPE_BUF); >>>>>> - if (!defined $ret || $ret == 0) { >>>>>> - dbg("pyzor: could not read result from child: ".($ret == 0 ? 0 : >>>>>> $!)); >>>>>> - delete $pms->{pyzor_backchannel}; >>>>>> - return 0; >>>>>> - } >>>>>> - >>>>>> - delete $pms->{pyzor_backchannel}; >>>>>> + $permsgstatus->set_tag('PYZOR', "Reported $pyzor_count times, >>>>>> whitelisted $pyzor_whitelisted times."); >>>>>> - my $results; >>>>>> - eval { >>>>>> - $results = Storable::thaw($backmsg); >>>>>> - }; >>>>>> - if ($@) { >>>>>> - dbg("pyzor: child return value thaw failed: $@"); >>>>>> - return; >>>>>> - } >>>>>> - >>>>>> - $self->_check_result($pms, $results); >>>>>> -} >>>>>> + dbg("pyzor: result: COUNT=$pyzor_count/$count_min >>>>>> WHITELIST=$pyzor_whitelisted/$wl_min/%.1f", >>>>>> + $wl_limit); >>>>>> -sub _check_result { >>>>>> - my ($self, $pms, $results) = @_; >>>>>> - >>>>>> - if (!@$results) { >>>>>> - dbg("pyzor: no response from server"); >>>>>> - return 0; >>>>>> - } >>>>>> - >>>>>> - my $count = 0; >>>>>> - my $count_wl = 0; >>>>>> - foreach my $res (@$results) { >>>>>> - chomp($res); >>>>>> - if ($res =~ /^Traceback/) { >>>>>> - info("pyzor: internal error, python traceback seen in response: >>>>>> $res"); >>>>>> + # Empty body etc results in same hash, we should skip very large >>>>>> numbers.. >>>>>> + if ($pyzor_count >= 1000000 || $pyzor_whitelisted >= 10000) { >>>>>> + dbg("pyzor: result exceeded hardcoded limits, ignoring: count/wl >>>>>> 1000000/10000"); >>>>>> return 0; >>>>>> } >>>>>> - dbg("pyzor: got response: $res"); >>>>>> - # this regexp is intended to be a little bit forgiving >>>>>> - if ($res =~ /^\S+\t.*?\t(\d+)\t(\d+)\s*$/) { >>>>>> - # until pyzor servers can sync their DBs, >>>>>> - # sum counts obtained from all servers >>>>>> - $count += untaint_var($1)+0; # crazy but needs untainting >>>>>> - $count_wl += untaint_var($2)+0; >>>>>> - } else { >>>>>> - # warn on failures to parse >>>>>> - info("pyzor: failure to parse response \"$res\""); >>>>>> - } >>>>>> - } >>>>>> - >>>>>> - my $conf = $self->{main}->{conf}; >>>>>> - >>>>>> - my $count_min = $conf->{pyzor_count_min}; >>>>>> - my $wl_min = $conf->{pyzor_whitelist_min}; >>>>>> - my $wl_limit = $count_wl >= $wl_min ? >>>>>> - $count * $conf->{pyzor_whitelist_factor} : 0; >>>>>> - >>>>>> - dbg("pyzor: result: COUNT=$count/$count_min >>>>>> WHITELIST=$count_wl/$wl_min/%.1f", >>>>>> - $wl_limit); >>>>>> - $pms->set_tag('PYZOR', "Reported $count times, whitelisted $count_wl >>>>>> times."); >>>>>> - >>>>>> - # Empty body etc results in same hash, we should skip very large >>>>>> numbers.. >>>>>> - if ($count >= 1000000 || $count_wl >= 10000) { >>>>>> - dbg("pyzor: result exceeded hardcoded limits, ignoring: count/wl >>>>>> 1000000/10000"); >>>>>> - return 0; >>>>>> - } >>>>>> - >>>>>> - # Whitelisted? >>>>>> - if ($wl_limit && $count_wl >= $wl_limit) { >>>>>> - dbg("pyzor: message whitelisted"); >>>>>> - return 0; >>>>>> - } >>>>>> + # Whitelisted? >>>>>> + if ($wl_limit && $pyzor_whitelisted >= $wl_limit) { >>>>>> + dbg("pyzor: message whitelisted"); >>>>>> + return 0; >>>>>> + } >>>>>> - if ($count >= $count_min) { >>>>>> - if ($conf->{pyzor_fork}) { >>>>>> - # forked needs to run got_hit() >>>>>> - $pms->got_hit($pms->{pyzor_rulename}, "", ruletype => 'eval'); >>>>>> + if ( $pyzor_count >= $count_min ) { >>>>>> + return 1; >>>>>> } >>>>>> - return 1; >>>>>> - } >>>>>> - return 0; >>>>>> + return 0; >>>>>> } >>>>>> sub plugin_report { >>>>>> my ($self, $options) = @_; >>>>>> - return if !$self->{pyzor_available}; >>>>>> - return if !$self->{main}->{conf}->{use_pyzor}; >>>>>> - return if $options->{report}->{options}->{dont_report_to_pyzor}; >>>>>> - return if !$self->is_pyzor_available(); >>>>>> - >>>>>> - # use temporary file: open2() is unreliable due to buffering under >>>>>> spamd >>>>>> - my $tmpf = >>>>>> $options->{report}->create_fulltext_tmpfile($options->{text}); >>>>>> - if ($self->pyzor_report($options, $tmpf)) { >>>>>> - $options->{report}->{report_available} = 1; >>>>>> - info("reporter: spam reported to Pyzor"); >>>>>> - $options->{report}->{report_return} = 1; >>>>>> - } >>>>>> - else { >>>>>> - info("reporter: could not report spam to Pyzor"); >>>>>> - } >>>>>> - $options->{report}->delete_fulltext_tmpfile($tmpf); >>>>>> + return unless $self->{pyzor_available}; >>>>>> + return unless $self->{main}->{conf}->{use_pyzor}; >>>>>> - return 1; >>>>>> + if (!$options->{report}->{options}->{dont_report_to_pyzor} && >>>>>> $self->is_pyzor_available()) >>>>>> + { >>>>>> + if ($self->pyzor_report($options)) { >>>>>> + $options->{report}->{report_available} = 1; >>>>>> + info("reporter: spam reported to Pyzor"); >>>>>> + $options->{report}->{report_return} = 1; >>>>>> + } >>>>>> + else { >>>>>> + info("reporter: could not report spam to Pyzor"); >>>>>> + } >>>>>> + } >>>>>> } >>>>>> sub pyzor_report { >>>>>> - my ($self, $options, $tmpf) = @_; >>>>>> - >>>>>> - # note: not really tainted, this came from system configuration file >>>>>> - my $path = >>>>>> untaint_file_path($options->{report}->{conf}->{pyzor_path}); >>>>>> - my $opts = untaint_var($options->{report}->{conf}->{pyzor_options}) >>>>>> || ''; >>>>>> + my ( $self, $options ) = @_; >>>>>> - my $timeout = $self->{main}->{conf}->{pyzor_timeout}; >>>>>> + my $timeout = $self->{main}->{conf}->{pyzor_timeout}; >>>>>> - $options->{report}->enter_helper_run_mode(); >>>>>> + my $client = ( $self->{'_pyzor_client'} ||= >>>>>> Mail::SpamAssassin::Pyzor::Client->new( 'timeout' => $timeout ) ); >>>>>> - my $timer = Mail::SpamAssassin::Timeout->new({ secs => $timeout }); >>>>>> - my $err = $timer->run_and_catch(sub { >>>>>> + my $digest = Mail::SpamAssassin::Pyzor::Digest::get( >>>>>> $options->{'text'} ); >>>>>> - local $SIG{PIPE} = sub { die "__brokenpipe__ignore__\n" }; >>>>>> - >>>>>> - dbg("pyzor: opening pipe: " . join(' ', $path, $opts, "report", "< >>>>>> $tmpf")); >>>>>> - >>>>>> - my $pid = Mail::SpamAssassin::Util::helper_app_pipe_open(*PYZOR, >>>>>> - $tmpf, 1, $path, split(' ', $opts), "report"); >>>>>> - $pid or die "$!\n"; >>>>>> - >>>>>> - my($inbuf,$nread,$nread_all); $nread_all = 0; >>>>>> - # response is ignored, just check its existence >>>>>> - while ( $nread=read(PYZOR,$inbuf,8192) ) { $nread_all += $nread } >>>>>> - defined $nread or die "error reading from pipe: $!"; >>>>>> - >>>>>> - dbg("pyzor: empty response") if $nread_all < 1; >>>>>> - >>>>>> - my $errno = 0; close PYZOR or $errno = $!; >>>>>> - # closing a pipe also waits for the process executing on the pipe to >>>>>> - # complete, no need to explicitly call waitpid >>>>>> - # my $child_stat = waitpid($pid,0) > 0 ? $? : undef; >>>>>> - if (proc_status_ok($?,$errno, 0)) { >>>>>> - dbg("pyzor: [%s] reporter finished successfully", $pid); >>>>>> - } else { >>>>>> - info("pyzor: [%s] reporter error: %s", $pid, >>>>>> exit_status_str($?,$errno)); >>>>>> + local $@; >>>>>> + my $ref = eval { $client->report($digest); }; >>>>>> + if ($@) { >>>>>> + warn("pyzor: report failed: $@"); >>>>>> + return 0; >>>>>> } >>>>>> - >>>>>> - }); >>>>>> - >>>>>> - $options->{report}->leave_helper_run_mode(); >>>>>> - >>>>>> - if ($timer->timed_out()) { >>>>>> - dbg("reporter: pyzor report timed out after $timeout seconds"); >>>>>> - return 0; >>>>>> - } >>>>>> - >>>>>> - if ($err) { >>>>>> - chomp $err; >>>>>> - if ($err eq '__brokenpipe__ignore__') { >>>>>> - dbg("reporter: pyzor report failed: broken pipe"); >>>>>> - } else { >>>>>> - warn("reporter: pyzor report failed: $err\n"); >>>>>> + elsif ( $ref->{'Code'} ne 200 ) { >>>>>> + dbg("pyzor: report failed with invalid code: $ref->{'Code'}: >>>>>> $ref->{'Diag'}"); >>>>>> + return 0; >>>>>> } >>>>>> - return 0; >>>>>> - } >>>>>> - return 1; >>>>>> + return 1; >>>>>> } >>>>>> -# Version features >>>>>> -sub has_fork { 1 } >>>>>> - >>>>>> 1; >>>>>> - >>>>>> -=back >>>>>> - >>>>>> -=cut >>>>>> diff --git a/lib/Mail/SpamAssassin/Pyzor.pm >>>>>> b/lib/Mail/SpamAssassin/Pyzor.pm >>>>>> new file mode 100644 >>>>>> index 0000000..8ac27f4 >>>>>> --- /dev/null >>>>>> +++ b/lib/Mail/SpamAssassin/Pyzor.pm >>>>>> @@ -0,0 +1,56 @@ >>>>>> +package Mail::SpamAssassin::Pyzor; >>>>>> + >>>>>> +# Copyright 2018 cPanel, LLC. >>>>>> +# All rights reserved. >>>>>> +# http://cpanel.net >>>>>> +# >>>>>> +# <@LICENSE> >>>>>> +# Licensed to the Apache Software Foundation (ASF) under one or more >>>>>> +# contributor license agreements. See the NOTICE file distributed with >>>>>> +# this work for additional information regarding copyright ownership. >>>>>> +# The ASF licenses this file to you under the Apache License, Version >>>>>> 2.0 >>>>>> +# (the "License"); you may not use this file except in compliance with >>>>>> +# the License. You may obtain a copy of the License at: >>>>>> +# >>>>>> +# http://www.apache.org/licenses/LICENSE-2.0 >>>>>> +# >>>>>> +# Unless required by applicable law or agreed to in writing, software >>>>>> +# distributed under the License is distributed on an "AS IS" BASIS, >>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>>>>> implied. >>>>>> +# See the License for the specific language governing permissions and >>>>>> +# limitations under the License. >>>>>> +# </@LICENSE> >>>>>> +# >>>>>> + >>>>>> +use strict; >>>>>> +use warnings; >>>>>> + >>>>>> +our $VERSION = '0.06_01'; >>>>>> + >>>>>> +=encoding utf-8 >>>>>> + >>>>>> +=head1 NAME >>>>>> + >>>>>> +Mail::SpamAssassin::Pyzor - Pyzor spam filtering in Perl >>>>>> + >>>>>> +=head1 DESCRIPTION >>>>>> + >>>>>> +This distribution contains Perl implementations of parts of >>>>>> +L<Pyzor|http://pyzor.org>, a tool for use in spam email filtering. >>>>>> +It is intended for use with L<Mail::SpamAssassin> but may be useful >>>>>> +in other contexts. >>>>>> + >>>>>> +See the following modules for information on specific tools that >>>>>> +the distribution includes: >>>>>> + >>>>>> +=over >>>>>> + >>>>>> +=item * L<Mail::SpamAssassin::Pyzor::Client> >>>>>> + >>>>>> +=item * L<Mail::SpamAssassin::Pyzor::Digest> >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +1; >>>>>> diff --git a/lib/Mail/SpamAssassin/Pyzor/Client.pm >>>>>> b/lib/Mail/SpamAssassin/Pyzor/Client.pm >>>>>> new file mode 100644 >>>>>> index 0000000..ccff868 >>>>>> --- /dev/null >>>>>> +++ b/lib/Mail/SpamAssassin/Pyzor/Client.pm >>>>>> @@ -0,0 +1,415 @@ >>>>>> +package Mail::SpamAssassin::Pyzor::Client; >>>>>> + >>>>>> +# Copyright 2018 cPanel, LLC. >>>>>> +# All rights reserved. >>>>>> +# http://cpanel.net >>>>>> +# >>>>>> +# <@LICENSE> >>>>>> +# Licensed to the Apache Software Foundation (ASF) under one or more >>>>>> +# contributor license agreements. See the NOTICE file distributed with >>>>>> +# this work for additional information regarding copyright ownership. >>>>>> +# The ASF licenses this file to you under the Apache License, Version >>>>>> 2.0 >>>>>> +# (the "License"); you may not use this file except in compliance with >>>>>> +# the License. You may obtain a copy of the License at: >>>>>> +# >>>>>> +# http://www.apache.org/licenses/LICENSE-2.0 >>>>>> +# >>>>>> +# Unless required by applicable law or agreed to in writing, software >>>>>> +# distributed under the License is distributed on an "AS IS" BASIS, >>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>>>>> implied. >>>>>> +# See the License for the specific language governing permissions and >>>>>> +# limitations under the License. >>>>>> +# </@LICENSE> >>>>>> +# >>>>>> + >>>>>> +use strict; >>>>>> +use warnings; >>>>>> + >>>>>> +=encoding utf-8 >>>>>> + >>>>>> +=head1 NAME >>>>>> + >>>>>> +Mail::SpamAssassin::Pyzor::Client - Pyzor client logic >>>>>> + >>>>>> +=head1 SYNOPSIS >>>>>> + >>>>>> + use Mail::SpamAssassin::Pyzor::Client (); >>>>>> + use Mail::SpamAssassin::Pyzor::Digest (); >>>>>> + >>>>>> + my $client = Mail::SpamAssassin::Pyzor::Client->new(); >>>>>> + >>>>>> + my $digest = Mail::SpamAssassin::Pyzor::Digest::get( $msg ); >>>>>> + >>>>>> + my $check_ref = $client->check($digest); >>>>>> + die $check_ref->{'Diag'} if $check_ref->{'Code'} ne '200'; >>>>>> + >>>>>> + my $report_ref = $client->report($digest); >>>>>> + die $report_ref->{'Diag'} if $report_ref->{'Code'} ne '200'; >>>>>> + >>>>>> +=head1 DESCRIPTION >>>>>> + >>>>>> +A bare-bones L<Pyzor|http://pyzor.org> client that currently only >>>>>> +implements the functionality needed for L<Mail::SpamAssassin>. >>>>>> + >>>>>> +=head1 PROTOCOL DETAILS >>>>>> + >>>>>> +The Pyzor protocol is not a published standard, and there appears to be >>>>>> +no meaningful public documentation. What follows is enough information, >>>>>> +largely gleaned through forum posts and reverse engineering, to >>>>>> facilitate >>>>>> +effective use of this module: >>>>>> + >>>>>> +Pyzor is an RPC-oriented, message-based protocol. Each message >>>>>> +is a simple dictionary of 7-bit ASCII keys and values. Server responses >>>>>> +always include at least the following: >>>>>> + >>>>>> +=over >>>>>> + >>>>>> +=item * C<Code> - Similar to HTTP status codes; anything besides C<200> >>>>>> +is an error. >>>>>> + >>>>>> +=item * C<Diag> - Similar to HTTP status reasons: a text description >>>>>> +of the status. >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +(NB: There are additional standard response headers that are useful >>>>>> only for >>>>>> +the protocol itself and thus are not part of this module???s returns.) >>>>>> + >>>>>> +=head2 Reliability >>>>>> + >>>>>> +Pyzor uses UDP rather than TCP, so no message is guaranteed to reach its >>>>>> +destination. A transmission failure can happen in either the request or >>>>>> +the response; in either case, a timeout error will result. Such errors >>>>>> +are represented as thrown instances of L<Mail::Pyzor::X::Timeout>. >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +our $VERSION = '0.04'; >>>>>> + >>>>>> +our $DEFAULT_SERVER_HOST = 'public.pyzor.org'; >>>>>> +our $DEFAULT_SERVER_PORT = 24441; >>>>>> +our $DEFAULT_USERNAME = 'anonymous'; >>>>>> +our $DEFAULT_PASSWORD = ''; >>>>>> +our $DEFAULT_OP_SPEC = '20,3,60,3'; >>>>>> +our $PYZOR_PROTOCOL_VERSION = 2.1; >>>>>> +our $DEFAULT_TIMEOUT = 3.5; >>>>>> +our $READ_SIZE = 8192; >>>>>> + >>>>>> +use IO::Socket::INET (); >>>>>> +use Digest::SHA qw(sha1 sha1_hex); >>>>>> + >>>>>> +my @hash_order = ( 'Op', 'Op-Digest', 'Op-Spec', 'Thread', 'PV', >>>>>> 'User', 'Time', 'Sig' ); >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +=head1 CONSTRUCTOR >>>>>> + >>>>>> +=head2 new(%OPTS) >>>>>> + >>>>>> +Create a new pyzor client. >>>>>> + >>>>>> +=over 2 >>>>>> + >>>>>> +=item Input >>>>>> + >>>>>> +%OPTS are (all optional): >>>>>> + >>>>>> +=over 3 >>>>>> + >>>>>> +=item * C<server_host> - The pyzor server host to connect to (default is >>>>>> +C<public.pyzor.org>) >>>>>> + >>>>>> +=item * C<server_port> - The pyzor server port to connect to (default is >>>>>> +24441) >>>>>> + >>>>>> +=item * C<username> - The username to present to the pyzor server >>>>>> (default >>>>>> +is C<anonymous>) >>>>>> + >>>>>> +=item * C<password> - The password to present to the pyzor server >>>>>> (default >>>>>> +is empty) >>>>>> + >>>>>> +=item * C<timeout> - The maximum time, in seconds, to wait for a >>>>>> response >>>>>> +from the pyzor server (defeault is 3.5) >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=item Output >>>>>> + >>>>>> +=over 3 >>>>>> + >>>>>> +Returns a L<Mail::SpamAssassin::Pyzor::Client> object. >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +sub new { >>>>>> + my ( $class, %OPTS ) = @_; >>>>>> + >>>>>> + return bless { >>>>>> + '_server_host' => $OPTS{'server_host'} || $DEFAULT_SERVER_HOST, >>>>>> + '_server_port' => $OPTS{'server_port'} || $DEFAULT_SERVER_PORT, >>>>>> + '_username' => $OPTS{'username'} || $DEFAULT_USERNAME, >>>>>> + '_password' => $OPTS{'password'} || $DEFAULT_PASSWORD, >>>>>> + '_op_spec' => $DEFAULT_OP_SPEC, >>>>>> + '_timeout' => $OPTS{'timeout'} || $DEFAULT_TIMEOUT, >>>>>> + }, $class; >>>>>> +} >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +=head1 REQUEST METHODS >>>>>> + >>>>>> +=head2 report($digest) >>>>>> + >>>>>> +Report the digest of a spam message to the pyzor server. This function >>>>>> +will throw if a messaging failure or timeout happens. >>>>>> + >>>>>> +=over 2 >>>>>> + >>>>>> +=item Input >>>>>> + >>>>>> +=over 3 >>>>>> + >>>>>> +=item $digest C<SCALAR> >>>>>> + >>>>>> +The message digest to report, as given by >>>>>> +C<Mail::SpamAssassin::Pyzor::Digest::get()>. >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=item Output >>>>>> + >>>>>> +=over 3 >>>>>> + >>>>>> +=item C<HASHREF> >>>>>> + >>>>>> +Returns a hashref of the standard attributes noted above. >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +sub report { >>>>>> + my ( $self, $digest ) = @_; >>>>>> + >>>>>> + my $msg_ref = $self->_get_base_msg( 'report', $digest ); >>>>>> + >>>>>> + $msg_ref->{'Op-Spec'} = $self->{'_op_spec'}; >>>>>> + >>>>>> + return $self->_send_receive_msg($msg_ref); >>>>>> +} >>>>>> + >>>>>> +=head2 check($digest) >>>>>> + >>>>>> +Check the digest of a message to see if >>>>>> +the pyzor server has a report for it. This function >>>>>> +will throw if a messaging failure or timeout happens. >>>>>> + >>>>>> +=over 2 >>>>>> + >>>>>> +=item Input >>>>>> + >>>>>> +=over 3 >>>>>> + >>>>>> +=item $digest C<SCALAR> >>>>>> + >>>>>> +The message digest to check, as given by >>>>>> +C<Mail::SpamAssassin::Pyzor::Digest::get()>. >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=item Output >>>>>> + >>>>>> +=over 3 >>>>>> + >>>>>> +=item C<HASHREF> >>>>>> + >>>>>> +Returns a hashref of the standard attributes noted above >>>>>> +as well as the following: >>>>>> + >>>>>> +=over >>>>>> + >>>>>> +=item * C<Count> - The number of reports the server has received >>>>>> +for the given digest. >>>>>> + >>>>>> +=item * C<WL-Count> - The number of whitelist requests the server has >>>>>> received >>>>>> +for the given digest. >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=back >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +sub check { >>>>>> + my ( $self, $digest ) = @_; >>>>>> + >>>>>> + return $self->_send_receive_msg( $self->_get_base_msg( 'check', >>>>>> $digest ) ); >>>>>> +} >>>>>> + >>>>>> +# ---------------------------------------- >>>>>> + >>>>>> +sub _send_receive_msg { >>>>>> + my ( $self, $msg_ref ) = @_; >>>>>> + >>>>>> + my $thread_id = $msg_ref->{'Thread'} or warn 'No thread ID?'; >>>>>> + >>>>>> + $self->_sign_msg($msg_ref); >>>>>> + >>>>>> + return $self->_do_send_receive( >>>>>> + $self->_generate_packet_from_message($msg_ref) . "\n\n", >>>>>> + $thread_id, >>>>>> + ); >>>>>> +} >>>>>> + >>>>>> +sub _get_base_msg { >>>>>> + my ( $self, $op, $digest ) = @_; >>>>>> + >>>>>> + die "Implementor error: op is required" if !$op; >>>>>> + die "error: digest is required" if !$digest; >>>>>> + >>>>>> + return { >>>>>> + 'User' => $self->{'_username'}, >>>>>> + 'PV' => $PYZOR_PROTOCOL_VERSION, >>>>>> + 'Time' => time(), >>>>>> + 'Op' => $op, >>>>>> + 'Op-Digest' => $digest, >>>>>> + 'Thread' => $self->_generate_thread_id() >>>>>> + }; >>>>>> +} >>>>>> + >>>>>> +sub _do_send_receive { >>>>>> + my ( $self, $packet, $thread_id ) = @_; >>>>>> + >>>>>> + my $sock = $self->_get_connection_or_die(); >>>>>> + >>>>>> + $self->_send_packet( $sock, $packet ); >>>>>> + my $response = $self->_receive_packet( $sock, $thread_id ); >>>>>> + >>>>>> + return 0 if not defined $response; >>>>>> + >>>>>> + my $resp_hr = { map { ( split(m{: }) )[ 0, 1 ] } split( m{\n}, >>>>>> $response ) }; >>>>>> + >>>>>> + delete $resp_hr->{'Thread'}; >>>>>> + >>>>>> + my $response_pv = delete $resp_hr->{'PV'}; >>>>>> + >>>>>> + if ( $PYZOR_PROTOCOL_VERSION ne $response_pv ) { >>>>>> + warn "Unexpected protocol version ($response_pv) in Pyzor >>>>>> response!"; >>>>>> + } >>>>>> + >>>>>> + return $resp_hr; >>>>>> +} >>>>>> + >>>>>> +sub _receive_packet { >>>>>> + my ( $self, $sock, $thread_id ) = @_; >>>>>> + >>>>>> + my $timeout = $self->{'_timeout'} * 1000; >>>>>> + >>>>>> + my $end_time = time + $self->{'_timeout'}; >>>>>> + >>>>>> + $sock->blocking(0); >>>>>> + my $response = ''; >>>>>> + my $rout = ''; >>>>>> + my $rin = ''; >>>>>> + vec( $rin, fileno($sock), 1 ) = 1; >>>>>> + >>>>>> + while (1) { >>>>>> + my $time_left = $end_time - time; >>>>>> + >>>>>> + if ( $time_left <= 0 ) { >>>>>> + warn("Did not receive a response from the pyzor server >>>>>> $self->{'_server_host'}:$self->{'_server_port'} for $self->{'_timeout'} >>>>>> seconds!"); >>>>>> + return; >>>>>> + } >>>>>> + >>>>>> + my $bytes = sysread( $sock, $response, $READ_SIZE, length >>>>>> $response ); >>>>>> + if ( !defined($bytes) && !$!{'EAGAIN'} && !$!{'EWOULDBLOCK'} ) { >>>>>> + warn "read from socket: $!"; >>>>>> + } >>>>>> + >>>>>> + if ( index( $response, "\n\n" ) > -1 ) { >>>>>> + >>>>>> + # Reject the response unless its thread ID matches what we >>>>>> sent. >>>>>> + # This prevents confusion among concurrent Pyzor reqeusts. >>>>>> + if ( index( $response, "\nThread: $thread_id\n" ) != -1 ) { >>>>>> + last; >>>>>> + } >>>>>> + else { >>>>>> + $response = ''; >>>>>> + } >>>>>> + } >>>>>> + >>>>>> + my $found = select( $rout = $rin, undef, undef, $time_left ); >>>>>> + warn "select(): $!" if $found == -1; >>>>>> + } >>>>>> + >>>>>> + return $response; >>>>>> +} >>>>>> + >>>>>> +sub _send_packet { >>>>>> + my ( $self, $sock, $packet ) = @_; >>>>>> + >>>>>> + $sock->blocking(1); >>>>>> + syswrite( $sock, $packet ) or warn "write to socket: $!"; >>>>>> + >>>>>> + return; >>>>>> +} >>>>>> + >>>>>> +sub _get_connection_or_die { >>>>>> + my ($self) = @_; >>>>>> + >>>>>> + # clear the socket if the PID changes >>>>>> + if ( defined $self->{'_sock_pid'} && $self->{'_sock_pid'} != $$ ) { >>>>>> + undef $self->{'_sock_pid'}; >>>>>> + undef $self->{'_sock'}; >>>>>> + } >>>>>> + >>>>>> + $self->{'_sock_pid'} ||= $$; >>>>>> + $self->{'_sock'} ||= IO::Socket::INET->new( >>>>>> + 'PeerHost' => $self->{'_server_host'}, >>>>>> + 'PeerPort' => $self->{'_server_port'}, >>>>>> + 'Proto' => 'udp' >>>>>> + ) or die "Cannot connect to >>>>>> $self->{'_server_host'}:$self->{'_server_port'}: $@ $!"; >>>>>> + >>>>>> + return $self->{'_sock'}; >>>>>> +} >>>>>> + >>>>>> +sub _sign_msg { >>>>>> + my ( $self, $msg_ref ) = @_; >>>>>> + >>>>>> + $msg_ref->{'Sig'} = lc Digest::SHA::sha1_hex( >>>>>> + Digest::SHA::sha1( >>>>>> $self->_generate_packet_from_message($msg_ref) ) >>>>>> + ); >>>>>> + >>>>>> + return 1; >>>>>> +} >>>>>> + >>>>>> +sub _generate_packet_from_message { >>>>>> + my ( $self, $msg_ref ) = @_; >>>>>> + >>>>>> + return join( "\n", map { "$_: $msg_ref->{$_}" } grep { length >>>>>> $msg_ref->{$_} } @hash_order ); >>>>>> +} >>>>>> + >>>>>> +sub _generate_thread_id { >>>>>> + my $RAND_MAX = 2**16; >>>>>> + my $val = 0; >>>>>> + $val = int rand($RAND_MAX) while $val < 1024; >>>>>> + return $val; >>>>>> +} >>>>>> + >>>>>> +sub _get_user_pass_hash_key { >>>>>> + my ($self) = @_; >>>>>> + >>>>>> + return lc Digest::SHA::sha1_hex( $self->{'_username'} . ':' . >>>>>> $self->{'_password'} ); >>>>>> +} >>>>>> + >>>>>> +1; >>>>>> diff --git a/lib/Mail/SpamAssassin/Pyzor/Digest.pm >>>>>> b/lib/Mail/SpamAssassin/Pyzor/Digest.pm >>>>>> new file mode 100644 >>>>>> index 0000000..0e8a5ae >>>>>> --- /dev/null >>>>>> +++ b/lib/Mail/SpamAssassin/Pyzor/Digest.pm >>>>>> @@ -0,0 +1,103 @@ >>>>>> +package Mail::SpamAssassin::Pyzor::Digest; >>>>>> + >>>>>> +# Copyright 2018 cPanel, LLC. >>>>>> +# All rights reserved. >>>>>> +# http://cpanel.net >>>>>> +# >>>>>> +# <@LICENSE> >>>>>> +# Licensed to the Apache Software Foundation (ASF) under one or more >>>>>> +# contributor license agreements. See the NOTICE file distributed with >>>>>> +# this work for additional information regarding copyright ownership. >>>>>> +# The ASF licenses this file to you under the Apache License, Version >>>>>> 2.0 >>>>>> +# (the "License"); you may not use this file except in compliance with >>>>>> +# the License. You may obtain a copy of the License at: >>>>>> +# >>>>>> +# http://www.apache.org/licenses/LICENSE-2.0 >>>>>> +# >>>>>> +# Unless required by applicable law or agreed to in writing, software >>>>>> +# distributed under the License is distributed on an "AS IS" BASIS, >>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>>>>> implied. >>>>>> +# See the License for the specific language governing permissions and >>>>>> +# limitations under the License. >>>>>> +# </@LICENSE> >>>>>> +# >>>>>> + >>>>>> +use strict; >>>>>> +use warnings; >>>>>> + >>>>>> +=encoding utf-8 >>>>>> + >>>>>> +=head1 NAME >>>>>> + >>>>>> +Mail::SpamAssassin::Pyzor::Digest >>>>>> + >>>>>> +=head1 SYNOPSIS >>>>>> + >>>>>> + my $digest = Mail::SpamAssassin::Pyzor::Digest::get( $mime_text ); >>>>>> + >>>>>> +=head1 DESCRIPTION >>>>>> + >>>>>> +A reimplementation of >>>>>> L<https://github.com/SpamExperts/pyzor/blob/master/pyzor/digest.py>. >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +use Email::MIME (); >>>>>> + >>>>>> +use Mail::SpamAssassin::Pyzor::Digest::Pieces (); >>>>>> +use Digest::SHA qw(sha1_hex); >>>>>> + >>>>>> +our $VERSION = '0.03'; >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +=head1 FUNCTIONS >>>>>> + >>>>>> +=head2 $hex = get( $MSG ) >>>>>> + >>>>>> +This takes an email message in raw MIME text format (i.e., as saved in >>>>>> the >>>>>> +standard mbox format) and returns the message???s Pyzor digest in >>>>>> lower-case >>>>>> +hexadecimal. >>>>>> + >>>>>> +The output from this function should normally be identical to that of >>>>>> +the C<pyzor> script???s C<digest> command. It is suitable for use in >>>>>> +L<Mail::SpamAssassin::Pyzor::Client>???s request methods. >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +sub get { >>>>>> + my ($text) = @_; >>>>>> + return Digest::SHA::sha1_hex( ${ _get_predigest( $text ) } ); >>>>>> +} >>>>>> + >>>>>> +# NB: This is called from the test. >>>>>> +sub _get_predigest { ## no critic qw(RequireArgUnpacking) >>>>>> + my ($msg_text_sr) = @_; >>>>>> + >>>>>> + my $parsed = Email::MIME->new($$msg_text_sr); >>>>>> + >>>>>> + my @lines; >>>>>> + >>>>>> + my $payloads_ar = >>>>>> Mail::SpamAssassin::Pyzor::Digest::Pieces::digest_payloads($parsed); >>>>>> + >>>>>> + for my $payload (@$payloads_ar) { >>>>>> + my @p_lines = >>>>>> Mail::SpamAssassin::Pyzor::Digest::Pieces::splitlines($payload); >>>>>> + for my $line (@p_lines) { >>>>>> + Mail::SpamAssassin::Pyzor::Digest::Pieces::normalize($line); >>>>>> + >>>>>> + next if >>>>>> !Mail::SpamAssassin::Pyzor::Digest::Pieces::should_handle_line($line); >>>>>> + >>>>>> + # Make sure we have an octet string. >>>>>> + utf8::encode($line) if utf8::is_utf8($line); >>>>>> + >>>>>> + push @lines, $line; >>>>>> + } >>>>>> + } >>>>>> + >>>>>> + my $digest_sr = >>>>>> Mail::SpamAssassin::Pyzor::Digest::Pieces::assemble_lines( \@lines ); >>>>>> + >>>>>> + return $digest_sr; >>>>>> +} >>>>>> + >>>>>> +1; >>>>>> diff --git a/lib/Mail/SpamAssassin/Pyzor/Digest/Pieces.pm >>>>>> b/lib/Mail/SpamAssassin/Pyzor/Digest/Pieces.pm >>>>>> new file mode 100644 >>>>>> index 0000000..522accd >>>>>> --- /dev/null >>>>>> +++ b/lib/Mail/SpamAssassin/Pyzor/Digest/Pieces.pm >>>>>> @@ -0,0 +1,301 @@ >>>>>> +package Mail::SpamAssassin::Pyzor::Digest::Pieces; >>>>>> + >>>>>> +# Copyright 2018 cPanel, LLC. >>>>>> +# All rights reserved. >>>>>> +# http://cpanel.net >>>>>> +# >>>>>> +# <@LICENSE> >>>>>> +# Licensed to the Apache Software Foundation (ASF) under one or more >>>>>> +# contributor license agreements. See the NOTICE file distributed with >>>>>> +# this work for additional information regarding copyright ownership. >>>>>> +# The ASF licenses this file to you under the Apache License, Version >>>>>> 2.0 >>>>>> +# (the "License"); you may not use this file except in compliance with >>>>>> +# the License. You may obtain a copy of the License at: >>>>>> +# >>>>>> +# http://www.apache.org/licenses/LICENSE-2.0 >>>>>> +# >>>>>> +# Unless required by applicable law or agreed to in writing, software >>>>>> +# distributed under the License is distributed on an "AS IS" BASIS, >>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>>>>> implied. >>>>>> +# See the License for the specific language governing permissions and >>>>>> +# limitations under the License. >>>>>> +# </@LICENSE> >>>>>> +# >>>>>> + >>>>>> +use strict; >>>>>> +use warnings; >>>>>> + >>>>>> +=encoding utf-8 >>>>>> + >>>>>> +=head1 NAME >>>>>> + >>>>>> +Mail::SpamAssassin::Pyzor::Digest::Pieces >>>>>> + >>>>>> +=head1 DESCRIPTION >>>>>> + >>>>>> +This module houses backend logic for >>>>>> L<Mail::SpamAssassin::Pyzor::Digest>. >>>>>> + >>>>>> +It reimplements logic found in pyzor???s F<digest.py> module >>>>>> +(L<https://github.com/SpamExperts/pyzor/blob/master/pyzor/digest.py>). >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +use Email::MIME::ContentType (); >>>>>> +use Encode (); >>>>>> + >>>>>> +our $VERSION = '0.03'; >>>>>> + >>>>>> +# each tuple is [ offset, length ] >>>>>> +use constant _HASH_SPEC => ( [ 20, 3 ], [ 60, 3 ] ); >>>>>> + >>>>>> +use constant { >>>>>> + _MIN_LINE_LENGTH => 8, >>>>>> + >>>>>> + _ATOMIC_NUM_LINES => 4, >>>>>> +}; >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +=head1 FUNCTIONS >>>>>> + >>>>>> +=head2 $strings_ar = digest_payloads( $EMAIL_MIME ) >>>>>> + >>>>>> +This imitates the corresponding object method in F<digest.py>. >>>>>> +It returns a reference to an array of strings. Each string can be either >>>>>> +a byte string or a character string (e.g., UTF-8 decoded). >>>>>> + >>>>>> +NB: RFC 2822 stipulates that message bodies should use CRLF >>>>>> +line breaks, not plain LF (nor plain CR). L<Email::MIME::Encodings> >>>>>> +will thus convert any plain CRs in a quoted-printable message >>>>>> +body into CRLF. Python, though, doesn???t do this, so the output of >>>>>> +our implementation of C<digest_payloads()> diverges from that of the >>>>>> Python >>>>>> +original. It doesn???t ultimately make a difference since the >>>>>> line-ending >>>>>> +whitespace gets trimmed regardless, but it???s necessary to factor in >>>>>> when >>>>>> +comparing the output of our implementation with the Python output. >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +sub digest_payloads { >>>>>> + my ($parsed) = @_; >>>>>> + >>>>>> + my @subparts = $parsed->subparts(); >>>>>> + >>>>>> + my @payloads; >>>>>> + >>>>>> + if (@subparts) { >>>>>> + @payloads = map { @{ digest_payloads($_) } } >>>>>> $parsed->subparts(); >>>>>> + } >>>>>> + else { >>>>>> + my ( $main_type, $subtype, $encoding, $encode_check ) = >>>>>> parse_content_type( $parsed->content_type() ); >>>>>> + >>>>>> + my $payload; >>>>>> + >>>>>> + if ( $main_type eq 'text' ) { >>>>>> + >>>>>> + # Decode transfer encoding, but leave us as a byte string. >>>>>> + # Note that this is where Email::MIME converts plain LF to >>>>>> CRLF. >>>>>> + $payload = $parsed->body(); >>>>>> + >>>>>> + # This does the actual character decoding (i.e., >>>>>> ???charset???). >>>>>> + $payload = Encode::decode( $encoding, $payload, >>>>>> $encode_check ); >>>>>> + >>>>>> + if ( $subtype eq 'html' ) { >>>>>> + require Mail::SpamAssassin::Pyzor::Digest::StripHtml; >>>>>> + $payload = >>>>>> Mail::SpamAssassin::Pyzor::Digest::StripHtml::strip($payload); >>>>>> + } >>>>>> + } >>>>>> + else { >>>>>> + >>>>>> + # This does no decoding, even of, e.g., quoted-printable or >>>>>> base64. >>>>>> + $payload = $parsed->body_raw(); >>>>>> + } >>>>>> + >>>>>> + push @payloads, $payload; >>>>>> + } >>>>>> + >>>>>> + return \@payloads; >>>>>> +} >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +=head2 normalize( $STRING ) >>>>>> + >>>>>> +This imitates the corresponding object method in F<digest.py>. >>>>>> +It modifies C<$STRING> in-place. >>>>>> + >>>>>> +As with the original implementation, if C<$STRING> contains (decoded) >>>>>> +Unicode characters, those characters will be parsed accordingly. So: >>>>>> + >>>>>> + $str = "123\xc2\xa0"; # [ c2 a0 ] == \u00a0, non-breaking space >>>>>> + >>>>>> + normalize($str); >>>>>> + >>>>>> +The above will leave C<$str> alone, but this: >>>>>> + >>>>>> + utf8::decode($str); >>>>>> + >>>>>> + normalize($str); >>>>>> + >>>>>> +??? will trim off the last two bytes from C<$str>. >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +sub normalize { ## no critic qw( Subroutines::RequireArgUnpacking ) >>>>>> + >>>>>> + # NULs are bad, mm-kay? >>>>>> + $_[0] =~ tr<\0><>d; >>>>>> + >>>>>> + # NB: Python???s \s without re.UNICODE is the same as Perl???s \s >>>>>> + # with the /a modifier. >>>>>> + # >>>>>> + # https://docs.python.org/2/library/re.html >>>>>> + # https://perldoc.perl.org/perlrecharclass.html#Backslash-sequences >>>>>> + >>>>>> + # Python: re.compile(r'\S{10,}') >>>>>> + $_[0] =~ s<\S{10,}><>ag; >>>>>> + >>>>>> + # Python: re.compile(r'\S+@\S+') >>>>>> + $_[0] =~ s<\S+ @ \S+><>agx; >>>>>> + >>>>>> + # Python: re.compile(r'[a-z]+:\S+', re.IGNORECASE) >>>>>> + $_[0] =~ s<[a-zA-Z]+ : \S+><>agx; >>>>>> + >>>>>> + # (from digest.py ???) >>>>>> + # Make sure we do the whitespace last because some of the previous >>>>>> + # patterns rely on whitespace. >>>>>> + $_[0] =~ tr< \x09-\x0d><>d; >>>>>> + >>>>>> + # This is fun. digest.py???s normalize() does a non-UNICODE >>>>>> whitespace >>>>>> + # strip, then calls strip() on the string, which *will* strip >>>>>> Unicode >>>>>> + # whitespace from the ends. >>>>>> + $_[0] =~ s<\A\s+><>; >>>>>> + $_[0] =~ s<\s+\z><>; >>>>>> + >>>>>> + return; >>>>>> +} >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +=head2 $yn = should_handle_line( $STRING ) >>>>>> + >>>>>> +This imitates the corresponding object method in F<digest.py>. >>>>>> +It returns a boolean. >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +sub should_handle_line { >>>>>> + return $_[0] && length( $_[0] ) >= _MIN_LINE_LENGTH(); >>>>>> +} >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +=head2 $sr = assemble_lines( \@LINES ) >>>>>> + >>>>>> +This assembles a string buffer out of @LINES. The string is the buffer >>>>>> +of octets that will be hashed to produce the message digest. >>>>>> + >>>>>> +Each member of @LINES is expected to be an B<octet string>, not a >>>>>> +character string. >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +sub assemble_lines { >>>>>> + my ($lines_ar) = @_; >>>>>> + >>>>>> + if ( @$lines_ar <= _ATOMIC_NUM_LINES() ) { >>>>>> + >>>>>> + # cf. handle_atomic() in digest.py >>>>>> + return \join( q<>, @$lines_ar ); >>>>>> + } >>>>>> + >>>>>> + >>>>>> #---------------------------------------------------------------------- >>>>>> + # cf. handle_atomic() in digest.py >>>>>> + >>>>>> + my $str = q<>; >>>>>> + >>>>>> + for my $ofs_len ( _HASH_SPEC() ) { >>>>>> + my ( $offset, $length ) = @$ofs_len; >>>>>> + >>>>>> + for my $i ( 0 .. ( $length - 1 ) ) { >>>>>> + my $idx = int( $offset * @$lines_ar / 100 ) + $i; >>>>>> + >>>>>> + next if !defined $lines_ar->[$idx]; >>>>>> + >>>>>> + $str .= $lines_ar->[$idx]; >>>>>> + } >>>>>> + } >>>>>> + >>>>>> + return \$str; >>>>>> +} >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +=head2 ($main, $sub, $encoding, $checkval) = parse_content_type( >>>>>> $CONTENT_TYPE ) >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +use constant _QUOTED_PRINTABLE_NAMES => ( >>>>>> + "quopri-codec", >>>>>> + "quopri", >>>>>> + "quoted-printable", >>>>>> + "quotedprintable", >>>>>> +); >>>>>> + >>>>>> +# Make Encode::decode() ignore anything that doesn???t fit the >>>>>> +# given encoding. >>>>>> +use constant _encode_check_ignore => q<>; >>>>>> + >>>>>> +sub parse_content_type { >>>>>> + my ($content_type) = @_; >>>>>> + >>>>>> + $Email::MIME::ContentType::STRICT_PARAMS = 0; >>>>>> + my $ct_parse = Email::MIME::ContentType::parse_content_type( >>>>>> + $content_type, >>>>>> + ); >>>>>> + >>>>>> + my $main = $ct_parse->{'type'} || q<>; >>>>>> + my $sub = $ct_parse->{'subtype'} || q<>; >>>>>> + >>>>>> + my $encoding = $ct_parse->{'attributes'}{'charset'}; >>>>>> + >>>>>> + my $checkval; >>>>>> + >>>>>> + if ($encoding) { >>>>>> + >>>>>> + # Lower-case everything, convert underscore to dash, and remove >>>>>> NUL. >>>>>> + $encoding =~ tr<A-Z_\0><a-z->d; >>>>>> + >>>>>> + # Apparently pyzor accommodates messages that put the transfer >>>>>> + # encoding in the Content-Type. >>>>>> + if ( grep { $_ eq $encoding } _QUOTED_PRINTABLE_NAMES() ) { >>>>>> + $checkval = Encode::FB_CROAK(); >>>>>> + } >>>>>> + } >>>>>> + else { >>>>>> + $encoding = 'ascii'; >>>>>> + } >>>>>> + >>>>>> + # Match Python .decode()???s 'ignore' behavior >>>>>> + $checkval ||= \&_encode_check_ignore; >>>>>> + >>>>>> + return ( $main, $sub, $encoding, $checkval ); >>>>>> +} >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +=head2 @lines = splitlines( $TEXT ) >>>>>> + >>>>>> +Imitates C<str.splitlines()>. (cf. C<pydoc str>) >>>>>> + >>>>>> +Returns a plain list in list context. Returns the number of >>>>>> +items to be returned in scalar context. >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +sub splitlines { >>>>>> + return split m<\r\n?|\n>, $_[0]; >>>>>> +} >>>>>> + >>>>>> +1; >>>>>> diff --git a/lib/Mail/SpamAssassin/Pyzor/Digest/StripHtml.pm >>>>>> b/lib/Mail/SpamAssassin/Pyzor/Digest/StripHtml.pm >>>>>> new file mode 100644 >>>>>> index 0000000..2617b4a >>>>>> --- /dev/null >>>>>> +++ b/lib/Mail/SpamAssassin/Pyzor/Digest/StripHtml.pm >>>>>> @@ -0,0 +1,177 @@ >>>>>> +package Mail::SpamAssassin::Pyzor::Digest::StripHtml; >>>>>> + >>>>>> +# Copyright 2018 cPanel, LLC. >>>>>> +# All rights reserved. >>>>>> +# http://cpanel.net >>>>>> +# >>>>>> +# <@LICENSE> >>>>>> +# Licensed to the Apache Software Foundation (ASF) under one or more >>>>>> +# contributor license agreements. See the NOTICE file distributed with >>>>>> +# this work for additional information regarding copyright ownership. >>>>>> +# The ASF licenses this file to you under the Apache License, Version >>>>>> 2.0 >>>>>> +# (the "License"); you may not use this file except in compliance with >>>>>> +# the License. You may obtain a copy of the License at: >>>>>> +# >>>>>> +# http://www.apache.org/licenses/LICENSE-2.0 >>>>>> +# >>>>>> +# Unless required by applicable law or agreed to in writing, software >>>>>> +# distributed under the License is distributed on an "AS IS" BASIS, >>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>>>>> implied. >>>>>> +# See the License for the specific language governing permissions and >>>>>> +# limitations under the License. >>>>>> +# </@LICENSE> >>>>>> +# >>>>>> + >>>>>> +use strict; >>>>>> +use warnings; >>>>>> + >>>>>> +=encoding utf-8 >>>>>> + >>>>>> +=head1 NAME >>>>>> + >>>>>> +Mail::SpamAssassin::Pyzor::Digest::StripHtml >>>>>> + >>>>>> +=head1 SYNOPSIS >>>>>> + >>>>>> + my $stripped = >>>>>> Mail::SpamAssassin::Pyzor::Digest::StripHtml::strip($html); >>>>>> + >>>>>> +=head1 DESCRIPTION >>>>>> + >>>>>> +This module attempts to duplicate pyzor???s HTML-stripping logic. >>>>>> + >>>>>> +=head1 ACCURACY >>>>>> + >>>>>> +This library cannot achieve 100%, bug-for-bug parity with pyzor >>>>>> +because to do so would require duplicating Python???s own HTML parsing >>>>>> +library. Since that library???s output has changed over time, and those >>>>>> +changes in turn affect pyzor, it???s literally impossible to arrive at >>>>>> +a single, fully-compatible reimplementation. >>>>>> + >>>>>> +That said, all known divergences between pyzor and this library involve >>>>>> +invalid HTML as input. >>>>>> + >>>>>> +Please open bug reports for any divergences you identify, particularly >>>>>> +if the input is valid HTML. >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +use HTML::Parser (); >>>>>> + >>>>>> +our $VERSION = '0.03'; >>>>>> + >>>>>> +#---------------------------------------------------------------------- >>>>>> + >>>>>> +=head1 FUNCTIONS >>>>>> + >>>>>> +=head2 $stripped = strip( $HTML ) >>>>>> + >>>>>> +Give it some HTML, and it???ll give back the stripped text. >>>>>> + >>>>>> +In B<general>, the stripping consists of removing tags as well as >>>>>> +C<E<lt>scriptE<gt>> and C<E<lt>styleE<gt>> elements; however, it also >>>>>> +removes HTML entities. >>>>>> + >>>>>> +This tries very hard to duplicate pyzor???s behavior with invalid HTML. >>>>>> + >>>>>> +=cut >>>>>> + >>>>>> +sub strip { >>>>>> + my ($html) = @_; >>>>>> + >>>>>> + $html =~ s<\A\s+><>; >>>>>> + $html =~ s<\s+\z><>; >>>>>> + >>>>>> + my $p = HTML::Parser->new( api_version => 3 ); >>>>>> + >>>>>> + my @pieces; >>>>>> + >>>>>> + my $accumulate = 1; >>>>>> + >>>>>> + $p->handler( >>>>>> + start => sub { >>>>>> + my ($tagname) = @_; >>>>>> + >>>>>> + $accumulate = 0 if $tagname eq 'script'; >>>>>> + $accumulate = 0 if $tagname eq 'style'; >>>>>> + >>>>>> + return; >>>>>> + }, >>>>>> + 'tagname', >>>>>> + ); >>>>>> + >>>>>> + $p->handler( >>>>>> + end => sub { >>>>>> + $accumulate = 1; >>>>>> + return; >>>>>> + } >>>>>> + ); >>>>>> + >>>>>> + $p->handler( >>>>>> + text => sub { >>>>>> + my ($copy) = @_; >>>>>> + >>>>>> + return if !$accumulate; >>>>>> + >>>>>> + # pyzor???s HTML parser discards HTML entities. On top of >>>>>> that, >>>>>> + # we need to match, as closely as possible, pyzor???s >>>>>> handling of >>>>>> + # invalid HTML entities ??? which is a function of >>>>>> Python???s >>>>>> + # standard HTML parsing library. This will probably never be >>>>>> + # fully compatible with the pyzor, but we can get it close. >>>>>> + >>>>>> + # The original is: >>>>>> + # >>>>>> + # re.compile('&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]') >>>>>> + # >>>>>> + # The parsing loop then ???backs up??? one byte if the last >>>>>> + # character isn???t a ???;???. We use a look-ahead >>>>>> assertion to >>>>>> + # mimic that behavior. >>>>>> + $copy =~ s<\&\# (?:[0-9]+ | [xX][0-9a-fA-F]+) (?: ; | \z | >>>>>> (?=[^0-9a-fA-F]) )>< >gx; >>>>>> + >>>>>> + # The original is: >>>>>> + # >>>>>> + # re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]') >>>>>> + # >>>>>> + # We again use a look-ahead assertion to mimic Python. >>>>>> + $copy =~ s<\& [a-zA-Z] [-.a-zA-Z0-9]* (?: ; | \z | >>>>>> (?=[^a-zA-Z0-9]) )>< >gx; >>>>>> + >>>>>> + # Python???s HTMLParser aborts its parsing loop when it >>>>>> encounters >>>>>> + # an invalid numeric reference. >>>>>> + $copy =~ s<\&\# >>>>>> + (?: >>>>>> + [^0-9xX] # anything but the expected first >>>>>> char >>>>>> + | >>>>>> + [0-9]+[a-fA-F] # hex within decimal >>>>>> + | >>>>>> + [xX][^0-9a-fA-F] >>>>>> + ) >>>>>> + (.*) >>>>>> + >< >>>>>> + ( -1 == index($1, ';') ) ? q<> : '&#' >>>>>> + >exs; >>>>>> + >>>>>> + # Python???s HTMLParser treats invalid entities as >>>>>> incomplete >>>>>> + $copy =~ s<(\&\#?)><$1 >gx; >>>>>> + >>>>>> + $copy =~ s<\A\s+><>; >>>>>> + $copy =~ s<\s+\z><>; >>>>>> + >>>>>> + push @pieces, \$copy if length $copy; >>>>>> + }, >>>>>> + 'text,tagname', >>>>>> + ); >>>>>> + >>>>>> + $p->parse($html); >>>>>> + $p->eof(); >>>>>> + >>>>>> + my $payload = join( q< >, map { $$_ } @pieces ); >>>>>> + >>>>>> + # Convert all sequences of whitespace OTHER THAN non-breaking >>>>>> spaces to >>>>>> + # plain spaces. >>>>>> + $payload =~ s<[^\S\x{a0}]+>< >g; >>>>>> + >>>>>> + return $payload; >>>>>> +} >>>>>> + >>>>>> +1; >>>>>> diff --git a/t/pyzor.t b/t/pyzor.t >>>>>> index 891f38d..e4ef83f 100755 >>>>>> --- a/t/pyzor.t >>>>>> +++ b/t/pyzor.t >>>>>> @@ -3,12 +3,9 @@ >>>>>> use lib '.'; use lib 't'; >>>>>> use SATest; sa_t_init("pyzor"); >>>>>> -use constant HAS_PYZOR => eval { $_ = untaint_cmd("which pyzor"); >>>>>> chomp; -x }; >>>>>> - >>>>>> use Test::More; >>>>>> plan skip_all => "Net tests disabled" unless >>>>>> conf_bool('run_net_tests'); >>>>>> -plan skip_all => "Pyzor executable not found in path" unless HAS_PYZOR; >>>>>> -plan tests => 8; >>>>>> +plan tests => 5; >>>>>> diag('Note: Failures may not be an SpamAssassin bug, as Pyzor tests >>>>>> can fail due to problems with the Pyzor servers.'); >>>>>> @@ -30,7 +27,7 @@ tstprefs (" >>>>>> sarun ("-t < data/spam/pyzor", \&patterns_run_cb); >>>>>> ok_all_patterns(); >>>>>> # Same with fork >>>>>> -sarun ("--cf='pyzor_fork 1' -t < data/spam/pyzor", \&patterns_run_cb); >>>>>> +sarun ("-t < data/spam/pyzor", \&patterns_run_cb); >>>>>> ok_all_patterns(); >>>>>> #TESTING FOR HAM >>>>>> @@ -44,7 +41,3 @@ ok_all_patterns(); >>>>>> sarun ("-D pyzor -t < data/nice/001 2>&1", \&patterns_run_cb); >>>>>> ok_all_patterns(); >>>>>> -# same with fork >>>>>> -sarun ("-D pyzor --cf='pyzor_fork 1' -t < data/nice/001 2>&1", >>>>>> \&patterns_run_cb); >>>>>> -ok_all_patterns(); >>>>>> - >>>>> >>> >> -- >> Kevin A. McGrail >> kmcgr...@apache.org >> >> Member, Apache Software Foundation >> Chair Emeritus Apache SpamAssassin Project >> https://www.linkedin.com/in/kmcgrail - 703.798.0171
OpenPGP_signature
Description: OpenPGP digital signature