I removed the IO::SigGuard dependency from my diff, and I will work on removing
Email::MIME as well.
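
For anyone reviewing the quoted diff below, here is a minimal standalone sketch
of the hit/whitelist decision that the new pyzor_lookup makes. The function
name pyzor_decision and its option keys are hypothetical and exist only for
illustration; the defaults mirror the documented settings (pyzor_count_min 5,
pyzor_whitelist_min 10, pyzor_whitelist_factor 0.2), and the hardcoded limits
are the ones used in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the patch: returns 1 for a Pyzor hit and
# 0 otherwise, given the report count and whitelist count from the server.
sub pyzor_decision {
    my ($count, $wl_count, %opt) = @_;
    my $count_min = $opt{count_min}        // 5;
    my $wl_min    = $opt{whitelist_min}    // 10;
    my $wl_factor = $opt{whitelist_factor} // 0.2;

    # Empty bodies etc. produce the same hash, so skip very large numbers.
    return 0 if $count >= 1_000_000 || $wl_count >= 10_000;

    # The whitelist is only considered once it reaches whitelist_min; it
    # must then also reach count * whitelist_factor to override the reports.
    my $wl_limit = $wl_count >= $wl_min ? $count * $wl_factor : 0;
    return 0 if $wl_limit && $wl_count >= $wl_limit;

    return $count >= $count_min ? 1 : 0;
}

print pyzor_decision(50, 12), "\n";   # 12 whitelistings vs. a limit of 50 * 0.2
print pyzor_decision(50, 9),  "\n";   # whitelist count below whitelist_min
print pyzor_decision(3, 0),   "\n";   # report count below count_min
```

With the defaults, the first call is treated as whitelisted, the second is a
hit, and the third is not a hit because too few reports exist.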
 Giovanni

On 10/17/21 16:58, Henrik K wrote:
> 
> At least these seem to be completely unneeded module dependencies.
> 
> IO::SigGuard (not even found in Ubuntu packages)
> Email::MIME
> 
> So the code should be refactored to use SA methods as necessary.
> 
> 
> On Sat, Oct 16, 2021 at 11:06:07PM -0400, Kevin A. McGrail wrote:
>> No worries there that I know of.
>>
>> cPanel has the CCLA paperwork on file, and several of its people have ICLAs
>> as well.  They've given us permission to commit the code too.
>>
>> I think it will be better than any dependency on external binaries.
>>
>> Regards,
>>
>> KAM
>>
>> On 10/14/2021 10:37 AM, Henrik K wrote:
>>> If that's the case, I probably wouldn't have any objections.  I'm not sure
>>> whether it requires a Contributor License Agreement on cPanel's part (maybe
>>> they already have one), and I guess at least a bug to make it official.
>>> Sidney or KAM can probably chime in on the admin side.
>>>
>>>
>>> On Thu, Oct 14, 2021 at 04:32:53PM +0200, Giovanni Bechis wrote:
>>>> Once committed, the code will no longer be developed by cPanel on CPAN,
>>>> and the original code will be removed.
>>>>
>>>> I can work to integrate old and new Pyzor versions.
>>>>
>>>>   Giovanni
>>>>
>>>> On Thu, Oct 14, 2021 at 05:27:16PM +0300, Henrik K wrote:
>>>>> If it's developed by cPanel on CPAN, then it should not be committed to SA
>>>>> unless it's clearly donated to SpamAssassin and removed from CPAN, and
>>>>> assuming we have the developer resources and the will to take it aboard.
>>>>>
>>>>> As it is, Plugin/Pyzor.pm should have an option to choose which one to use,
>>>>> as it makes no sense to ditch support for the widely installed original
>>>>> Pyzor.
>>>>>
>>>>>
>>>>> On Thu, Oct 14, 2021 at 04:15:13PM +0200, Giovanni Bechis wrote:
>>>>>> Hi,
>>>>>> cPanel has developed a native Perl Pyzor implementation for SpamAssassin
>>>>>> and a diff against SpamAssassin 4.0 follows.
>>>>>> At the moment I am using it in production on a small server; more tests and
>>>>>> opinions are welcome.
>>>>>>
>>>>>> Original cPanel code is at https://metacpan.org/pod/Mail::Pyzor.
>>>>>>
>>>>>>   Cheers
>>>>>>    Giovanni
>>>>>>
>>>>>> diff --git a/MANIFEST b/MANIFEST
>>>>>> index 25d0192..2d9588c 100644
>>>>>> --- a/MANIFEST
>>>>>> +++ b/MANIFEST
>>>>>> @@ -126,6 +126,11 @@ lib/Mail/SpamAssassin/Plugin/WLBLEval.pm
>>>>>>   lib/Mail/SpamAssassin/Plugin/WhiteListSubject.pm
>>>>>>   lib/Mail/SpamAssassin/PluginHandler.pm
>>>>>>   lib/Mail/SpamAssassin/Plugin/URILocalBL.pm
>>>>>> +lib/Mail/SpamAssassin/Pyzor/Client.pm
>>>>>> +lib/Mail/SpamAssassin/Pyzor/Digest/Pieces.pm
>>>>>> +lib/Mail/SpamAssassin/Pyzor/Digest/StripHtml.pm
>>>>>> +lib/Mail/SpamAssassin/Pyzor/Digest.pm
>>>>>> +lib/Mail/SpamAssassin/Pyzor.pm
>>>>>>   lib/Mail/SpamAssassin/RegistryBoundaries.pm
>>>>>>   lib/Mail/SpamAssassin/Reporter.pm
>>>>>>   lib/Mail/SpamAssassin/SQLBasedAddrList.pm
>>>>>> diff --git a/lib/Mail/SpamAssassin/Plugin/Pyzor.pm b/lib/Mail/SpamAssassin/Plugin/Pyzor.pm
>>>>>> index 3efd4b4..e4c9c05 100644
>>>>>> --- a/lib/Mail/SpamAssassin/Plugin/Pyzor.pm
>>>>>> +++ b/lib/Mail/SpamAssassin/Plugin/Pyzor.pm
>>>>>> @@ -36,17 +36,13 @@ package Mail::SpamAssassin::Plugin::Pyzor;
>>>>>>   use Mail::SpamAssassin::Plugin;
>>>>>>   use Mail::SpamAssassin::Logger;
>>>>>> -use Mail::SpamAssassin::Timeout;
>>>>>> -use Mail::SpamAssassin::Util qw(untaint_var untaint_file_path
>>>>>> -                                proc_status_ok exit_status_str);
>>>>>> +use Mail::SpamAssassin::Util qw(untaint_var);
>>>>>> +
>>>>>>   use strict;
>>>>>>   use warnings;
>>>>>>   # use bytes;
>>>>>>   use re 'taint';
>>>>>> -use Storable;
>>>>>> -use POSIX qw(PIPE_BUF WNOHANG _exit);
>>>>>> -
>>>>>>   our @ISA = qw(Mail::SpamAssassin::Plugin);
>>>>>>   sub new {
>>>>>> @@ -78,7 +74,7 @@ sub set_config {
>>>>>>     my ($self, $conf) = @_;
>>>>>>     my @cmds;
>>>>>> -=head1 USER OPTIONS
>>>>>> +=head1 ADMINISTRATOR OPTIONS
>>>>>>   =over 4
>>>>>> @@ -95,22 +91,7 @@ Whether to use Pyzor, if it is available.
>>>>>>       type => $Mail::SpamAssassin::Conf::CONF_TYPE_BOOL
>>>>>>     });
>>>>>> -=item pyzor_fork (0|1)          (default: 0)
>>>>>> -
>>>>>> -Instead of running Pyzor synchronously, fork separate process for it and
>>>>>> -read the results in later (similar to async DNS lookups).  Increases
>>>>>> -throughput.  Experimental.
>>>>>> -
>>>>>> -=cut
>>>>>> -
>>>>>> -  push(@cmds, {
>>>>>> -    setting => 'pyzor_fork',
>>>>>> -    is_admin => 1,
>>>>>> -    default => 0,
>>>>>> -    type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC,
>>>>>> -  });
>>>>>> -
>>>>>> -=item pyzor_count_min NUMBER    (default: 5)
>>>>>> +=item pyzor_count_min NUMBER            (default: 5)
>>>>>>   This option sets how often a message's body checksum must have been
>>>>>>   reported to the Pyzor server before SpamAssassin will consider the Pyzor
>>>>>> @@ -128,54 +109,8 @@ set this to a relatively low value, e.g. C<5>.
>>>>>>       type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC
>>>>>>     });
>>>>>> -  # Deprecated setting, the name makes no sense!
>>>>>> -  push (@cmds, {
>>>>>> -    setting => 'pyzor_max',
>>>>>> -    is_admin => 1,
>>>>>> -    type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC,
>>>>>> -    code => sub {
>>>>>> -      my ($self, $key, $value, $line) = @_;
>>>>>> -      warn("deprecated setting used, change pyzor_max to pyzor_count_min\n");
>>>>>> -      if ($value !~ /^\d+$/) {
>>>>>> -        return $Mail::SpamAssassin::Conf::INVALID_VALUE;
>>>>>> -      }
>>>>>> -      $self->{pyzor_count_min} = $value;
>>>>>> -    }
>>>>>> -  });
>>>>>> -
>>>>>> -=item pyzor_whitelist_min NUMBER        (default: 10)
>>>>>> -
>>>>>> -This option sets how often a message's body checksum must have been
>>>>>> -whitelisted to the Pyzor server for SpamAssassin to consider ignoring the
>>>>>> -result.  Final decision is made by pyzor_whitelist_factor.
>>>>>> -
>>>>>> -=cut
>>>>>> -
>>>>>> -  push (@cmds, {
>>>>>> -    setting => 'pyzor_whitelist_min',
>>>>>> -    is_admin => 1,
>>>>>> -    default => 10,
>>>>>> -    type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC
>>>>>> -  });
>>>>>> -
>>>>>> -=item pyzor_whitelist_factor NUMBER     (default: 0.2)
>>>>>> -
>>>>>> -Ignore Pyzor result if REPORTCOUNT x NUMBER >= pyzor_whitelist_min.
>>>>>> -For default setting this means: 50 reports requires 10 whitelistings.
>>>>>> -
>>>>>> -=cut
>>>>>> -
>>>>>> -  push (@cmds, {
>>>>>> -    setting => 'pyzor_whitelist_factor',
>>>>>> -    is_admin => 1,
>>>>>> -    default => 0.2,
>>>>>> -    type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC
>>>>>> -  });
>>>>>> -
>>>>>>   =back
>>>>>> -=head1 ADMINISTRATOR OPTIONS
>>>>>> -
>>>>>>   =over 4
>>>>>>   =item pyzor_timeout n          (default: 5)
>>>>>> @@ -210,478 +145,182 @@ removing one of them.
>>>>>>       type => $Mail::SpamAssassin::Conf::CONF_TYPE_DURATION
>>>>>>     });
>>>>>> -=item pyzor_options options
>>>>>> +=item pyzor_whitelist_min NUMBER        (default: 10)
>>>>>> -Specify additional options to the pyzor(1) command. Please note that only
>>>>>> -characters in the range [0-9A-Za-z =,._/-] are allowed for security reasons.
>>>>>> +This option sets how often a message's body checksum must have been
>>>>>> +whitelisted to the Pyzor server for SpamAssassin to consider ignoring the
>>>>>> +result.  Final decision is made by pyzor_whitelist_factor.
>>>>>>   =cut
>>>>>>     push (@cmds, {
>>>>>> -    setting => 'pyzor_options',
>>>>>> +    setting => 'pyzor_whitelist_min',
>>>>>>       is_admin => 1,
>>>>>> -    default => '',
>>>>>> -    type => $Mail::SpamAssassin::Conf::CONF_TYPE_STRING,
>>>>>> -    code => sub {
>>>>>> -      my ($self, $key, $value, $line) = @_;
>>>>>> -      if ($value !~ m{^([0-9A-Za-z =,._/-]+)$}) {
>>>>>> -        return $Mail::SpamAssassin::Conf::INVALID_VALUE;
>>>>>> -      }
>>>>>> -      $self->{pyzor_options} = $1;
>>>>>> -    }
>>>>>> +    default => 10,
>>>>>> +    type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC
>>>>>>     });
>>>>>> -=item pyzor_path STRING
>>>>>> +=item pyzor_whitelist_factor NUMBER     (default: 0.2)
>>>>>> -This option tells SpamAssassin specifically where to find the C<pyzor>
>>>>>> -client instead of relying on SpamAssassin to find it in the current
>>>>>> -PATH.  Note that if I<taint mode> is enabled in the Perl interpreter,
>>>>>> -you should use this, as the current PATH will have been cleared.
>>>>>> +Ignore Pyzor result if REPORTCOUNT x NUMBER >= pyzor_whitelist_min.
>>>>>> +For default setting this means: 50 reports requires 10 whitelistings.
>>>>>>   =cut
>>>>>>     push (@cmds, {
>>>>>> -    setting => 'pyzor_path',
>>>>>> +    setting => 'pyzor_whitelist_factor',
>>>>>>       is_admin => 1,
>>>>>> -    default => undef,
>>>>>> -    type => $Mail::SpamAssassin::Conf::CONF_TYPE_STRING,
>>>>>> -    code => sub {
>>>>>> -      my ($self, $key, $value, $line) = @_;
>>>>>> -      if (!defined $value || !length $value) {
>>>>>> -        return $Mail::SpamAssassin::Conf::MISSING_REQUIRED_VALUE;
>>>>>> -      }
>>>>>> -      $value = untaint_file_path($value);
>>>>>> -      if (!-x $value) {
>>>>>> -        info("config: pyzor_path \"$value\" isn't an executable");
>>>>>> -        return $Mail::SpamAssassin::Conf::INVALID_VALUE;
>>>>>> -      }
>>>>>> -
>>>>>> -      $self->{pyzor_path} = $value;
>>>>>> -    }
>>>>>> +    default => 0.2,
>>>>>> +    type => $Mail::SpamAssassin::Conf::CONF_TYPE_NUMERIC
>>>>>>     });
>>>>>>     $conf->{parser}->register_commands(\@cmds);
>>>>>>   }
>>>>>>   sub is_pyzor_available {
>>>>>> -  my ($self) = @_;
>>>>>> +    my ($self) = @_;
>>>>>> -  my $pyzor = $self->{main}->{conf}->{pyzor_path} ||
>>>>>> -    Mail::SpamAssassin::Util::find_executable_in_env_path('pyzor');
>>>>>> -
>>>>>> -  unless ($pyzor && -x $pyzor) {
>>>>>> -    dbg("pyzor: no pyzor executable found");
>>>>>> -    $self->{pyzor_available} = 0;
>>>>>> -    return 0;
>>>>>> -  }
>>>>>> -
>>>>>> -  # remember any found pyzor
>>>>>> -  $self->{main}->{conf}->{pyzor_path} = $pyzor;
>>>>>> -
>>>>>> -  dbg("pyzor: pyzor is available: $pyzor");
>>>>>> -  return 1;
>>>>>> +    local $@;
>>>>>> +    eval {
>>>>>> +        require Mail::SpamAssassin::Pyzor::Digest;
>>>>>> +        require Mail::SpamAssassin::Pyzor::Client;
>>>>>> +    };
>>>>>> +    return $@ ? 0 : 1;
>>>>>>   }
>>>>>> -sub finish_parsing_start {
>>>>>> -  my ($self, $opts) = @_;
>>>>>> +sub get_pyzor_interface {
>>>>>> +  my ($self) = @_;
>>>>>> -  # If forking, hard adjust priority -100 to launch early
>>>>>> -  # Find rulenames from eval_to_rule mappings
>>>>>> -  if ($opts->{conf}->{pyzor_fork}) {
>>>>>> -    foreach (@{$opts->{conf}->{eval_to_rule}->{check_pyzor}}) {
>>>>>> -      dbg("pyzor: adjusting rule $_ priority to -100");
>>>>>> -      $opts->{conf}->{priority}->{$_} = -100;
>>>>>> -    }
>>>>>> +  if (!$self->{main}->{conf}->{use_pyzor}) {
>>>>>> +    dbg("pyzor: use_pyzor option not enabled, disabling Pyzor");
>>>>>> +    $self->{pyzor_interface} = "disabled";
>>>>>> +    $self->{pyzor_available} = 0;
>>>>>> +  }
>>>>>> +  elsif ($self->is_pyzor_available()) {
>>>>>> +    $self->{pyzor_interface} = "pyzor";
>>>>>> +    $self->{pyzor_available} = 1;
>>>>>> +  }
>>>>>> +  else {
>>>>>> +    dbg("pyzor: no pyzor found, disabling Pyzor");
>>>>>> +    $self->{pyzor_available} = 0;
>>>>>>     }
>>>>>>   }
>>>>>>   sub check_pyzor {
>>>>>> -  my ($self, $pms, $full) = @_;
>>>>>> -
>>>>>> -  return 0 if !$self->{pyzor_available};
>>>>>> -  return 0 if !$self->{main}->{conf}->{use_pyzor};
>>>>>> -
>>>>>> -  return 0 if $pms->{pyzor_running};
>>>>>> -  $pms->{pyzor_running} = 1;
>>>>>> -
>>>>>> -  return 0 if !$self->is_pyzor_available();
>>>>>> -
>>>>>> -  my $timer = $self->{main}->time_method("check_pyzor");
>>>>>> +  my ($self, $permsgstatus, $full) = @_;
>>>>>>     # initialize valid tags
>>>>>> -  $pms->{tag_data}->{PYZOR} = '';
>>>>>> -
>>>>>> -  # create fulltext tmpfile now (before possible forking)
>>>>>> -  $pms->{pyzor_tmpfile} = $pms->create_fulltext_tmpfile();
>>>>>> -
>>>>>> -  ## non-forking method
>>>>>> -
>>>>>> -  if (!$self->{main}->{conf}->{pyzor_fork}) {
>>>>>> -    my @results = $self->pyzor_lookup($pms);
>>>>>> -    return $self->_check_result($pms, \@results);
>>>>>> -  }
>>>>>> -
>>>>>> -  ## forking method
>>>>>> -
>>>>>> -  $pms->{pyzor_rulename} = $pms->get_current_eval_rule_name();
>>>>>> -  $pms->rule_pending($pms->{pyzor_rulename}); # mark async
>>>>>> -
>>>>>> -  # create socketpair for communication
>>>>>> -  $pms->{pyzor_backchannel} = Mail::SpamAssassin::SubProcBackChannel->new();
>>>>>> -  my $back_selector = '';
>>>>>> -  $pms->{pyzor_backchannel}->set_selector(\$back_selector);
>>>>>> -  eval {
>>>>>> -    $pms->{pyzor_backchannel}->setup_backchannel_parent_pre_fork();
>>>>>> -  } or do {
>>>>>> -    dbg("pyzor: backchannel pre-setup failed: $@");
>>>>>> -    delete $pms->{pyzor_backchannel};
>>>>>> -    return 0;
>>>>>> -  };
>>>>>> +  $permsgstatus->{tag_data}->{PYZOR} = "";
>>>>>> -  my $pid = fork();
>>>>>> -  if (!defined $pid) {
>>>>>> -    info("pyzor: child fork failed: $!");
>>>>>> -    delete $pms->{pyzor_backchannel};
>>>>>> -    return 0;
>>>>>> -  }
>>>>>> -  if (!$pid) {
>>>>>> -    $0 = "$0 (pyzor)";
>>>>>> -    $SIG{CHLD} = 'DEFAULT';
>>>>>> -    $SIG{PIPE} = 'IGNORE';
>>>>>> -    $SIG{$_} = sub {
>>>>>> -      eval { dbg("pyzor: child process $$ caught signal $_[0]"); };
>>>>>> -      _exit(6);  # avoid END and destructor processing
>>>>>> -      kill('KILL',$$);  # still kicking? die!
>>>>>> -      } foreach qw(INT HUP TERM TSTP QUIT USR1 USR2);
>>>>>> -    dbg("pyzor: child process $$ forked");
>>>>>> -    $pms->{pyzor_backchannel}->setup_backchannel_child_post_fork();
>>>>>> -    my @results = $self->pyzor_lookup($pms);
>>>>>> -    my $backmsg;
>>>>>> -    eval {
>>>>>> -      $backmsg = Storable::freeze(\@results);
>>>>>> -    };
>>>>>> -    if ($@) {
>>>>>> -      dbg("pyzor: child return value freeze failed: $@");
>>>>>> -      _exit(0); # avoid END and destructor processing
>>>>>> -    }
>>>>>> -    if (!syswrite($pms->{pyzor_backchannel}->{parent}, $backmsg)) {
>>>>>> -      dbg("pyzor: child backchannel write failed: $!");
>>>>>> -    }
>>>>>> -    _exit(0); # avoid END and destructor processing
>>>>>> -  }
>>>>>> -
>>>>>> -  $pms->{pyzor_pid} = $pid;
>>>>>> +  my $timer = $self->{main}->time_method("check_pyzor");
>>>>>> -  eval {
>>>>>> -    $pms->{pyzor_backchannel}->setup_backchannel_parent_post_fork($pid);
>>>>>> -  } or do {
>>>>>> -    dbg("pyzor: backchannel post-setup failed: $@");
>>>>>> -    delete $pms->{pyzor_backchannel};
>>>>>> -    return 0;
>>>>>> -  };
>>>>>> +  $self->get_pyzor_interface();
>>>>>> +  return 0 unless $self->{pyzor_available};
>>>>>> -  return 0;
>>>>>> +  return $self->pyzor_lookup($permsgstatus, $full);
>>>>>>   }
>>>>>>   sub pyzor_lookup {
>>>>>> -  my ($self, $pms) = @_;
>>>>>> -
>>>>>> -  my $conf = $self->{main}->{conf};
>>>>>> -  my $timeout = $conf->{pyzor_timeout};
>>>>>> -
>>>>>> -  # note: not really tainted, this came from system configuration file
>>>>>> -  my $path = untaint_file_path($conf->{pyzor_path});
>>>>>> -  my $opts = untaint_var($conf->{pyzor_options}) || '';
>>>>>> -
>>>>>> -  $pms->enter_helper_run_mode();
>>>>>> -
>>>>>> -  my $pid;
>>>>>> -  my @resp;
>>>>>> -  my $timer = Mail::SpamAssassin::Timeout->new(
>>>>>> -           { secs => $timeout, deadline => $pms->{master_deadline} });
>>>>>> -  my $err = $timer->run_and_catch(sub {
>>>>>> -    local $SIG{PIPE} = sub { die "__brokenpipe__ignore__\n" };
>>>>>> -
>>>>>> -    dbg("pyzor: opening pipe: ".
>>>>>> -      join(' ', $path, $opts, "check", "<".$pms->{pyzor_tmpfile}));
>>>>>> -
>>>>>> -    $pid = Mail::SpamAssassin::Util::helper_app_pipe_open(*PYZOR,
>>>>>> -        $pms->{pyzor_tmpfile}, 1, $path, split(' ', $opts), "check");
>>>>>> -    $pid or die "$!\n";
>>>>>> -
>>>>>> -    # read+split avoids a Perl I/O bug (Bug 5985)
>>>>>> -    my($inbuf, $nread);
>>>>>> -    my $resp = '';
>>>>>> -    while ($nread = read(PYZOR, $inbuf, 8192)) { $resp .= $inbuf }
>>>>>> -    defined $nread  or die "error reading from pipe: $!";
>>>>>> -    @resp = split(/^/m, $resp, -1);
>>>>>> -
>>>>>> -    my $errno = 0;
>>>>>> -    close PYZOR or $errno = $!;
>>>>>> -    if (proc_status_ok($?, $errno)) {
>>>>>> -      dbg("pyzor: [%s] finished successfully", $pid);
>>>>>> -    } elsif (proc_status_ok($?, $errno, 0, 1)) {  # sometimes it exits with 1
>>>>>> -      dbg("pyzor: [%s] finished: %s", $pid, exit_status_str($?, $errno));
>>>>>> -    } else {
>>>>>> -      info("pyzor: [%s] error: %s", $pid, exit_status_str($?, $errno));
>>>>>> -    }
>>>>>> -
>>>>>> -  });
>>>>>> -
>>>>>> -  if (defined(fileno(*PYZOR))) {  # still open
>>>>>> -    if ($pid) {
>>>>>> -      if (kill('TERM', $pid)) {
>>>>>> -        dbg("pyzor: killed stale helper [$pid]");
>>>>>> -      } else {
>>>>>> -        dbg("pyzor: killing helper application [$pid] failed: $!");
>>>>>> -      }
>>>>>> -    }
>>>>>> -    my $errno = 0;
>>>>>> -    close PYZOR or $errno = $!;
>>>>>> -    proc_status_ok($?, $errno)
>>>>>> -      or info("pyzor: [%s] error: %s", $pid, exit_status_str($?, $errno));
>>>>>> -  }
>>>>>> -
>>>>>> -  $pms->leave_helper_run_mode();
>>>>>> -
>>>>>> -  if ($timer->timed_out()) {
>>>>>> -    dbg("pyzor: check timed out after $timeout seconds");
>>>>>> -    return ();
>>>>>> -  } elsif ($err) {
>>>>>> -    chomp $err;
>>>>>> -    info("pyzor: check failed: $err");
>>>>>> -    return ();
>>>>>> -  }
>>>>>> -
>>>>>> -  return @resp;
>>>>>> -}
>>>>>> -
>>>>>> -sub check_tick {
>>>>>> -  my ($self, $opts) = @_;
>>>>>> -  $self->_check_forked_result($opts->{permsgstatus}, 0);
>>>>>> -}
>>>>>> -
>>>>>> -sub check_cleanup {
>>>>>> -  my ($self, $opts) = @_;
>>>>>> -  $self->_check_forked_result($opts->{permsgstatus}, 1);
>>>>>> -}
>>>>>> -
>>>>>> -sub _check_forked_result {
>>>>>> -  my ($self, $pms, $finish) = @_;
>>>>>> -
>>>>>> -  return 0 if !$pms->{pyzor_backchannel};
>>>>>> -  return 0 if !$pms->{pyzor_pid};
>>>>>> +    my ( $self, $permsgstatus, $fulltext ) = @_;
>>>>>> +    my $conf = $self->{main}->{conf};
>>>>>> +    my $timeout = $conf->{pyzor_timeout};
>>>>>> +
>>>>>> +    my $client = ( $self->{'_pyzor_client'} ||= Mail::SpamAssassin::Pyzor::Client->new( 'timeout' => $timeout ) );
>>>>>> +    my $digest = Mail::SpamAssassin::Pyzor::Digest::get( $fulltext );
>>>>>> +
>>>>>> +    local $@;
>>>>>> +    my $ref = eval { $client->check($digest); };
>>>>>> +    dbg("pyzor: got response: $client->{'_server_host'}");
>>>>>> +    # $client reply must be an hash
>>>>>> +    return 0 if (not (ref $ref eq ref {}));
>>>>>> +    if ($@) {
>>>>>> +        my $err = $@;
>>>>>> -  my $timer = $self->{main}->time_method("check_pyzor");
>>>>>> +        $err = eval { $err->get_message() } || $err;
>>>>>> -  $pms->{pyzor_abort} = $pms->{deadline_exceeded} || $pms->{shortcircuited};
>>>>>> -
>>>>>> -  my $kid_pid = $pms->{pyzor_pid};
>>>>>> -  # if $finish, force waiting for the child
>>>>>> -  my $pid = waitpid($kid_pid, $finish && !$pms->{pyzor_abort} ? 0 : WNOHANG);
>>>>>> -  if ($pid == 0) {
>>>>>> -    #dbg("pyzor: child process $kid_pid not finished yet, trying later");
>>>>>> -    if ($pms->{pyzor_abort}) {
>>>>>> -      dbg("pyzor: bailing out due to deadline/shortcircuit");
>>>>>> -      kill('TERM', $kid_pid);
>>>>>> -      if (waitpid($kid_pid, WNOHANG) == 0) {
>>>>>> -        sleep(1);
>>>>>> -        if (waitpid($kid_pid, WNOHANG) == 0) {
>>>>>> -          dbg("pyzor: child process $kid_pid still alive, KILL");
>>>>>> -          kill('KILL', $kid_pid);
>>>>>> -          waitpid($kid_pid, 0);
>>>>>> +        warn("pyzor: check failed: $err\n");
>>>>>> +        return 0;
>>>>>> +    } elsif ( defined $ref->{'Code'} and $ref->{'Code'} ne 200 ) {
>>>>>> +        if(defined $ref->{'Code'} and defined $ref->{'Diag'}) {
>>>>>> +          dbg("pyzor: check failed with invalid code: $ref->{'Code'}: $ref->{'Diag'}");
>>>>>> +        } else {
>>>>>> +          dbg("pyzor: check failed with undefined code");
>>>>>>           }
>>>>>> -      }
>>>>>> -      delete $pms->{pyzor_pid};
>>>>>> -      delete $pms->{pyzor_backchannel};
>>>>>> +        return 0;
>>>>>>       }
>>>>>> -    return 0;
>>>>>> -  } elsif ($pid == -1) {
>>>>>> -    # child does not exist?
>>>>>> -    dbg("pyzor: child process $kid_pid already handled?");
>>>>>> -    delete $pms->{pyzor_backchannel};
>>>>>> -    return 0;
>>>>>> -  }
>>>>>> -  $pms->rule_ready($pms->{pyzor_rulename}); # mark rule ready for metas
>>>>>> +    my $pyzor_count       = untaint_var($ref->{'Count'}) + 0;
>>>>>> +    my $pyzor_whitelisted = untaint_var($ref->{'WL-Count'}) + 0;
>>>>>> +    my $count_min = $conf->{pyzor_count_min};
>>>>>> +    my $wl_min = $conf->{pyzor_whitelist_min};
>>>>>> -  dbg("pyzor: child process $kid_pid finished, reading results");
>>>>>> +    my $wl_limit = $pyzor_whitelisted >= $wl_min ?
>>>>>> +      $pyzor_count * $conf->{pyzor_whitelist_factor} : 0;
>>>>>> -  my $backmsg;
>>>>>> -  my $ret = sysread($pms->{pyzor_backchannel}->{latest_kid_fh}, $backmsg, PIPE_BUF);
>>>>>> -  if (!defined $ret || $ret == 0) {
>>>>>> -    dbg("pyzor: could not read result from child: ".($ret == 0 ? 0 : $!));
>>>>>> -    delete $pms->{pyzor_backchannel};
>>>>>> -    return 0;
>>>>>> -  }
>>>>>> -
>>>>>> -  delete $pms->{pyzor_backchannel};
>>>>>> +    $permsgstatus->set_tag('PYZOR', "Reported $pyzor_count times, whitelisted $pyzor_whitelisted times.");
>>>>>> -  my $results;
>>>>>> -  eval {
>>>>>> -    $results = Storable::thaw($backmsg);
>>>>>> -  };
>>>>>> -  if ($@) {
>>>>>> -    dbg("pyzor: child return value thaw failed: $@");
>>>>>> -    return;
>>>>>> -  }
>>>>>> -
>>>>>> -  $self->_check_result($pms, $results);
>>>>>> -}
>>>>>> +    dbg("pyzor: result: COUNT=$pyzor_count/$count_min WHITELIST=$pyzor_whitelisted/$wl_min/%.1f",
>>>>>> +      $wl_limit);
>>>>>> -sub _check_result {
>>>>>> -  my ($self, $pms, $results) = @_;
>>>>>> -
>>>>>> -  if (!@$results) {
>>>>>> -    dbg("pyzor: no response from server");
>>>>>> -    return 0;
>>>>>> -  }
>>>>>> -
>>>>>> -  my $count = 0;
>>>>>> -  my $count_wl = 0;
>>>>>> -  foreach my $res (@$results) {
>>>>>> -    chomp($res);
>>>>>> -    if ($res =~ /^Traceback/) {
>>>>>> -      info("pyzor: internal error, python traceback seen in response: $res");
>>>>>> +    # Empty body etc results in same hash, we should skip very large numbers..
>>>>>> +    if ($pyzor_count >= 1000000 || $pyzor_whitelisted >= 10000) {
>>>>>> +      dbg("pyzor: result exceeded hardcoded limits, ignoring: count/wl 1000000/10000");
>>>>>>         return 0;
>>>>>>       }
>>>>>> -    dbg("pyzor: got response: $res");
>>>>>> -    # this regexp is intended to be a little bit forgiving
>>>>>> -    if ($res =~ /^\S+\t.*?\t(\d+)\t(\d+)\s*$/) {
>>>>>> -      # until pyzor servers can sync their DBs,
>>>>>> -      # sum counts obtained from all servers
>>>>>> -      $count += untaint_var($1)+0; # crazy but needs untainting
>>>>>> -      $count_wl += untaint_var($2)+0;
>>>>>> -    } else {
>>>>>> -      # warn on failures to parse
>>>>>> -      info("pyzor: failure to parse response \"$res\"");
>>>>>> -    }
>>>>>> -  }
>>>>>> -
>>>>>> -  my $conf = $self->{main}->{conf};
>>>>>> -
>>>>>> -  my $count_min = $conf->{pyzor_count_min};
>>>>>> -  my $wl_min = $conf->{pyzor_whitelist_min};
>>>>>> -  my $wl_limit = $count_wl >= $wl_min ?
>>>>>> -    $count * $conf->{pyzor_whitelist_factor} : 0;
>>>>>> -
>>>>>> -  dbg("pyzor: result: COUNT=$count/$count_min WHITELIST=$count_wl/$wl_min/%.1f",
>>>>>> -    $wl_limit);
>>>>>> -  $pms->set_tag('PYZOR', "Reported $count times, whitelisted $count_wl times.");
>>>>>> -
>>>>>> -  # Empty body etc results in same hash, we should skip very large numbers..
>>>>>> -  if ($count >= 1000000 || $count_wl >= 10000) {
>>>>>> -    dbg("pyzor: result exceeded hardcoded limits, ignoring: count/wl 1000000/10000");
>>>>>> -    return 0;
>>>>>> -  }
>>>>>> -
>>>>>> -  # Whitelisted?
>>>>>> -  if ($wl_limit && $count_wl >= $wl_limit) {
>>>>>> -    dbg("pyzor: message whitelisted");
>>>>>> -    return 0;
>>>>>> -  }
>>>>>> +    # Whitelisted?
>>>>>> +    if ($wl_limit && $pyzor_whitelisted >= $wl_limit) {
>>>>>> +      dbg("pyzor: message whitelisted");
>>>>>> +      return 0;
>>>>>> +    }
>>>>>> -  if ($count >= $count_min) {
>>>>>> -    if ($conf->{pyzor_fork}) {
>>>>>> -      # forked needs to run got_hit()
>>>>>> -      $pms->got_hit($pms->{pyzor_rulename}, "", ruletype => 'eval');
>>>>>> +    if ( $pyzor_count >= $count_min ) {
>>>>>> +      return 1;
>>>>>>       }
>>>>>> -    return 1;
>>>>>> -  }
>>>>>> -  return 0;
>>>>>> +    return 0;
>>>>>>   }
>>>>>>   sub plugin_report {
>>>>>>     my ($self, $options) = @_;
>>>>>> -  return if !$self->{pyzor_available};
>>>>>> -  return if !$self->{main}->{conf}->{use_pyzor};
>>>>>> -  return if $options->{report}->{options}->{dont_report_to_pyzor};
>>>>>> -  return if !$self->is_pyzor_available();
>>>>>> -
>>>>>> -  # use temporary file: open2() is unreliable due to buffering under spamd
>>>>>> -  my $tmpf = $options->{report}->create_fulltext_tmpfile($options->{text});
>>>>>> -  if ($self->pyzor_report($options, $tmpf)) {
>>>>>> -    $options->{report}->{report_available} = 1;
>>>>>> -    info("reporter: spam reported to Pyzor");
>>>>>> -    $options->{report}->{report_return} = 1;
>>>>>> -  }
>>>>>> -  else {
>>>>>> -    info("reporter: could not report spam to Pyzor");
>>>>>> -  }
>>>>>> -  $options->{report}->delete_fulltext_tmpfile($tmpf);
>>>>>> +  return unless $self->{pyzor_available};
>>>>>> +  return unless $self->{main}->{conf}->{use_pyzor};
>>>>>> -  return 1;
>>>>>> +  if (!$options->{report}->{options}->{dont_report_to_pyzor} && $self->is_pyzor_available())
>>>>>> +  {
>>>>>> +    if ($self->pyzor_report($options)) {
>>>>>> +      $options->{report}->{report_available} = 1;
>>>>>> +      info("reporter: spam reported to Pyzor");
>>>>>> +      $options->{report}->{report_return} = 1;
>>>>>> +    }
>>>>>> +    else {
>>>>>> +      info("reporter: could not report spam to Pyzor");
>>>>>> +    }
>>>>>> +  }
>>>>>>   }
>>>>>>   sub pyzor_report {
>>>>>> -  my ($self, $options, $tmpf) = @_;
>>>>>> -
>>>>>> -  # note: not really tainted, this came from system configuration file
>>>>>> -  my $path = untaint_file_path($options->{report}->{conf}->{pyzor_path});
>>>>>> -  my $opts = untaint_var($options->{report}->{conf}->{pyzor_options}) || '';
>>>>>> +    my ( $self, $options ) = @_;
>>>>>> -  my $timeout = $self->{main}->{conf}->{pyzor_timeout};
>>>>>> +    my $timeout = $self->{main}->{conf}->{pyzor_timeout};
>>>>>> -  $options->{report}->enter_helper_run_mode();
>>>>>> +    my $client = ( $self->{'_pyzor_client'} ||= Mail::SpamAssassin::Pyzor::Client->new( 'timeout' => $timeout ) );
>>>>>> -  my $timer = Mail::SpamAssassin::Timeout->new({ secs => $timeout });
>>>>>> -  my $err = $timer->run_and_catch(sub {
>>>>>> +    my $digest = Mail::SpamAssassin::Pyzor::Digest::get( $options->{'text'} );
>>>>>> -    local $SIG{PIPE} = sub { die "__brokenpipe__ignore__\n" };
>>>>>> -
>>>>>> -    dbg("pyzor: opening pipe: " . join(' ', $path, $opts, "report", "< $tmpf"));
>>>>>> -
>>>>>> -    my $pid = Mail::SpamAssassin::Util::helper_app_pipe_open(*PYZOR,
>>>>>> -        $tmpf, 1, $path, split(' ', $opts), "report");
>>>>>> -    $pid or die "$!\n";
>>>>>> -
>>>>>> -    my($inbuf,$nread,$nread_all); $nread_all = 0;
>>>>>> -    # response is ignored, just check its existence
>>>>>> -    while ( $nread=read(PYZOR,$inbuf,8192) ) { $nread_all += $nread }
>>>>>> -    defined $nread  or die "error reading from pipe: $!";
>>>>>> -
>>>>>> -    dbg("pyzor: empty response")  if $nread_all < 1;
>>>>>> -
>>>>>> -    my $errno = 0;  close PYZOR or $errno = $!;
>>>>>> -    # closing a pipe also waits for the process executing on the pipe to
>>>>>> -    # complete, no need to explicitly call waitpid
>>>>>> -    # my $child_stat = waitpid($pid,0) > 0 ? $? : undef;
>>>>>> -    if (proc_status_ok($?,$errno, 0)) {
>>>>>> -      dbg("pyzor: [%s] reporter finished successfully", $pid);
>>>>>> -    } else {
>>>>>> -      info("pyzor: [%s] reporter error: %s", $pid, exit_status_str($?,$errno));
>>>>>> +    local $@;
>>>>>> +    my $ref = eval { $client->report($digest); };
>>>>>> +    if ($@) {
>>>>>> +        warn("pyzor: report failed: $@");
>>>>>> +        return 0;
>>>>>>       }
>>>>>> -
>>>>>> -  });
>>>>>> -
>>>>>> -  $options->{report}->leave_helper_run_mode();
>>>>>> -
>>>>>> -  if ($timer->timed_out()) {
>>>>>> -    dbg("reporter: pyzor report timed out after $timeout seconds");
>>>>>> -    return 0;
>>>>>> -  }
>>>>>> -
>>>>>> -  if ($err) {
>>>>>> -    chomp $err;
>>>>>> -    if ($err eq '__brokenpipe__ignore__') {
>>>>>> -      dbg("reporter: pyzor report failed: broken pipe");
>>>>>> -    } else {
>>>>>> -      warn("reporter: pyzor report failed: $err\n");
>>>>>> +    elsif ( $ref->{'Code'} ne 200 ) {
>>>>>> +        dbg("pyzor: report failed with invalid code: $ref->{'Code'}: $ref->{'Diag'}");
>>>>>> +        return 0;
>>>>>>       }
>>>>>> -    return 0;
>>>>>> -  }
>>>>>> -  return 1;
>>>>>> +    return 1;
>>>>>>   }
>>>>>> -# Version features
>>>>>> -sub has_fork { 1 }
>>>>>> -
>>>>>>   1;
>>>>>> -
>>>>>> -=back
>>>>>> -
>>>>>> -=cut
>>>>>> diff --git a/lib/Mail/SpamAssassin/Pyzor.pm b/lib/Mail/SpamAssassin/Pyzor.pm
>>>>>> new file mode 100644
>>>>>> index 0000000..8ac27f4
>>>>>> --- /dev/null
>>>>>> +++ b/lib/Mail/SpamAssassin/Pyzor.pm
>>>>>> @@ -0,0 +1,56 @@
>>>>>> +package Mail::SpamAssassin::Pyzor;
>>>>>> +
>>>>>> +# Copyright 2018 cPanel, LLC.
>>>>>> +# All rights reserved.
>>>>>> +# http://cpanel.net
>>>>>> +#
>>>>>> +# <@LICENSE>
>>>>>> +# Licensed to the Apache Software Foundation (ASF) under one or more
>>>>>> +# contributor license agreements.  See the NOTICE file distributed with
>>>>>> +# this work for additional information regarding copyright ownership.
>>>>>> +# The ASF licenses this file to you under the Apache License, Version 2.0
>>>>>> +# (the "License"); you may not use this file except in compliance with
>>>>>> +# the License.  You may obtain a copy of the License at:
>>>>>> +#
>>>>>> +#     http://www.apache.org/licenses/LICENSE-2.0
>>>>>> +#
>>>>>> +# Unless required by applicable law or agreed to in writing, software
>>>>>> +# distributed under the License is distributed on an "AS IS" BASIS,
>>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>>>> +# See the License for the specific language governing permissions and
>>>>>> +# limitations under the License.
>>>>>> +# </@LICENSE>
>>>>>> +#
>>>>>> +
>>>>>> +use strict;
>>>>>> +use warnings;
>>>>>> +
>>>>>> +our $VERSION = '0.06_01';
>>>>>> +
>>>>>> +=encoding utf-8
>>>>>> +
>>>>>> +=head1 NAME
>>>>>> +
>>>>>> +Mail::SpamAssassin::Pyzor - Pyzor spam filtering in Perl
>>>>>> +
>>>>>> +=head1 DESCRIPTION
>>>>>> +
>>>>>> +This distribution contains Perl implementations of parts of
>>>>>> +L<Pyzor|http://pyzor.org>, a tool for use in spam email filtering.
>>>>>> +It is intended for use with L<Mail::SpamAssassin> but may be useful
>>>>>> +in other contexts.
>>>>>> +
>>>>>> +See the following modules for information on specific tools that
>>>>>> +the distribution includes:
>>>>>> +
>>>>>> +=over
>>>>>> +
>>>>>> +=item * L<Mail::SpamAssassin::Pyzor::Client>
>>>>>> +
>>>>>> +=item * L<Mail::SpamAssassin::Pyzor::Digest>
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +1;
>>>>>> diff --git a/lib/Mail/SpamAssassin/Pyzor/Client.pm b/lib/Mail/SpamAssassin/Pyzor/Client.pm
>>>>>> new file mode 100644
>>>>>> index 0000000..ccff868
>>>>>> --- /dev/null
>>>>>> +++ b/lib/Mail/SpamAssassin/Pyzor/Client.pm
>>>>>> @@ -0,0 +1,415 @@
>>>>>> +package Mail::SpamAssassin::Pyzor::Client;
>>>>>> +
>>>>>> +# Copyright 2018 cPanel, LLC.
>>>>>> +# All rights reserved.
>>>>>> +# http://cpanel.net
>>>>>> +#
>>>>>> +# <@LICENSE>
>>>>>> +# Licensed to the Apache Software Foundation (ASF) under one or more
>>>>>> +# contributor license agreements.  See the NOTICE file distributed with
>>>>>> +# this work for additional information regarding copyright ownership.
>>>>>> +# The ASF licenses this file to you under the Apache License, Version 2.0
>>>>>> +# (the "License"); you may not use this file except in compliance with
>>>>>> +# the License.  You may obtain a copy of the License at:
>>>>>> +#
>>>>>> +#     http://www.apache.org/licenses/LICENSE-2.0
>>>>>> +#
>>>>>> +# Unless required by applicable law or agreed to in writing, software
>>>>>> +# distributed under the License is distributed on an "AS IS" BASIS,
>>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>>>> +# See the License for the specific language governing permissions and
>>>>>> +# limitations under the License.
>>>>>> +# </@LICENSE>
>>>>>> +#
>>>>>> +
>>>>>> +use strict;
>>>>>> +use warnings;
>>>>>> +
>>>>>> +=encoding utf-8
>>>>>> +
>>>>>> +=head1 NAME
>>>>>> +
>>>>>> +Mail::SpamAssassin::Pyzor::Client - Pyzor client logic
>>>>>> +
>>>>>> +=head1 SYNOPSIS
>>>>>> +
>>>>>> +    use Mail::SpamAssassin::Pyzor::Client ();
>>>>>> +    use Mail::SpamAssassin::Pyzor::Digest ();
>>>>>> +
>>>>>> +    my $client = Mail::SpamAssassin::Pyzor::Client->new();
>>>>>> +
>>>>>> +    my $digest = Mail::SpamAssassin::Pyzor::Digest::get( $msg );
>>>>>> +
>>>>>> +    my $check_ref = $client->check($digest);
>>>>>> +    die $check_ref->{'Diag'} if $check_ref->{'Code'} ne '200';
>>>>>> +
>>>>>> +    my $report_ref = $client->report($digest);
>>>>>> +    die $report_ref->{'Diag'} if $report_ref->{'Code'} ne '200';
>>>>>> +
>>>>>> +=head1 DESCRIPTION
>>>>>> +
>>>>>> +A bare-bones L<Pyzor|http://pyzor.org> client that currently only
>>>>>> +implements the functionality needed for L<Mail::SpamAssassin>.
>>>>>> +
>>>>>> +=head1 PROTOCOL DETAILS
>>>>>> +
>>>>>> +The Pyzor protocol is not a published standard, and there appears to be
>>>>>> +no meaningful public documentation. What follows is enough information,
>>>>>> +largely gleaned through forum posts and reverse engineering, to facilitate
>>>>>> +effective use of this module:
>>>>>> +
>>>>>> +Pyzor is an RPC-oriented, message-based protocol. Each message
>>>>>> +is a simple dictionary of 7-bit ASCII keys and values. Server responses
>>>>>> +always include at least the following:
>>>>>> +
>>>>>> +=over
>>>>>> +
>>>>>> +=item * C<Code> - Similar to HTTP status codes; anything besides C<200>
>>>>>> +is an error.
>>>>>> +
>>>>>> +=item * C<Diag> - Similar to HTTP status reasons: a text description
>>>>>> +of the status.
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +(NB: There are additional standard response headers that are useful only for
>>>>>> +the protocol itself and thus are not part of this module's returns.)
>>>>>> +
>>>>>> +=head2 Reliability
>>>>>> +
>>>>>> +Pyzor uses UDP rather than TCP, so no message is guaranteed to reach its
>>>>>> +destination. A transmission failure can happen in either the request or
>>>>>> +the response; in either case, the client times out. On timeout the client
>>>>>> +emits a warning and the request method returns 0 instead of a hashref.
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +our $VERSION = '0.04';
>>>>>> +
>>>>>> +our $DEFAULT_SERVER_HOST    = 'public.pyzor.org';
>>>>>> +our $DEFAULT_SERVER_PORT    = 24441;
>>>>>> +our $DEFAULT_USERNAME       = 'anonymous';
>>>>>> +our $DEFAULT_PASSWORD       = '';
>>>>>> +our $DEFAULT_OP_SPEC        = '20,3,60,3';
>>>>>> +our $PYZOR_PROTOCOL_VERSION = 2.1;
>>>>>> +our $DEFAULT_TIMEOUT        = 3.5;
>>>>>> +our $READ_SIZE              = 8192;
>>>>>> +
>>>>>> +use IO::Socket::INET ();
>>>>>> +use Digest::SHA qw(sha1 sha1_hex);
>>>>>> +
>>>>>> +my @hash_order = ( 'Op', 'Op-Digest', 'Op-Spec', 'Thread', 'PV', 'User', 'Time', 'Sig' );
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +=head1 CONSTRUCTOR
>>>>>> +
>>>>>> +=head2 new(%OPTS)
>>>>>> +
>>>>>> +Create a new pyzor client.
>>>>>> +
>>>>>> +=over 2
>>>>>> +
>>>>>> +=item Input
>>>>>> +
>>>>>> +%OPTS are (all optional):
>>>>>> +
>>>>>> +=over 3
>>>>>> +
>>>>>> +=item * C<server_host> - The pyzor server host to connect to (default is
>>>>>> +C<public.pyzor.org>)
>>>>>> +
>>>>>> +=item * C<server_port> - The pyzor server port to connect to (default is
>>>>>> +24441)
>>>>>> +
>>>>>> +=item * C<username> - The username to present to the pyzor server (default
>>>>>> +is C<anonymous>)
>>>>>> +
>>>>>> +=item * C<password> - The password to present to the pyzor server (default
>>>>>> +is empty)
>>>>>> +
>>>>>> +=item * C<timeout> - The maximum time, in seconds, to wait for a response
>>>>>> +from the pyzor server (default is 3.5)
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=item Output
>>>>>> +
>>>>>> +=over 3
>>>>>> +
>>>>>> +Returns a L<Mail::SpamAssassin::Pyzor::Client> object.
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +sub new {
>>>>>> +    my ( $class, %OPTS ) = @_;
>>>>>> +
>>>>>> +    return bless {
>>>>>> +        '_server_host' => $OPTS{'server_host'} || $DEFAULT_SERVER_HOST,
>>>>>> +        '_server_port' => $OPTS{'server_port'} || $DEFAULT_SERVER_PORT,
>>>>>> +        '_username'    => $OPTS{'username'}    || $DEFAULT_USERNAME,
>>>>>> +        '_password'    => $OPTS{'password'}    || $DEFAULT_PASSWORD,
>>>>>> +        '_op_spec'     => $DEFAULT_OP_SPEC,
>>>>>> +        '_timeout'     => $OPTS{'timeout'} || $DEFAULT_TIMEOUT,
>>>>>> +    }, $class;
>>>>>> +}
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +=head1 REQUEST METHODS
>>>>>> +
>>>>>> +=head2 report($digest)
>>>>>> +
>>>>>> +Report the digest of a spam message to the pyzor server. This function
>>>>>> +dies if the socket cannot be created, and returns 0 if the request times out.
>>>>>> +
>>>>>> +=over 2
>>>>>> +
>>>>>> +=item Input
>>>>>> +
>>>>>> +=over 3
>>>>>> +
>>>>>> +=item $digest C<SCALAR>
>>>>>> +
>>>>>> +The message digest to report, as given by
>>>>>> +C<Mail::SpamAssassin::Pyzor::Digest::get()>.
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=item Output
>>>>>> +
>>>>>> +=over 3
>>>>>> +
>>>>>> +=item C<HASHREF>
>>>>>> +
>>>>>> +Returns a hashref of the standard attributes noted above.
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +sub report {
>>>>>> +    my ( $self, $digest ) = @_;
>>>>>> +
>>>>>> +    my $msg_ref = $self->_get_base_msg( 'report', $digest );
>>>>>> +
>>>>>> +    $msg_ref->{'Op-Spec'} = $self->{'_op_spec'};
>>>>>> +
>>>>>> +    return $self->_send_receive_msg($msg_ref);
>>>>>> +}
>>>>>> +
>>>>>> +=head2 check($digest)
>>>>>> +
>>>>>> +Check the digest of a message to see if
>>>>>> +the pyzor server has a report for it. This function
>>>>>> +dies if the socket cannot be created, and returns 0 if the request times out.
>>>>>> +
>>>>>> +=over 2
>>>>>> +
>>>>>> +=item Input
>>>>>> +
>>>>>> +=over 3
>>>>>> +
>>>>>> +=item $digest C<SCALAR>
>>>>>> +
>>>>>> +The message digest to check, as given by
>>>>>> +C<Mail::SpamAssassin::Pyzor::Digest::get()>.
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=item Output
>>>>>> +
>>>>>> +=over 3
>>>>>> +
>>>>>> +=item C<HASHREF>
>>>>>> +
>>>>>> +Returns a hashref of the standard attributes noted above
>>>>>> +as well as the following:
>>>>>> +
>>>>>> +=over
>>>>>> +
>>>>>> +=item * C<Count> - The number of reports the server has received
>>>>>> +for the given digest.
>>>>>> +
>>>>>> +=item * C<WL-Count> - The number of whitelist requests the server has received
>>>>>> +for the given digest.
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=back
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +sub check {
>>>>>> +    my ( $self, $digest ) = @_;
>>>>>> +
>>>>>> +    return $self->_send_receive_msg( $self->_get_base_msg( 'check', $digest ) );
>>>>>> +}
>>>>>> +
>>>>>> +# ----------------------------------------
>>>>>> +
>>>>>> +sub _send_receive_msg {
>>>>>> +    my ( $self, $msg_ref ) = @_;
>>>>>> +
>>>>>> +    my $thread_id = $msg_ref->{'Thread'} or warn 'No thread ID?';
>>>>>> +
>>>>>> +    $self->_sign_msg($msg_ref);
>>>>>> +
>>>>>> +    return $self->_do_send_receive(
>>>>>> +        $self->_generate_packet_from_message($msg_ref) . "\n\n",
>>>>>> +        $thread_id,
>>>>>> +    );
>>>>>> +}
>>>>>> +
>>>>>> +sub _get_base_msg {
>>>>>> +    my ( $self, $op, $digest ) = @_;
>>>>>> +
>>>>>> +    die "Implementor error: op is required" if !$op;
>>>>>> +    die "error: digest is required"         if !$digest;
>>>>>> +
>>>>>> +    return {
>>>>>> +        'User'      => $self->{'_username'},
>>>>>> +        'PV'        => $PYZOR_PROTOCOL_VERSION,
>>>>>> +        'Time'      => time(),
>>>>>> +        'Op'        => $op,
>>>>>> +        'Op-Digest' => $digest,
>>>>>> +        'Thread'    => $self->_generate_thread_id()
>>>>>> +    };
>>>>>> +}
>>>>>> +
>>>>>> +sub _do_send_receive {
>>>>>> +    my ( $self, $packet, $thread_id ) = @_;
>>>>>> +
>>>>>> +    my $sock = $self->_get_connection_or_die();
>>>>>> +
>>>>>> +    $self->_send_packet( $sock, $packet );
>>>>>> +    my $response = $self->_receive_packet( $sock, $thread_id );
>>>>>> +
>>>>>> +    return 0 if not defined $response;
>>>>>> +
>>>>>> +    my $resp_hr = { map { ( split(m{: }) )[ 0, 1 ] } split( m{\n}, $response ) };
>>>>>> +
>>>>>> +    delete $resp_hr->{'Thread'};
>>>>>> +
>>>>>> +    my $response_pv = delete $resp_hr->{'PV'};
>>>>>> +
>>>>>> +    if ( $PYZOR_PROTOCOL_VERSION ne $response_pv ) {
>>>>>> +        warn "Unexpected protocol version ($response_pv) in Pyzor response!";
>>>>>> +    }
>>>>>> +
>>>>>> +    return $resp_hr;
>>>>>> +}
>>>>>> +
>>>>>> +sub _receive_packet {
>>>>>> +    my ( $self, $sock, $thread_id ) = @_;
>>>>>> +
>>>>>> +    my $timeout = $self->{'_timeout'} * 1000;
>>>>>> +
>>>>>> +    my $end_time = time + $self->{'_timeout'};
>>>>>> +
>>>>>> +    $sock->blocking(0);
>>>>>> +    my $response = '';
>>>>>> +    my $rout     = '';
>>>>>> +    my $rin      = '';
>>>>>> +    vec( $rin, fileno($sock), 1 ) = 1;
>>>>>> +
>>>>>> +    while (1) {
>>>>>> +        my $time_left = $end_time - time;
>>>>>> +
>>>>>> +        if ( $time_left <= 0 ) {
>>>>>> +          warn("Did not receive a response from the pyzor server $self->{'_server_host'}:$self->{'_server_port'} for $self->{'_timeout'} seconds!");
>>>>>> +          return;
>>>>>> +        }
>>>>>> +
>>>>>> +        my $bytes = sysread( $sock, $response, $READ_SIZE, length $response );
>>>>>> +        if ( !defined($bytes) && !$!{'EAGAIN'} && !$!{'EWOULDBLOCK'} ) {
>>>>>> +            warn "read from socket: $!";
>>>>>> +        }
>>>>>> +
>>>>>> +        if ( index( $response, "\n\n" ) > -1 ) {
>>>>>> +
>>>>>> +            # Reject the response unless its thread ID matches what we sent.
>>>>>> +            # This prevents confusion among concurrent Pyzor requests.
>>>>>> +            if ( index( $response, "\nThread: $thread_id\n" ) != -1 ) {
>>>>>> +                last;
>>>>>> +            }
>>>>>> +            else {
>>>>>> +                $response = '';
>>>>>> +            }
>>>>>> +        }
>>>>>> +
>>>>>> +        my $found = select( $rout = $rin, undef, undef, $time_left );
>>>>>> +        warn "select(): $!" if $found == -1;
>>>>>> +    }
>>>>>> +
>>>>>> +    return $response;
>>>>>> +}
>>>>>> +
>>>>>> +sub _send_packet {
>>>>>> +    my ( $self, $sock, $packet ) = @_;
>>>>>> +
>>>>>> +    $sock->blocking(1);
>>>>>> +    syswrite( $sock, $packet ) or warn "write to socket: $!";
>>>>>> +
>>>>>> +    return;
>>>>>> +}
>>>>>> +
>>>>>> +sub _get_connection_or_die {
>>>>>> +    my ($self) = @_;
>>>>>> +
>>>>>> +    # clear the socket if the PID changes
>>>>>> +    if ( defined $self->{'_sock_pid'} && $self->{'_sock_pid'} != $$ ) {
>>>>>> +        undef $self->{'_sock_pid'};
>>>>>> +        undef $self->{'_sock'};
>>>>>> +    }
>>>>>> +
>>>>>> +    $self->{'_sock_pid'} ||= $$;
>>>>>> +    $self->{'_sock'}     ||= IO::Socket::INET->new(
>>>>>> +        'PeerHost' => $self->{'_server_host'},
>>>>>> +        'PeerPort' => $self->{'_server_port'},
>>>>>> +        'Proto'    => 'udp'
>>>>>> +    ) or die "Cannot connect to $self->{'_server_host'}:$self->{'_server_port'}: $@ $!";
>>>>>> +
>>>>>> +    return $self->{'_sock'};
>>>>>> +}
>>>>>> +
>>>>>> +sub _sign_msg {
>>>>>> +    my ( $self, $msg_ref ) = @_;
>>>>>> +
>>>>>> +    $msg_ref->{'Sig'} = lc Digest::SHA::sha1_hex(
>>>>>> +        Digest::SHA::sha1( $self->_generate_packet_from_message($msg_ref) )
>>>>>> +    );
>>>>>> +
>>>>>> +    return 1;
>>>>>> +}
>>>>>> +
>>>>>> +sub _generate_packet_from_message {
>>>>>> +    my ( $self, $msg_ref ) = @_;
>>>>>> +
>>>>>> +    return join( "\n", map { "$_: $msg_ref->{$_}" } grep { length $msg_ref->{$_} } @hash_order );
>>>>>> +}
>>>>>> +
>>>>>> +sub _generate_thread_id {
>>>>>> +    my $RAND_MAX = 2**16;
>>>>>> +    my $val      = 0;
>>>>>> +    $val = int rand($RAND_MAX) while $val < 1024;
>>>>>> +    return $val;
>>>>>> +}
>>>>>> +
>>>>>> +sub _get_user_pass_hash_key {
>>>>>> +    my ($self) = @_;
>>>>>> +
>>>>>> +    return lc Digest::SHA::sha1_hex( $self->{'_username'} . ':' . $self->{'_password'} );
>>>>>> +}
>>>>>> +
>>>>>> +1;
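
For anyone reviewing the Client.pm wire format above, here is a minimal
standalone sketch of how a request is serialized and how a response is parsed
back into a hash. The field order is copied from @hash_order in the patch; the
helper names build_packet and parse_response and all sample values are made up
for illustration, and nothing here touches the network.

```perl
use strict;
use warnings;

# Field order copied from @hash_order in the patch.
my @hash_order = ( 'Op', 'Op-Digest', 'Op-Spec', 'Thread', 'PV', 'User', 'Time', 'Sig' );

# Hypothetical helper: serialize a message hash as "Key: value" lines,
# skipping fields that are undefined or empty.
sub build_packet {
    my ($msg) = @_;
    return join "\n",
      map  { "$_: $msg->{$_}" }
      grep { defined $msg->{$_} && length $msg->{$_} } @hash_order;
}

# Hypothetical helper: turn "Key: value" lines back into a hash.
sub parse_response {
    my ($text) = @_;
    my %resp;
    for my $line ( split /\n/, $text ) {
        my ( $key, $value ) = split /: /, $line, 2;
        next if !defined $value;    # blank or malformed line
        $resp{$key} = $value;
    }
    return \%resp;
}

# Sample request fields (illustrative values only).
my $packet = build_packet( {
    'Op'        => 'check',
    'Op-Digest' => 'da39a3ee5e6b4b0d3255bfef95601890afd80709',
    'Thread'    => 12345,
    'PV'        => '2.1',
    'User'      => 'anonymous',
    'Time'      => 1634400000,
} );

# Sample response text in the same "Key: value" shape.
my $resp = parse_response(
    "Code: 200\nDiag: OK\nPV: 2.1\nThread: 12345\nCount: 0\nWL-Count: 0\n\n"
);

print "$packet\n";
print "Code=$resp->{'Code'} Count=$resp->{'Count'}\n";
```

One difference from the patch: the sketch splits each line with a limit of 2,
so a value that itself contains ": " is kept whole, whereas the [ 0, 1 ] slice
in _do_send_receive would keep only the part before the second separator.
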
>>>>>> diff --git a/lib/Mail/SpamAssassin/Pyzor/Digest.pm b/lib/Mail/SpamAssassin/Pyzor/Digest.pm
>>>>>> new file mode 100644
>>>>>> index 0000000..0e8a5ae
>>>>>> --- /dev/null
>>>>>> +++ b/lib/Mail/SpamAssassin/Pyzor/Digest.pm
>>>>>> @@ -0,0 +1,103 @@
>>>>>> +package Mail::SpamAssassin::Pyzor::Digest;
>>>>>> +
>>>>>> +# Copyright 2018 cPanel, LLC.
>>>>>> +# All rights reserved.
>>>>>> +# http://cpanel.net
>>>>>> +#
>>>>>> +# <@LICENSE>
>>>>>> +# Licensed to the Apache Software Foundation (ASF) under one or more
>>>>>> +# contributor license agreements.  See the NOTICE file distributed with
>>>>>> +# this work for additional information regarding copyright ownership.
>>>>>> +# The ASF licenses this file to you under the Apache License, Version 2.0
>>>>>> +# (the "License"); you may not use this file except in compliance with
>>>>>> +# the License.  You may obtain a copy of the License at:
>>>>>> +#
>>>>>> +#     http://www.apache.org/licenses/LICENSE-2.0
>>>>>> +#
>>>>>> +# Unless required by applicable law or agreed to in writing, software
>>>>>> +# distributed under the License is distributed on an "AS IS" BASIS,
>>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>>>> +# See the License for the specific language governing permissions and
>>>>>> +# limitations under the License.
>>>>>> +# </@LICENSE>
>>>>>> +#
>>>>>> +
>>>>>> +use strict;
>>>>>> +use warnings;
>>>>>> +
>>>>>> +=encoding utf-8
>>>>>> +
>>>>>> +=head1 NAME
>>>>>> +
>>>>>> +Mail::SpamAssassin::Pyzor::Digest
>>>>>> +
>>>>>> +=head1 SYNOPSIS
>>>>>> +
>>>>>> +    my $digest = Mail::SpamAssassin::Pyzor::Digest::get( \$mime_text );
>>>>>> +
>>>>>> +=head1 DESCRIPTION
>>>>>> +
>>>>>> +A reimplementation of L<https://github.com/SpamExperts/pyzor/blob/master/pyzor/digest.py>.
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +use Email::MIME ();
>>>>>> +
>>>>>> +use Mail::SpamAssassin::Pyzor::Digest::Pieces ();
>>>>>> +use Digest::SHA qw(sha1_hex);
>>>>>> +
>>>>>> +our $VERSION = '0.03';
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +=head1 FUNCTIONS
>>>>>> +
>>>>>> +=head2 $hex = get( $MSG )
>>>>>> +
>>>>>> +This takes a scalar reference to an email message in raw MIME text format
>>>>>> +(i.e., as saved in the standard mbox format) and returns the message's
>>>>>> +Pyzor digest in lower-case hexadecimal.
>>>>>> +
>>>>>> +The output from this function should normally be identical to that of
>>>>>> +the C<pyzor> script's C<digest> command. It is suitable for use in
>>>>>> +L<Mail::SpamAssassin::Pyzor::Client>'s request methods.
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +sub get {
>>>>>> +    my ($text) = @_;
>>>>>> +    return Digest::SHA::sha1_hex( ${ _get_predigest( $text ) } );
>>>>>> +}
>>>>>> +
>>>>>> +# NB: This is called from the test.
>>>>>> +sub _get_predigest {    ## no critic qw(RequireArgUnpacking)
>>>>>> +    my ($msg_text_sr) = @_;
>>>>>> +
>>>>>> +    my $parsed = Email::MIME->new($$msg_text_sr);
>>>>>> +
>>>>>> +    my @lines;
>>>>>> +
>>>>>> +    my $payloads_ar = Mail::SpamAssassin::Pyzor::Digest::Pieces::digest_payloads($parsed);
>>>>>> +
>>>>>> +    for my $payload (@$payloads_ar) {
>>>>>> +        my @p_lines = Mail::SpamAssassin::Pyzor::Digest::Pieces::splitlines($payload);
>>>>>> +        for my $line (@p_lines) {
>>>>>> +            Mail::SpamAssassin::Pyzor::Digest::Pieces::normalize($line);
>>>>>> +
>>>>>> +            next if !Mail::SpamAssassin::Pyzor::Digest::Pieces::should_handle_line($line);
>>>>>> +
>>>>>> +            # Make sure we have an octet string.
>>>>>> +            utf8::encode($line) if utf8::is_utf8($line);
>>>>>> +
>>>>>> +            push @lines, $line;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    my $digest_sr = Mail::SpamAssassin::Pyzor::Digest::Pieces::assemble_lines( \@lines );
>>>>>> +
>>>>>> +    return $digest_sr;
>>>>>> +}
>>>>>> +
>>>>>> +1;
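
To make the per-line loop in _get_predigest above easier to follow, here is a
minimal standalone sketch of the split, normalize, and filter steps applied to
a plain-text body. It assumes Perl 5.14 or later (for the /a modifier) and
skips MIME parsing and charset decoding entirely; normalize_line and the sample
text are made up for illustration and are not part of the patch.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the normalization steps in the patch.
sub normalize_line {
    my ($line) = @_;
    $line =~ tr/\0//d;               # drop NULs
    $line =~ s/\S{10,}//ag;          # drop runs of 10+ non-space characters
    $line =~ s/\S+\@\S+//ag;         # drop email-like tokens
    $line =~ s/[a-zA-Z]+:\S+//ag;    # drop URL-like tokens
    $line =~ tr/ \x09-\x0d//d;       # drop ASCII whitespace
    $line =~ s/\A\s+|\s+\z//g;       # trim any remaining whitespace at the ends
    return $line;
}

# Sample body with mixed line endings (illustrative text only).
my $body = join "\r\n",
  'Visit http://example.com/x now or mail bob@example.com today',
  'ping a@b.co soon',
  'see x:1';

# Split like splitlines(), normalize, then keep lines of 8+ characters.
my @kept =
  grep { length($_) >= 8 }
  map  { normalize_line($_) }
  split /\r\n?|\n/, $body;

print "$_\n" for @kept;
# Visitnowormailtoday
# pingsoon
```

Note that in the first line the URL and the address are already removed by the
10-character rule, so the later patterns only matter for short tokens such as
a@b.co or x:1.
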
>>>>>> diff --git a/lib/Mail/SpamAssassin/Pyzor/Digest/Pieces.pm b/lib/Mail/SpamAssassin/Pyzor/Digest/Pieces.pm
>>>>>> new file mode 100644
>>>>>> index 0000000..522accd
>>>>>> --- /dev/null
>>>>>> +++ b/lib/Mail/SpamAssassin/Pyzor/Digest/Pieces.pm
>>>>>> @@ -0,0 +1,301 @@
>>>>>> +package Mail::SpamAssassin::Pyzor::Digest::Pieces;
>>>>>> +
>>>>>> +# Copyright 2018 cPanel, LLC.
>>>>>> +# All rights reserved.
>>>>>> +# http://cpanel.net
>>>>>> +#
>>>>>> +# <@LICENSE>
>>>>>> +# Licensed to the Apache Software Foundation (ASF) under one or more
>>>>>> +# contributor license agreements.  See the NOTICE file distributed with
>>>>>> +# this work for additional information regarding copyright ownership.
>>>>>> +# The ASF licenses this file to you under the Apache License, Version 2.0
>>>>>> +# (the "License"); you may not use this file except in compliance with
>>>>>> +# the License.  You may obtain a copy of the License at:
>>>>>> +#
>>>>>> +#     http://www.apache.org/licenses/LICENSE-2.0
>>>>>> +#
>>>>>> +# Unless required by applicable law or agreed to in writing, software
>>>>>> +# distributed under the License is distributed on an "AS IS" BASIS,
>>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>>>> +# See the License for the specific language governing permissions and
>>>>>> +# limitations under the License.
>>>>>> +# </@LICENSE>
>>>>>> +#
>>>>>> +
>>>>>> +use strict;
>>>>>> +use warnings;
>>>>>> +
>>>>>> +=encoding utf-8
>>>>>> +
>>>>>> +=head1 NAME
>>>>>> +
>>>>>> +Mail::SpamAssassin::Pyzor::Digest::Pieces
>>>>>> +
>>>>>> +=head1 DESCRIPTION
>>>>>> +
>>>>>> +This module houses backend logic for L<Mail::SpamAssassin::Pyzor::Digest>.
>>>>>> +
>>>>>> +It reimplements logic found in pyzor's F<digest.py> module
>>>>>> +(L<https://github.com/SpamExperts/pyzor/blob/master/pyzor/digest.py>).
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +use Email::MIME::ContentType ();
>>>>>> +use Encode                   ();
>>>>>> +
>>>>>> +our $VERSION = '0.03';
>>>>>> +
>>>>>> +# each tuple is [ offset, length ]
>>>>>> +use constant _HASH_SPEC => ( [ 20, 3 ], [ 60, 3 ] );
>>>>>> +
>>>>>> +use constant {
>>>>>> +    _MIN_LINE_LENGTH => 8,
>>>>>> +
>>>>>> +    _ATOMIC_NUM_LINES => 4,
>>>>>> +};
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +=head1 FUNCTIONS
>>>>>> +
>>>>>> +=head2 $strings_ar = digest_payloads( $EMAIL_MIME )
>>>>>> +
>>>>>> +This imitates the corresponding object method in F<digest.py>.
>>>>>> +It returns a reference to an array of strings. Each string can be either
>>>>>> +a byte string or a character string (e.g., UTF-8 decoded).
>>>>>> +
>>>>>> +NB: RFC 2822 stipulates that message bodies should use CRLF
>>>>>> +line breaks, not plain LF (nor plain CR). L<Email::MIME::Encodings>
>>>>>> +will thus convert any plain CRs in a quoted-printable message
>>>>>> +body into CRLF. Python, though, doesn't do this, so the output of
>>>>>> +our implementation of C<digest_payloads()> diverges from that of the Python
>>>>>> +original. It doesn't ultimately make a difference since the line-ending
>>>>>> +whitespace gets trimmed regardless, but it's necessary to factor in when
>>>>>> +comparing the output of our implementation with the Python output.
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +sub digest_payloads {
>>>>>> +    my ($parsed) = @_;
>>>>>> +
>>>>>> +    my @subparts = $parsed->subparts();
>>>>>> +
>>>>>> +    my @payloads;
>>>>>> +
>>>>>> +    if (@subparts) {
>>>>>> +        @payloads = map { @{ digest_payloads($_) } } $parsed->subparts();
>>>>>> +    }
>>>>>> +    else {
>>>>>> +        my ( $main_type, $subtype, $encoding, $encode_check ) = parse_content_type( $parsed->content_type() );
>>>>>> +
>>>>>> +        my $payload;
>>>>>> +
>>>>>> +        if ( $main_type eq 'text' ) {
>>>>>> +
>>>>>> +            # Decode transfer encoding, but leave us as a byte string.
>>>>>> +            # Note that this is where Email::MIME converts plain LF to CRLF.
>>>>>> +            $payload = $parsed->body();
>>>>>> +
>>>>>> +            # This does the actual character decoding (i.e., "charset").
>>>>>> +            $payload = Encode::decode( $encoding, $payload, $encode_check );
>>>>>> +
>>>>>> +            if ( $subtype eq 'html' ) {
>>>>>> +                require Mail::SpamAssassin::Pyzor::Digest::StripHtml;
>>>>>> +                $payload = Mail::SpamAssassin::Pyzor::Digest::StripHtml::strip($payload);
>>>>>> +            }
>>>>>> +        }
>>>>>> +        else {
>>>>>> +
>>>>>> +            # This does no decoding, even of, e.g., quoted-printable or base64.
>>>>>> +            $payload = $parsed->body_raw();
>>>>>> +        }
>>>>>> +
>>>>>> +        push @payloads, $payload;
>>>>>> +    }
>>>>>> +
>>>>>> +    return \@payloads;
>>>>>> +}
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +=head2 normalize( $STRING )
>>>>>> +
>>>>>> +This imitates the corresponding object method in F<digest.py>.
>>>>>> +It modifies C<$STRING> in-place.
>>>>>> +
>>>>>> +As with the original implementation, if C<$STRING> contains (decoded)
>>>>>> +Unicode characters, those characters will be parsed accordingly. So:
>>>>>> +
>>>>>> +    $str = "123\xc2\xa0";   # [ c2 a0 ] == \u00a0, non-breaking space
>>>>>> +
>>>>>> +    normalize($str);
>>>>>> +
>>>>>> +The above will leave C<$str> alone, but this:
>>>>>> +
>>>>>> +    utf8::decode($str);
>>>>>> +
>>>>>> +    normalize($str);
>>>>>> +
>>>>>> +... will trim off the last two bytes from C<$str>.
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +sub normalize {    ## no critic qw( Subroutines::RequireArgUnpacking )
>>>>>> +
>>>>>> +    # NULs are bad, mm-kay?
>>>>>> +    $_[0] =~ tr<\0><>d;
>>>>>> +
>>>>>> +    # NB: Python's \s without re.UNICODE is the same as Perl's \s
>>>>>> +    # with the /a modifier.
>>>>>> +    #
>>>>>> +    # https://docs.python.org/2/library/re.html
>>>>>> +    # https://perldoc.perl.org/perlrecharclass.html#Backslash-sequences
>>>>>> +
>>>>>> +    # Python: re.compile(r'\S{10,}')
>>>>>> +    $_[0] =~ s<\S{10,}><>ag;
>>>>>> +
>>>>>> +    # Python: re.compile(r'\S+@\S+')
>>>>>> +    $_[0] =~ s<\S+ @ \S+><>agx;
>>>>>> +
>>>>>> +    # Python: re.compile(r'[a-z]+:\S+', re.IGNORECASE)
>>>>>> +    $_[0] =~ s<[a-zA-Z]+ : \S+><>agx;
>>>>>> +
>>>>>> +    # (from digest.py ...)
>>>>>> +    # Make sure we do the whitespace last because some of the previous
>>>>>> +    # patterns rely on whitespace.
>>>>>> +    $_[0] =~ tr< \x09-\x0d><>d;
>>>>>> +
>>>>>> +    # This is fun. digest.py's normalize() does a non-UNICODE whitespace
>>>>>> +    # strip, then calls strip() on the string, which *will* strip Unicode
>>>>>> +    # whitespace from the ends.
>>>>>> +    $_[0] =~ s<\A\s+><>;
>>>>>> +    $_[0] =~ s<\s+\z><>;
>>>>>> +
>>>>>> +    return;
>>>>>> +}
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +=head2 $yn = should_handle_line( $STRING )
>>>>>> +
>>>>>> +This imitates the corresponding object method in F<digest.py>.
>>>>>> +It returns a boolean.
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +sub should_handle_line {
>>>>>> +    return $_[0] && length( $_[0] ) >= _MIN_LINE_LENGTH();
>>>>>> +}
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +=head2 $sr = assemble_lines( \@LINES )
>>>>>> +
>>>>>> +This assembles a string buffer out of @LINES. The string is the buffer
>>>>>> +of octets that will be hashed to produce the message digest.
>>>>>> +
>>>>>> +Each member of @LINES is expected to be an B<octet string>, not a
>>>>>> +character string.
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +sub assemble_lines {
>>>>>> +    my ($lines_ar) = @_;
>>>>>> +
>>>>>> +    if ( @$lines_ar <= _ATOMIC_NUM_LINES() ) {
>>>>>> +
>>>>>> +        # cf. handle_atomic() in digest.py
>>>>>> +        return \join( q<>, @$lines_ar );
>>>>>> +    }
>>>>>> +
>>>>>> +    #----------------------------------------------------------------------
>>>>>> +    # cf. handle_pieced() in digest.py
>>>>>> +
>>>>>> +    my $str = q<>;
>>>>>> +
>>>>>> +    for my $ofs_len ( _HASH_SPEC() ) {
>>>>>> +        my ( $offset, $length ) = @$ofs_len;
>>>>>> +
>>>>>> +        for my $i ( 0 .. ( $length - 1 ) ) {
>>>>>> +            my $idx = int( $offset * @$lines_ar / 100 ) + $i;
>>>>>> +
>>>>>> +            next if !defined $lines_ar->[$idx];
>>>>>> +
>>>>>> +            $str .= $lines_ar->[$idx];
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    return \$str;
>>>>>> +}
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +=head2 ($main, $sub, $encoding, $checkval) = parse_content_type( $CONTENT_TYPE )
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +use constant _QUOTED_PRINTABLE_NAMES => (
>>>>>> +    "quopri-codec",
>>>>>> +    "quopri",
>>>>>> +    "quoted-printable",
>>>>>> +    "quotedprintable",
>>>>>> +);
>>>>>> +
>>>>>> +# Make Encode::decode() ignore anything that doesn't fit the
>>>>>> +# given encoding.
>>>>>> +use constant _encode_check_ignore => q<>;
>>>>>> +
>>>>>> +sub parse_content_type {
>>>>>> +    my ($content_type) = @_;
>>>>>> +
>>>>>> +    $Email::MIME::ContentType::STRICT_PARAMS = 0;
>>>>>> +    my $ct_parse = Email::MIME::ContentType::parse_content_type(
>>>>>> +        $content_type,
>>>>>> +    );
>>>>>> +
>>>>>> +    my $main = $ct_parse->{'type'}    || q<>;
>>>>>> +    my $sub  = $ct_parse->{'subtype'} || q<>;
>>>>>> +
>>>>>> +    my $encoding = $ct_parse->{'attributes'}{'charset'};
>>>>>> +
>>>>>> +    my $checkval;
>>>>>> +
>>>>>> +    if ($encoding) {
>>>>>> +
>>>>>> +        # Lower-case everything, convert underscore to dash, and remove NUL.
>>>>>> +        $encoding =~ tr<A-Z_\0><a-z->d;
>>>>>> +
>>>>>> +        # Apparently pyzor accommodates messages that put the transfer
>>>>>> +        # encoding in the Content-Type.
>>>>>> +        if ( grep { $_ eq $encoding } _QUOTED_PRINTABLE_NAMES() ) {
>>>>>> +            $checkval = Encode::FB_CROAK();
>>>>>> +        }
>>>>>> +    }
>>>>>> +    else {
>>>>>> +        $encoding = 'ascii';
>>>>>> +    }
>>>>>> +
>>>>>> +    # Match Python .decode()'s 'ignore' behavior
>>>>>> +    $checkval ||= \&_encode_check_ignore;
>>>>>> +
>>>>>> +    return ( $main, $sub, $encoding, $checkval );
>>>>>> +}
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +=head2 @lines = splitlines( $TEXT )
>>>>>> +
>>>>>> +Imitates C<str.splitlines()>. (cf. C<pydoc str>)
>>>>>> +
>>>>>> +Returns a plain list in list context. Returns the number of
>>>>>> +items to be returned in scalar context.
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +sub splitlines {
>>>>>> +    return split m<\r\n?|\n>, $_[0];
>>>>>> +}
>>>>>> +
>>>>>> +1;
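
The line-selection arithmetic in assemble_lines() above is easy to misread, so
here is a minimal standalone sketch of which line indexes end up in the hashed
buffer. The spec pairs and the 4-line threshold are copied from _HASH_SPEC and
_ATOMIC_NUM_LINES in the patch; selected_indexes and the sample lines are made
up for illustration.

```perl
use strict;
use warnings;
use Digest::SHA ();

# Values copied from the patch: [ offset in percent, number of lines ] pairs,
# and the line count at or below which every line is used.
my @hash_spec        = ( [ 20, 3 ], [ 60, 3 ] );
my $atomic_num_lines = 4;

# Hypothetical helper: return the indexes of the lines that feed the digest.
sub selected_indexes {
    my ($num_lines) = @_;

    # Short messages: every line is used, in order.
    return ( 0 .. $num_lines - 1 ) if $num_lines <= $atomic_num_lines;

    my @idx;
    for my $spec (@hash_spec) {
        my ( $offset, $length ) = @$spec;
        my $start = int( $offset * $num_lines / 100 );

        # Indexes past the end of the message are skipped.
        push @idx, grep { $_ < $num_lines } $start .. $start + $length - 1;
    }
    return @idx;
}

# Ten placeholder lines standing in for already-normalized text.
my @lines  = map { "normalizedline$_" } 0 .. 9;
my @picked = selected_indexes( scalar @lines );

print "picked: @picked\n";    # picked: 2 3 4 6 7 8
print Digest::SHA::sha1_hex( join q<>, @lines[@picked] ), "\n";
```

With only 5 lines the two windows overlap (indexes 1 2 3 and then 3 4), so line
3 is appended twice. That follows from the loop in the patch as written.
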
>>>>>> diff --git a/lib/Mail/SpamAssassin/Pyzor/Digest/StripHtml.pm b/lib/Mail/SpamAssassin/Pyzor/Digest/StripHtml.pm
>>>>>> new file mode 100644
>>>>>> index 0000000..2617b4a
>>>>>> --- /dev/null
>>>>>> +++ b/lib/Mail/SpamAssassin/Pyzor/Digest/StripHtml.pm
>>>>>> @@ -0,0 +1,177 @@
>>>>>> +package Mail::SpamAssassin::Pyzor::Digest::StripHtml;
>>>>>> +
>>>>>> +# Copyright 2018 cPanel, LLC.
>>>>>> +# All rights reserved.
>>>>>> +# http://cpanel.net
>>>>>> +#
>>>>>> +# <@LICENSE>
>>>>>> +# Licensed to the Apache Software Foundation (ASF) under one or more
>>>>>> +# contributor license agreements.  See the NOTICE file distributed with
>>>>>> +# this work for additional information regarding copyright ownership.
>>>>>> +# The ASF licenses this file to you under the Apache License, Version 2.0
>>>>>> +# (the "License"); you may not use this file except in compliance with
>>>>>> +# the License.  You may obtain a copy of the License at:
>>>>>> +#
>>>>>> +#     http://www.apache.org/licenses/LICENSE-2.0
>>>>>> +#
>>>>>> +# Unless required by applicable law or agreed to in writing, software
>>>>>> +# distributed under the License is distributed on an "AS IS" BASIS,
>>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>>>> +# See the License for the specific language governing permissions and
>>>>>> +# limitations under the License.
>>>>>> +# </@LICENSE>
>>>>>> +#
>>>>>> +
>>>>>> +use strict;
>>>>>> +use warnings;
>>>>>> +
>>>>>> +=encoding utf-8
>>>>>> +
>>>>>> +=head1 NAME
>>>>>> +
>>>>>> +Mail::SpamAssassin::Pyzor::Digest::StripHtml
>>>>>> +
>>>>>> +=head1 SYNOPSIS
>>>>>> +
>>>>>> +    my $stripped = Mail::SpamAssassin::Pyzor::Digest::StripHtml::strip($html);
>>>>>> +
>>>>>> +=head1 DESCRIPTION
>>>>>> +
>>>>>> +This module attempts to duplicate pyzor's HTML-stripping logic.
>>>>>> +
>>>>>> +=head1 ACCURACY
>>>>>> +
>>>>>> +This library cannot achieve 100%, bug-for-bug parity with pyzor
>>>>>> +because to do so would require duplicating Python's own HTML parsing
>>>>>> +library. Since that library's output has changed over time, and those
>>>>>> +changes in turn affect pyzor, it's literally impossible to arrive at
>>>>>> +a single, fully-compatible reimplementation.
>>>>>> +
>>>>>> +That said, all known divergences between pyzor and this library involve
>>>>>> +invalid HTML as input.
>>>>>> +
>>>>>> +Please open bug reports for any divergences you identify, particularly
>>>>>> +if the input is valid HTML.
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +use HTML::Parser ();
>>>>>> +
>>>>>> +our $VERSION = '0.03';
>>>>>> +
>>>>>> +#----------------------------------------------------------------------
>>>>>> +
>>>>>> +=head1 FUNCTIONS
>>>>>> +
>>>>>> +=head2 $stripped = strip( $HTML )
>>>>>> +
>>>>>> +Give it some HTML, and it'll give back the stripped text.
>>>>>> +
>>>>>> +In B<general>, the stripping consists of removing tags as well as
>>>>>> +C<E<lt>scriptE<gt>> and C<E<lt>styleE<gt>> elements; however, it also
>>>>>> +removes HTML entities.
>>>>>> +
>>>>>> +This tries very hard to duplicate pyzor's behavior with invalid HTML.
>>>>>> +
>>>>>> +=cut
>>>>>> +
>>>>>> +sub strip {
>>>>>> +    my ($html) = @_;
>>>>>> +
>>>>>> +    $html =~ s<\A\s+><>;
>>>>>> +    $html =~ s<\s+\z><>;
>>>>>> +
>>>>>> +    my $p = HTML::Parser->new( api_version => 3 );
>>>>>> +
>>>>>> +    my @pieces;
>>>>>> +
>>>>>> +    my $accumulate = 1;
>>>>>> +
>>>>>> +    $p->handler(
>>>>>> +        start => sub {
>>>>>> +            my ($tagname) = @_;
>>>>>> +
>>>>>> +            $accumulate = 0 if $tagname eq 'script';
>>>>>> +            $accumulate = 0 if $tagname eq 'style';
>>>>>> +
>>>>>> +            return;
>>>>>> +        },
>>>>>> +        'tagname',
>>>>>> +    );
>>>>>> +
>>>>>> +    $p->handler(
>>>>>> +        end => sub {
>>>>>> +            $accumulate = 1;
>>>>>> +            return;
>>>>>> +        }
>>>>>> +    );
>>>>>> +
>>>>>> +    $p->handler(
>>>>>> +        text => sub {
>>>>>> +            my ($copy) = @_;
>>>>>> +
>>>>>> +            return if !$accumulate;
>>>>>> +
>>>>>> +            # pyzor's HTML parser discards HTML entities. On top of that,
>>>>>> +            # we need to match, as closely as possible, pyzor's handling of
>>>>>> +            # invalid HTML entities, which is a function of Python's
>>>>>> +            # standard HTML parsing library. This will probably never be
>>>>>> +            # fully compatible with pyzor, but we can get it close.
>>>>>> +
>>>>>> +            # The original is:
>>>>>> +            #
>>>>>> +            #   re.compile('&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')
>>>>>> +            #
>>>>>> +            # The parsing loop then "backs up" one byte if the last
>>>>>> +            # character isn't a ";". We use a look-ahead assertion to
>>>>>> +            # mimic that behavior.
>>>>>> +            $copy =~ s<\&\# (?:[0-9]+ | [xX][0-9a-fA-F]+) (?: ; | \z | (?=[^0-9a-fA-F]) )>< >gx;
>>>>>> +
>>>>>> +            # The original is:
>>>>>> +            #
>>>>>> +            #   re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
>>>>>> +            #
>>>>>> +            # We again use a look-ahead assertion to mimic Python.
>>>>>> +            $copy =~ s<\& [a-zA-Z] [-.a-zA-Z0-9]* (?: ; | \z | (?=[^a-zA-Z0-9]) )>< >gx;
>>>>>> +
>>>>>> +            # Python's HTMLParser aborts its parsing loop when it encounters
>>>>>> +            # an invalid numeric reference.
>>>>>> +            $copy =~ s<\&\#
>>>>>> +                (?:
>>>>>> +                    [^0-9xX]        # anything but the expected first char
>>>>>> +                    |
>>>>>> +                    [0-9]+[a-fA-F]  # hex within decimal
>>>>>> +                    |
>>>>>> +                    [xX][^0-9a-fA-F]
>>>>>> +                )
>>>>>> +                (.*)
>>>>>> +            ><
>>>>>> +                ( -1 == index($1, ';') ) ? q<> : '&#'
>>>>>> +            >exs;
>>>>>> +
>>>>>> +            # Python's HTMLParser treats invalid entities as incomplete
>>>>>> +            $copy =~ s<(\&\#?)><$1 >gx;
>>>>>> +
>>>>>> +            $copy =~ s<\A\s+><>;
>>>>>> +            $copy =~ s<\s+\z><>;
>>>>>> +
>>>>>> +            push @pieces, \$copy if length $copy;
>>>>>> +        },
>>>>>> +        'text,tagname',
>>>>>> +    );
>>>>>> +
>>>>>> +    $p->parse($html);
>>>>>> +    $p->eof();
>>>>>> +
>>>>>> +    my $payload = join( q< >, map { $$_ } @pieces );
>>>>>> +
>>>>>> +    # Convert all sequences of whitespace OTHER THAN non-breaking spaces to
>>>>>> +    # plain spaces.
>>>>>> +    $payload =~ s<[^\S\x{a0}]+>< >g;
>>>>>> +
>>>>>> +    return $payload;
>>>>>> +}
>>>>>> +
>>>>>> +1;
>>>>>> diff --git a/t/pyzor.t b/t/pyzor.t
>>>>>> index 891f38d..e4ef83f 100755
>>>>>> --- a/t/pyzor.t
>>>>>> +++ b/t/pyzor.t
>>>>>> @@ -3,12 +3,9 @@
>>>>>>   use lib '.'; use lib 't';
>>>>>>   use SATest; sa_t_init("pyzor");
>>>>>> -use constant HAS_PYZOR => eval { $_ = untaint_cmd("which pyzor"); chomp; -x };
>>>>>> -
>>>>>>   use Test::More;
>>>>>>   plan skip_all => "Net tests disabled" unless conf_bool('run_net_tests');
>>>>>> -plan skip_all => "Pyzor executable not found in path" unless HAS_PYZOR;
>>>>>> -plan tests => 8;
>>>>>> +plan tests => 5;
>>>>>>   diag('Note: Failures may not be an SpamAssassin bug, as Pyzor tests can fail due to problems with the Pyzor servers.');
>>>>>> @@ -30,7 +27,7 @@ tstprefs ("
>>>>>>   sarun ("-t < data/spam/pyzor", \&patterns_run_cb);
>>>>>>   ok_all_patterns();
>>>>>>   # Same with fork
>>>>>> -sarun ("--cf='pyzor_fork 1' -t < data/spam/pyzor", \&patterns_run_cb);
>>>>>> +sarun ("-t < data/spam/pyzor", \&patterns_run_cb);
>>>>>>   ok_all_patterns();
>>>>>>   #TESTING FOR HAM
>>>>>> @@ -44,7 +41,3 @@ ok_all_patterns();
>>>>>>   sarun ("-D pyzor -t < data/nice/001 2>&1", \&patterns_run_cb);
>>>>>>   ok_all_patterns();
>>>>>> -# same with fork
>>>>>> -sarun ("-D pyzor --cf='pyzor_fork 1' -t < data/nice/001 2>&1", \&patterns_run_cb);
>>>>>> -ok_all_patterns();
>>>>>> -
>>>>>
>>>
>> -- 
>> Kevin A. McGrail
>> kmcgr...@apache.org
>>
>> Member, Apache Software Foundation
>> Chair Emeritus Apache SpamAssassin Project
>> https://www.linkedin.com/in/kmcgrail - 703.798.0171
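
For anyone reviewing the StripHtml hunk quoted above, here is a minimal
sketch of how the two entity substitutions and the final whitespace
normalization behave on plain strings, outside of HTML::Parser. The regexes
are copied from the patch; the helper names strip_entities and
normalize_space are hypothetical and exist only for this illustration. It
deliberately leaves out the later substitutions that deal with invalid
numeric references, so it is not a replacement for strip().

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the patch): the first two entity
# substitutions from the text handler, applied to a plain string.
sub strip_entities {
    my ($text) = @_;

    # Numeric references. A trailing ';' is consumed; any other
    # terminator is only looked at, so it stays in the text.
    $text =~ s<\&\# (?:[0-9]+ | [xX][0-9a-fA-F]+) (?: ; | \z | (?=[^0-9a-fA-F]) )>< >gx;

    # Named references, using the same look-ahead idea.
    $text =~ s<\& [a-zA-Z] [-.a-zA-Z0-9]* (?: ; | \z | (?=[^a-zA-Z0-9]) )>< >gx;

    return $text;
}

# Hypothetical helper: the last step of strip(). Collapses runs of
# whitespace into one space, but leaves non-breaking spaces alone.
sub normalize_space {
    my ($text) = @_;
    $text =~ s<[^\S\x{a0}]+>< >g;
    return $text;
}

print strip_entities('a&#65;b'), "\n";     # "a b": the ';' is consumed
print strip_entities('a&#65 b'), "\n";     # "a  b": the original space is kept
print strip_entities('x&amp;y'), "\n";     # "x y"
print strip_entities('a&#65b'),  "\n";     # unchanged: 'b' counts as a hex digit
print normalize_space("a\t\n b"), "\n";    # "a b"
```

The fourth call shows why the patch needs its additional substitutions: a
decimal reference followed directly by a-f is not matched by either regex
here and is left for the "hex within decimal" branch later in the handler.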
