Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Kevin A. McGrail Thu, 06 Jun 2019 04:48:26 -0700

That is a frightening one liner.  Should we use it?

As for the more output comment, if you have emails with 300 lower case e's,
you get 300 hits for the subtext.  It is unreadable for rule analysis.


As for modifying the normal output, I have no idea if anyone out there is
using the public routine so better to be safe.

I didn't find a tag for subtests either. That might be a good 4.0 addition.

Regards, KAM

On Thu, Jun 6, 2019, 01:30 Henrik K <h...@hege.li> wrote:

>
> Well in theory you see _more_ debug output now when there are no
> duplicates,
> due to the stats string..  honestly atleast I wouldn't care about that.
> Feel free to vote.
>
> As a silly morning exercise, here's a one-liner that compacts stuff :-P
>
> my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
> my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;
>
> __A,__B,__C(3),__CC,__D(2),__E(2)
>
>
> On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
> > Good point, Henrik & John.
> >
> > OK, I've left the output alone except for the calls from dbg so it
> > shouldn't break anything in the public interface.
> >
> > Thoughts on this version?
> >
> > Regards,
> > KAM
> >
> > On 6/4/2019 1:51 PM, John Hardin wrote:
> > > On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
> > >
> > >> Yes, I was thinking about that and wanting to fix uritests so well
> > >> for the
> > >> template.   Thanks for the feedback.  I will take another pass at it.
> > >
> > > Just do the deduplication without modifying the output format.
> > >
> > > If we want to log the hit counts, then make another function that does
> > > what you did and use it for logging.
> > >
> > >
> > >> On Tue, Jun 4, 2019, 03:23 Henrik K <h...@hege.li> wrote:
> > >>
> > >>>
> > >>> If you want to modify debug output, you have to modify only the dbg()
> > >>> output
> > >>> itself.  You can't modify internal functions that have specific
> output
> > >>> formats and start adding random strings to them.  Atleast these
> places
> > >>> depend on the comma delimited rules:
> > >>>
> > >>> ./masses/mass-check:    push @tests, split(/,/,
> > >>> $status->get_names_of_subtests_hit());
> > >>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 }
> > >>> split(/,/,$msg->get_names_of_tests_hit()),
> > >>> split(/,/,$msg->get_names_of_subtests_hit());
> > >>> ./t.rules/run:  my $testsline =
> > >>>
> $status->get_names_of_tests_hit().",".$status->get_names_of_subtests_hit();
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
> > >>>> Morning All,
> > >>>>
> > >>>> After a few thoughts on limits, it appears that any duplicate
> subtest
> > >>>> hits are best combined for debug output.
> > >>>>
> > >>>> Any thoughts on the attached?  It looks like it will help me with
> rule
> > >>>> development while support rules with valid but large maxhits like
> > >>> __LOWER_E
> > >>>>
> > >>>> Regards,
> > >>>> KAM
> > >>>>
> > >>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
> > >>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
> > >>>>>
> > >>>>>> I was curious if anyone noticed the debug output for subtests has
> > >>> gotten
> > >>>>>> insane:
> > >>>>>
> > >>>>> It got a little discussion on users@ when I created those rules.
> > >>>>>
> > >>>>> [...]
> > >>>>>
> > >>>>>> 72_active.cf:    body            __LOWER_E       /e/
> > >>>>>> 72_active.cf:    tflags          __LOWER_E       multiple
> > >>>>>> maxhits=230
> > >>>>>>
> > >>>>>> 72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
> > >>>>>> 72_active.cf:    tflags          __E_LIKE_LETTER multiple
> > >>>>>> maxhits=320
> > >>>>>>
> > >>>>>> Assuming those maxhits are correct,
> > >>>>>
> > >>>>> They are. In fact they were carefully tuned to catch the targeted
> > >>>>> extortion spam.
> > >>>>>
> > >>>>>> maybe we need something in the debug
> > >>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
> > >>>>>
> > >>>>> That would be a useful enhancement even without my flagrant log
> > >>>>> vandalism.
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Kevin A. McGrail
> > >>>> Member, Apache Software Foundation
> > >>>> Chair Emeritus Apache SpamAssassin Project
> > >>>> https://www.linkedin.com/in/kmcgrail - 703.798.0171
> > >>>>
> > >>>
> > >>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> > >>>> ===================================================================
> > >>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> > >>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> > >>>> @@ -769,7 +769,38 @@
> > >>>>  sub get_names_of_subtests_hit {
> > >>>>    my ($self) = @_;
> > >>>>
> > >>>> -  return join(',', sort @{$self->{subtest_names_hit}});
> > >>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
> > >>>> +
> > >>>> +  #This routine prints only one instance of a subrule hit with a
> > >>>> count
> > >>> of how many times it hit if greater than 1
> > >>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
> > >>> $total_hits, $deduplicated_hits);
> > >>>> +
> > >>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
> > >>>> +
> > >>>> +  for ($i=0; $i < $total_hits; $i++) {
> > >>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
> > >>>> +    $subtest_names_hit{$rule}++;
> > >>>> +  }
> > >>>> +
> > >>>> +  foreach $key (keys %subtest_names_hit) {
> > >>>> +    push (@keys, $key);
> > >>>> +  }
> > >>>> +  @sorted = sort @keys;
> > >>>> +
> > >>>> +  $deduplicated_hits = scalar(@sorted);
> > >>>> +
> > >>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
> > >>>> +    $string .= $sorted[$i];
> > >>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
> > >>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
> > >>>> +    }
> > >>>> +    $string .= ",";
> > >>>> +  }
> > >>>> +
> > >>>> +  $string =~ s/,$//;
> > >>>> +
> > >>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated
> Total
> > >>> Hits: $deduplicated_hits)";
> > >>>> +
> > >>>> +  return $string;
> > >>>>  }
> > >>>>
> > >>>>
> > >>>
> ###########################################################################
> > >>>
> > >>>
> > >>>
> > >>
> > >
> >
> > --
> > Kevin A. McGrail
> > Member, Apache Software Foundation
> > Chair Emeritus Apache SpamAssassin Project
> > https://www.linkedin.com/in/kmcgrail - 703.798.0171
> >
>
> > Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> > ===================================================================
> > --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> > +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> > @@ -398,7 +398,7 @@
> >    dbg("check: is spam? score=".$self->{score}.
> >                          " required=".$self->{conf}->{required_score});
> >    dbg("check: tests=".$self->get_names_of_tests_hit());
> > -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
> > +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
> >    $self->{is_spam} = $self->is_spam();
> >
> >    $self->{main}->{resolver}->bgabort();
> > @@ -764,12 +764,52 @@
> >  normally-hidden rules, which score 0 and have names beginning with two
> >  underscores, used in meta rules.
> >
> > +If a parameter of dbg is passed, the output will be more condensed and
> > +sub-tests with multiple hits reduced to one entry with the number of
> hits
> > +in parentheses. Some information is also added at the end regarding the
> > +multiple hits.
> > +
> >  =cut
> >
> >  sub get_names_of_subtests_hit {
> > -  my ($self) = @_;
> > +  my ($self, $mode) = @_;
> >
> > -  return join(',', sort @{$self->{subtest_names_hit}});
> > +  if (defined $mode && $mode eq 'dbg') {
> > +    #This routine prints only one instance of a subrule hit with a
> count of how many times it hit if greater than 1
> > +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
> $total_hits, $deduplicated_hits);
> > +
> > +    $total_hits = scalar(@{$self->{subtest_names_hit}});
> > +
> > +    for ($i=0; $i < $total_hits; $i++) {
> > +      $rule = ${$self->{subtest_names_hit}}[$i];
> > +      $subtest_names_hit{$rule}++;
> > +    }
> > +
> > +    foreach $key (keys %subtest_names_hit) {
> > +      push (@keys, $key);
> > +    }
> > +    @sorted = sort @keys;
> > +
> > +    $deduplicated_hits = scalar(@sorted);
> > +
> > +    for ($i=0; $i < $deduplicated_hits; $i++) {
> > +      $string .= $sorted[$i];
> > +      if ($subtest_names_hit{$sorted[$i]} > 1) {
> > +        $string .= "($subtest_names_hit{$sorted[$i]})"
> > +      }
> > +      $string .= ",";
> > +    }
> > +
> > +    $string =~ s/,$//;
> > +
> > +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total
> Hits: $deduplicated_hits)";
> > +
> > +    return $string;
> > +
> > +  } else {
> > +    #return the simpler string with duplicates and commas
> > +    return join(',', sort @{$self->{subtest_names_hit}});
> > +  }
> >  }
> >
> >
> ###########################################################################
>
>
On Thu, Jun 6, 2019, 01:30 Henrik K <h...@hege.li> wrote:

>
> Well in theory you see _more_ debug output now when there are no
> duplicates,
> due to the stats string..  honestly atleast I wouldn't care about that.
> Feel free to vote.
>
> As a silly morning exercise, here's a one-liner that compacts stuff :-P
>
> my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
> my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;
>
> __A,__B,__C(3),__CC,__D(2),__E(2)
>
>
> On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
> > Good point, Henrik & John.
> >
> > OK, I've left the output alone except for the calls from dbg so it
> > shouldn't break anything in the public interface.
> >
> > Thoughts on this version?
> >
> > Regards,
> > KAM
> >
> > On 6/4/2019 1:51 PM, John Hardin wrote:
> > > On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
> > >
> > >> Yes, I was thinking about that and wanting to fix uritests so well
> > >> for the
> > >> template.   Thanks for the feedback.  I will take another pass at it.
> > >
> > > Just do the deduplication without modifying the output format.
> > >
> > > If we want to log the hit counts, then make another function that does
> > > what you did and use it for logging.
> > >
> > >
> > >> On Tue, Jun 4, 2019, 03:23 Henrik K <h...@hege.li> wrote:
> > >>
> > >>>
> > >>> If you want to modify debug output, you have to modify only the dbg()
> > >>> output
> > >>> itself.  You can't modify internal functions that have specific
> output
> > >>> formats and start adding random strings to them.  Atleast these
> places
> > >>> depend on the comma delimited rules:
> > >>>
> > >>> ./masses/mass-check:    push @tests, split(/,/,
> > >>> $status->get_names_of_subtests_hit());
> > >>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 }
> > >>> split(/,/,$msg->get_names_of_tests_hit()),
> > >>> split(/,/,$msg->get_names_of_subtests_hit());
> > >>> ./t.rules/run:  my $testsline =
> > >>>
> $status->get_names_of_tests_hit().",".$status->get_names_of_subtests_hit();
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
> > >>>> Morning All,
> > >>>>
> > >>>> After a few thoughts on limits, it appears that any duplicate
> subtest
> > >>>> hits are best combined for debug output.
> > >>>>
> > >>>> Any thoughts on the attached?  It looks like it will help me with
> rule
> > >>>> development while support rules with valid but large maxhits like
> > >>> __LOWER_E
> > >>>>
> > >>>> Regards,
> > >>>> KAM
> > >>>>
> > >>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
> > >>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
> > >>>>>
> > >>>>>> I was curious if anyone noticed the debug output for subtests has
> > >>> gotten
> > >>>>>> insane:
> > >>>>>
> > >>>>> It got a little discussion on users@ when I created those rules.
> > >>>>>
> > >>>>> [...]
> > >>>>>
> > >>>>>> 72_active.cf:    body            __LOWER_E       /e/
> > >>>>>> 72_active.cf:    tflags          __LOWER_E       multiple
> > >>>>>> maxhits=230
> > >>>>>>
> > >>>>>> 72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
> > >>>>>> 72_active.cf:    tflags          __E_LIKE_LETTER multiple
> > >>>>>> maxhits=320
> > >>>>>>
> > >>>>>> Assuming those maxhits are correct,
> > >>>>>
> > >>>>> They are. In fact they were carefully tuned to catch the targeted
> > >>>>> extortion spam.
> > >>>>>
> > >>>>>> maybe we need something in the debug
> > >>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
> > >>>>>
> > >>>>> That would be a useful enhancement even without my flagrant log
> > >>>>> vandalism.
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Kevin A. McGrail
> > >>>> Member, Apache Software Foundation
> > >>>> Chair Emeritus Apache SpamAssassin Project
> > >>>> https://www.linkedin.com/in/kmcgrail - 703.798.0171
> > >>>>
> > >>>
> > >>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> > >>>> ===================================================================
> > >>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> > >>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> > >>>> @@ -769,7 +769,38 @@
> > >>>>  sub get_names_of_subtests_hit {
> > >>>>    my ($self) = @_;
> > >>>>
> > >>>> -  return join(',', sort @{$self->{subtest_names_hit}});
> > >>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
> > >>>> +
> > >>>> +  #This routine prints only one instance of a subrule hit with a
> > >>>> count
> > >>> of how many times it hit if greater than 1
> > >>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
> > >>> $total_hits, $deduplicated_hits);
> > >>>> +
> > >>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
> > >>>> +
> > >>>> +  for ($i=0; $i < $total_hits; $i++) {
> > >>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
> > >>>> +    $subtest_names_hit{$rule}++;
> > >>>> +  }
> > >>>> +
> > >>>> +  foreach $key (keys %subtest_names_hit) {
> > >>>> +    push (@keys, $key);
> > >>>> +  }
> > >>>> +  @sorted = sort @keys;
> > >>>> +
> > >>>> +  $deduplicated_hits = scalar(@sorted);
> > >>>> +
> > >>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
> > >>>> +    $string .= $sorted[$i];
> > >>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
> > >>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
> > >>>> +    }
> > >>>> +    $string .= ",";
> > >>>> +  }
> > >>>> +
> > >>>> +  $string =~ s/,$//;
> > >>>> +
> > >>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated
> Total
> > >>> Hits: $deduplicated_hits)";
> > >>>> +
> > >>>> +  return $string;
> > >>>>  }
> > >>>>
> > >>>>
> > >>>
> ###########################################################################
> > >>>
> > >>>
> > >>>
> > >>
> > >
> >
> > --
> > Kevin A. McGrail
> > Member, Apache Software Foundation
> > Chair Emeritus Apache SpamAssassin Project
> > https://www.linkedin.com/in/kmcgrail - 703.798.0171
> >
>
> > Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> > ===================================================================
> > --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> > +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> > @@ -398,7 +398,7 @@
> >    dbg("check: is spam? score=".$self->{score}.
> >                          " required=".$self->{conf}->{required_score});
> >    dbg("check: tests=".$self->get_names_of_tests_hit());
> > -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
> > +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
> >    $self->{is_spam} = $self->is_spam();
> >
> >    $self->{main}->{resolver}->bgabort();
> > @@ -764,12 +764,52 @@
> >  normally-hidden rules, which score 0 and have names beginning with two
> >  underscores, used in meta rules.
> >
> > +If a parameter of dbg is passed, the output will be more condensed and
> > +sub-tests with multiple hits reduced to one entry with the number of
> hits
> > +in parentheses. Some information is also added at the end regarding the
> > +multiple hits.
> > +
> >  =cut
> >
> >  sub get_names_of_subtests_hit {
> > -  my ($self) = @_;
> > +  my ($self, $mode) = @_;
> >
> > -  return join(',', sort @{$self->{subtest_names_hit}});
> > +  if (defined $mode && $mode eq 'dbg') {
> > +    #This routine prints only one instance of a subrule hit with a
> count of how many times it hit if greater than 1
> > +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
> $total_hits, $deduplicated_hits);
> > +
> > +    $total_hits = scalar(@{$self->{subtest_names_hit}});
> > +
> > +    for ($i=0; $i < $total_hits; $i++) {
> > +      $rule = ${$self->{subtest_names_hit}}[$i];
> > +      $subtest_names_hit{$rule}++;
> > +    }
> > +
> > +    foreach $key (keys %subtest_names_hit) {
> > +      push (@keys, $key);
> > +    }
> > +    @sorted = sort @keys;
> > +
> > +    $deduplicated_hits = scalar(@sorted);
> > +
> > +    for ($i=0; $i < $deduplicated_hits; $i++) {
> > +      $string .= $sorted[$i];
> > +      if ($subtest_names_hit{$sorted[$i]} > 1) {
> > +        $string .= "($subtest_names_hit{$sorted[$i]})"
> > +      }
> > +      $string .= ",";
> > +    }
> > +
> > +    $string =~ s/,$//;
> > +
> > +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total
> Hits: $deduplicated_hits)";
> > +
> > +    return $string;
> > +
> > +  } else {
> > +    #return the simpler string with duplicates and commas
> > +    return join(',', sort @{$self->{subtest_names_hit}});
> > +  }
> >  }
> >
> >
> ###########################################################################
>
>

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Reply via email to