Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Henrik K Thu, 06 Jun 2019 23:52:05 -0700


What does "unreadable for rule analysis" mean?  Surely no one is actually
manually reading such lines one rule at a time?  Computers can check and
grep for you.. ;-)


I think this needs a little bit more of thought what we really want to
accomplish here and maybe do it in a bug along with the new templates and
stuff if needed..



On Thu, Jun 06, 2019 at 07:48:02AM -0400, Kevin A. McGrail wrote:
> That is a frightening one liner.  Should we use it?
> 
> As for the more output comment, if you have emails with 300 lower case e's, 
> you
> get 300 hits for the subtext.  It is unreadable for rule analysis.
> 
> As for modifying the normal output, I have no idea if anyone out there is 
> using
> the public routine so better to be safe.
> 
> I didn't find a tag for subtests either. That might be a good 4.0 addition.
> 
> Regards, KAM
> 
> On Thu, Jun 6, 2019, 01:30 Henrik K <[1]h...@hege.li> wrote:
> 
> 
>     Well in theory you see _more_ debug output now when there are no
>     duplicates,
>     due to the stats string..  honestly atleast I wouldn't care about that.
>     Feel free to vote.
> 
>     As a silly morning exercise, here's a one-liner that compacts stuff :-P
> 
>     my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
>     my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;
> 
>     __A,__B,__C(3),__CC,__D(2),__E(2)
> 
> 
>     On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
>     > Good point, Henrik & John.
>     >
>     > OK, I've left the output alone except for the calls from dbg so it
>     > shouldn't break anything in the public interface.
>     >
>     > Thoughts on this version?
>     >
>     > Regards,
>     > KAM
>     >
>     > On 6/4/2019 1:51 PM, John Hardin wrote:
>     > > On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
>     > >
>     > >> Yes, I was thinking about that and wanting to fix uritests so well
>     > >> for the
>     > >> template.   Thanks for the feedback.  I will take another pass at it.
>     > >
>     > > Just do the deduplication without modifying the output format.
>     > >
>     > > If we want to log the hit counts, then make another function that does
>     > > what you did and use it for logging.
>     > >
>     > >
>     > >> On Tue, Jun 4, 2019, 03:23 Henrik K <[2]h...@hege.li> wrote:
>     > >>
>     > >>>
>     > >>> If you want to modify debug output, you have to modify only the 
> dbg()
>     > >>> output
>     > >>> itself.  You can't modify internal functions that have specific
>     output
>     > >>> formats and start adding random strings to them.  Atleast these
>     places
>     > >>> depend on the comma delimited rules:
>     > >>>
>     > >>> ./masses/mass-check:    push @tests, split(/,/,
>     > >>> $status->get_names_of_subtests_hit());
>     > >>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 }
>     > >>> split(/,/,$msg->get_names_of_tests_hit()),
>     > >>> split(/,/,$msg->get_names_of_subtests_hit());
>     > >>> ./t.rules/run:  my $testsline =
>     > >>> $status->get_names_of_tests_hit().",".$status->
>     get_names_of_subtests_hit();
>     > >>>
>     > >>>
>     > >>>
>     > >>>
>     > >>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
>     > >>>> Morning All,
>     > >>>>
>     > >>>> After a few thoughts on limits, it appears that any duplicate
>     subtest
>     > >>>> hits are best combined for debug output.
>     > >>>>
>     > >>>> Any thoughts on the attached?  It looks like it will help me with
>     rule
>     > >>>> development while support rules with valid but large maxhits like
>     > >>> __LOWER_E
>     > >>>>
>     > >>>> Regards,
>     > >>>> KAM
>     > >>>>
>     > >>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
>     > >>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
>     > >>>>>
>     > >>>>>> I was curious if anyone noticed the debug output for subtests has
>     > >>> gotten
>     > >>>>>> insane:
>     > >>>>>
>     > >>>>> It got a little discussion on users@ when I created those rules.
>     > >>>>>
>     > >>>>> [...]
>     > >>>>>
>     > >>>>>> [3]72_active.cf:    body            __LOWER_E       /e/
>     > >>>>>> [4]72_active.cf:    tflags          __LOWER_E       multiple
>     > >>>>>> maxhits=230
>     > >>>>>>
>     > >>>>>> [5]72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
>     > >>>>>> [6]72_active.cf:    tflags          __E_LIKE_LETTER multiple
>     > >>>>>> maxhits=320
>     > >>>>>>
>     > >>>>>> Assuming those maxhits are correct,
>     > >>>>>
>     > >>>>> They are. In fact they were carefully tuned to catch the targeted
>     > >>>>> extortion spam.
>     > >>>>>
>     > >>>>>> maybe we need something in the debug
>     > >>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
>     > >>>>>
>     > >>>>> That would be a useful enhancement even without my flagrant log
>     > >>>>> vandalism.
>     > >>>>>
>     > >>>>
>     > >>>> --
>     > >>>> Kevin A. McGrail
>     > >>>> Member, Apache Software Foundation
>     > >>>> Chair Emeritus Apache SpamAssassin Project
>     > >>>> [7]https://www.linkedin.com/in/kmcgrail - 703.798.0171
>     > >>>>
>     > >>>
>     > >>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>     > >>>> ===================================================================
>     > >>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>     > >>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>     > >>>> @@ -769,7 +769,38 @@
>     > >>>>  sub get_names_of_subtests_hit {
>     > >>>>    my ($self) = @_;
>     > >>>>
>     > >>>> -  return join(',', sort @{$self->{subtest_names_hit}});
>     > >>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
>     > >>>> +
>     > >>>> +  #This routine prints only one instance of a subrule hit with a
>     > >>>> count
>     > >>> of how many times it hit if greater than 1
>     > >>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, 
> $rule,
>     > >>> $total_hits, $deduplicated_hits);
>     > >>>> +
>     > >>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
>     > >>>> +
>     > >>>> +  for ($i=0; $i < $total_hits; $i++) {
>     > >>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
>     > >>>> +    $subtest_names_hit{$rule}++;
>     > >>>> +  }
>     > >>>> +
>     > >>>> +  foreach $key (keys %subtest_names_hit) {
>     > >>>> +    push (@keys, $key);
>     > >>>> +  }
>     > >>>> +  @sorted = sort @keys;
>     > >>>> +
>     > >>>> +  $deduplicated_hits = scalar(@sorted);
>     > >>>> +
>     > >>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
>     > >>>> +    $string .= $sorted[$i];
>     > >>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
>     > >>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
>     > >>>> +    }
>     > >>>> +    $string .= ",";
>     > >>>> +  }
>     > >>>> +
>     > >>>> +  $string =~ s/,$//;
>     > >>>> +
>     > >>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated
>     Total
>     > >>> Hits: $deduplicated_hits)";
>     > >>>> +
>     > >>>> +  return $string;
>     > >>>>  }
>     > >>>>
>     > >>>>
>     > >>> 
> #####################################################################
>     ######
>     > >>>
>     > >>>
>     > >>>
>     > >>
>     > >
>     >
>     > --
>     > Kevin A. McGrail
>     > Member, Apache Software Foundation
>     > Chair Emeritus Apache SpamAssassin Project
>     > [8]https://www.linkedin.com/in/kmcgrail - 703.798.0171
>     >
> 
>     > Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>     > ===================================================================
>     > --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>     > +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>     > @@ -398,7 +398,7 @@
>     >    dbg("check: is spam? score=".$self->{score}.
>     >                          " required=".$self->{conf}->{required_score});
>     >    dbg("check: tests=".$self->get_names_of_tests_hit());
>     > -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
>     > +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
>     >    $self->{is_spam} = $self->is_spam();
>     > 
>     >    $self->{main}->{resolver}->bgabort();
>     > @@ -764,12 +764,52 @@
>     >  normally-hidden rules, which score 0 and have names beginning with two
>     >  underscores, used in meta rules.
>     > 
>     > +If a parameter of dbg is passed, the output will be more condensed and
>     > +sub-tests with multiple hits reduced to one entry with the number of
>     hits
>     > +in parentheses. Some information is also added at the end regarding the
>     > +multiple hits.
>     > +
>     >  =cut
>     > 
>     >  sub get_names_of_subtests_hit {
>     > -  my ($self) = @_;
>     > +  my ($self, $mode) = @_;
>     > 
>     > -  return join(',', sort @{$self->{subtest_names_hit}});
>     > +  if (defined $mode && $mode eq 'dbg') {
>     > +    #This routine prints only one instance of a subrule hit with a 
> count
>     of how many times it hit if greater than 1
>     > +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
>     $total_hits, $deduplicated_hits); 
>     > + 
>     > +    $total_hits = scalar(@{$self->{subtest_names_hit}});
>     > + 
>     > +    for ($i=0; $i < $total_hits; $i++) {
>     > +      $rule = ${$self->{subtest_names_hit}}[$i];
>     > +      $subtest_names_hit{$rule}++;
>     > +    }
>     > + 
>     > +    foreach $key (keys %subtest_names_hit) {
>     > +      push (@keys, $key);
>     > +    }
>     > +    @sorted = sort @keys;
>     > + 
>     > +    $deduplicated_hits = scalar(@sorted);
>     > + 
>     > +    for ($i=0; $i < $deduplicated_hits; $i++) {
>     > +      $string .= $sorted[$i];
>     > +      if ($subtest_names_hit{$sorted[$i]} > 1) {
>     > +        $string .= "($subtest_names_hit{$sorted[$i]})"
>     > +      }
>     > +      $string .= ",";
>     > +    }
>     > + 
>     > +    $string =~ s/,$//;
>     > + 
>     > +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total
>     Hits: $deduplicated_hits)";
>     > + 
>     > +    return $string;
>     > +
>     > +  } else {
>     > +    #return the simpler string with duplicates and commas
>     > +    return join(',', sort @{$self->{subtest_names_hit}});
>     > +  }
>     >  }
>     > 
>     >  
> ########################################################################
>     ###
> 
> 
> 
> On Thu, Jun 6, 2019, 01:30 Henrik K <[9]h...@hege.li> wrote:
> 
> 
>     Well in theory you see _more_ debug output now when there are no
>     duplicates,
>     due to the stats string..  honestly atleast I wouldn't care about that.
>     Feel free to vote.
> 
>     As a silly morning exercise, here's a one-liner that compacts stuff :-P
> 
>     my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
>     my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;
> 
>     __A,__B,__C(3),__CC,__D(2),__E(2)
> 
> 
>     On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
>     > Good point, Henrik & John.
>     >
>     > OK, I've left the output alone except for the calls from dbg so it
>     > shouldn't break anything in the public interface.
>     >
>     > Thoughts on this version?
>     >
>     > Regards,
>     > KAM
>     >
>     > On 6/4/2019 1:51 PM, John Hardin wrote:
>     > > On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
>     > >
>     > >> Yes, I was thinking about that and wanting to fix uritests so well
>     > >> for the
>     > >> template.   Thanks for the feedback.  I will take another pass at it.
>     > >
>     > > Just do the deduplication without modifying the output format.
>     > >
>     > > If we want to log the hit counts, then make another function that does
>     > > what you did and use it for logging.
>     > >
>     > >
>     > >> On Tue, Jun 4, 2019, 03:23 Henrik K <[10]h...@hege.li> wrote:
>     > >>
>     > >>>
>     > >>> If you want to modify debug output, you have to modify only the 
> dbg()
>     > >>> output
>     > >>> itself.  You can't modify internal functions that have specific
>     output
>     > >>> formats and start adding random strings to them.  Atleast these
>     places
>     > >>> depend on the comma delimited rules:
>     > >>>
>     > >>> ./masses/mass-check:    push @tests, split(/,/,
>     > >>> $status->get_names_of_subtests_hit());
>     > >>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 }
>     > >>> split(/,/,$msg->get_names_of_tests_hit()),
>     > >>> split(/,/,$msg->get_names_of_subtests_hit());
>     > >>> ./t.rules/run:  my $testsline =
>     > >>> $status->get_names_of_tests_hit().",".$status->
>     get_names_of_subtests_hit();
>     > >>>
>     > >>>
>     > >>>
>     > >>>
>     > >>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
>     > >>>> Morning All,
>     > >>>>
>     > >>>> After a few thoughts on limits, it appears that any duplicate
>     subtest
>     > >>>> hits are best combined for debug output.
>     > >>>>
>     > >>>> Any thoughts on the attached?  It looks like it will help me with
>     rule
>     > >>>> development while support rules with valid but large maxhits like
>     > >>> __LOWER_E
>     > >>>>
>     > >>>> Regards,
>     > >>>> KAM
>     > >>>>
>     > >>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
>     > >>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
>     > >>>>>
>     > >>>>>> I was curious if anyone noticed the debug output for subtests has
>     > >>> gotten
>     > >>>>>> insane:
>     > >>>>>
>     > >>>>> It got a little discussion on users@ when I created those rules.
>     > >>>>>
>     > >>>>> [...]
>     > >>>>>
>     > >>>>>> [11]72_active.cf:    body            __LOWER_E       /e/
>     > >>>>>> [12]72_active.cf:    tflags          __LOWER_E       multiple
>     > >>>>>> maxhits=230
>     > >>>>>>
>     > >>>>>> [13]72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
>     > >>>>>> [14]72_active.cf:    tflags          __E_LIKE_LETTER multiple
>     > >>>>>> maxhits=320
>     > >>>>>>
>     > >>>>>> Assuming those maxhits are correct,
>     > >>>>>
>     > >>>>> They are. In fact they were carefully tuned to catch the targeted
>     > >>>>> extortion spam.
>     > >>>>>
>     > >>>>>> maybe we need something in the debug
>     > >>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
>     > >>>>>
>     > >>>>> That would be a useful enhancement even without my flagrant log
>     > >>>>> vandalism.
>     > >>>>>
>     > >>>>
>     > >>>> --
>     > >>>> Kevin A. McGrail
>     > >>>> Member, Apache Software Foundation
>     > >>>> Chair Emeritus Apache SpamAssassin Project
>     > >>>> [15]https://www.linkedin.com/in/kmcgrail - 703.798.0171
>     > >>>>
>     > >>>
>     > >>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>     > >>>> ===================================================================
>     > >>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>     > >>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>     > >>>> @@ -769,7 +769,38 @@
>     > >>>>  sub get_names_of_subtests_hit {
>     > >>>>    my ($self) = @_;
>     > >>>>
>     > >>>> -  return join(',', sort @{$self->{subtest_names_hit}});
>     > >>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
>     > >>>> +
>     > >>>> +  #This routine prints only one instance of a subrule hit with a
>     > >>>> count
>     > >>> of how many times it hit if greater than 1
>     > >>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, 
> $rule,
>     > >>> $total_hits, $deduplicated_hits);
>     > >>>> +
>     > >>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
>     > >>>> +
>     > >>>> +  for ($i=0; $i < $total_hits; $i++) {
>     > >>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
>     > >>>> +    $subtest_names_hit{$rule}++;
>     > >>>> +  }
>     > >>>> +
>     > >>>> +  foreach $key (keys %subtest_names_hit) {
>     > >>>> +    push (@keys, $key);
>     > >>>> +  }
>     > >>>> +  @sorted = sort @keys;
>     > >>>> +
>     > >>>> +  $deduplicated_hits = scalar(@sorted);
>     > >>>> +
>     > >>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
>     > >>>> +    $string .= $sorted[$i];
>     > >>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
>     > >>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
>     > >>>> +    }
>     > >>>> +    $string .= ",";
>     > >>>> +  }
>     > >>>> +
>     > >>>> +  $string =~ s/,$//;
>     > >>>> +
>     > >>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated
>     Total
>     > >>> Hits: $deduplicated_hits)";
>     > >>>> +
>     > >>>> +  return $string;
>     > >>>>  }
>     > >>>>
>     > >>>>
>     > >>> 
> #####################################################################
>     ######
>     > >>>
>     > >>>
>     > >>>
>     > >>
>     > >
>     >
>     > --
>     > Kevin A. McGrail
>     > Member, Apache Software Foundation
>     > Chair Emeritus Apache SpamAssassin Project
>     > [16]https://www.linkedin.com/in/kmcgrail - 703.798.0171
>     >
> 
>     > Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>     > ===================================================================
>     > --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>     > +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>     > @@ -398,7 +398,7 @@
>     >    dbg("check: is spam? score=".$self->{score}.
>     >                          " required=".$self->{conf}->{required_score});
>     >    dbg("check: tests=".$self->get_names_of_tests_hit());
>     > -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
>     > +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
>     >    $self->{is_spam} = $self->is_spam();
>     > 
>     >    $self->{main}->{resolver}->bgabort();
>     > @@ -764,12 +764,52 @@
>     >  normally-hidden rules, which score 0 and have names beginning with two
>     >  underscores, used in meta rules.
>     > 
>     > +If a parameter of dbg is passed, the output will be more condensed and
>     > +sub-tests with multiple hits reduced to one entry with the number of
>     hits
>     > +in parentheses. Some information is also added at the end regarding the
>     > +multiple hits.
>     > +
>     >  =cut
>     > 
>     >  sub get_names_of_subtests_hit {
>     > -  my ($self) = @_;
>     > +  my ($self, $mode) = @_;
>     > 
>     > -  return join(',', sort @{$self->{subtest_names_hit}});
>     > +  if (defined $mode && $mode eq 'dbg') {
>     > +    #This routine prints only one instance of a subrule hit with a 
> count
>     of how many times it hit if greater than 1
>     > +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
>     $total_hits, $deduplicated_hits); 
>     > + 
>     > +    $total_hits = scalar(@{$self->{subtest_names_hit}});
>     > + 
>     > +    for ($i=0; $i < $total_hits; $i++) {
>     > +      $rule = ${$self->{subtest_names_hit}}[$i];
>     > +      $subtest_names_hit{$rule}++;
>     > +    }
>     > + 
>     > +    foreach $key (keys %subtest_names_hit) {
>     > +      push (@keys, $key);
>     > +    }
>     > +    @sorted = sort @keys;
>     > + 
>     > +    $deduplicated_hits = scalar(@sorted);
>     > + 
>     > +    for ($i=0; $i < $deduplicated_hits; $i++) {
>     > +      $string .= $sorted[$i];
>     > +      if ($subtest_names_hit{$sorted[$i]} > 1) {
>     > +        $string .= "($subtest_names_hit{$sorted[$i]})"
>     > +      }
>     > +      $string .= ",";
>     > +    }
>     > + 
>     > +    $string =~ s/,$//;
>     > + 
>     > +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total
>     Hits: $deduplicated_hits)";
>     > + 
>     > +    return $string;
>     > +
>     > +  } else {
>     > +    #return the simpler string with duplicates and commas
>     > +    return join(',', sort @{$self->{subtest_names_hit}});
>     > +  }
>     >  }
>     > 
>     >  
> ########################################################################
>     ###
> 
> 
> 
> References:
> 
> [1] mailto:h...@hege.li
> [2] mailto:h...@hege.li
> [3] http://72_active.cf/
> [4] http://72_active.cf/
> [5] http://72_active.cf/
> [6] http://72_active.cf/
> [7] https://www.linkedin.com/in/kmcgrail
> [8] https://www.linkedin.com/in/kmcgrail
> [9] mailto:h...@hege.li
> [10] mailto:h...@hege.li
> [11] http://72_active.cf/
> [12] http://72_active.cf/
> [13] http://72_active.cf/
> [14] http://72_active.cf/
> [15] https://www.linkedin.com/in/kmcgrail
> [16] https://www.linkedin.com/in/kmcgrail

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Reply via email to