Cc'ing David, since I don't know if he is subscribed to dev@.

On Fri, 2014-05-30 at 14:24 -0400, Joe Quinn wrote:
> On 5/22/2014 7:48 PM, Karsten Bräckelmann wrote:
> > On Thu, 2014-05-22 at 18:34 -0500, David B Funk wrote:
> >> After doing some experimenting with that code I came up with something that
> >> I'd argue is more semantically correct:
> >>
> >>       # if we've got a long series of blank lines, limit them
> >>       if (defined $start) {
> >>         my $max_blank_lines = 20;
> >>         my $num = $start-$cnt;
> >>         if ($num > $max_blank_lines) {
> >>           splice @message, $cnt+2, $num-$max_blank_lines;
> >>         }
> >>         undef $start;
> >>       }
> >>
> >> IE limit a message to no more than "$max_blank_lines" in a row, not the 
> >> total
> >> collapse of more than 11. (adjust $max_blank_lines as you see fit or make 
> >> it
> >> a configurable parameter).
> > +1
> >
> > Can you file a bug report or raise the topic in dev@ list? The code
> > change is sufficiently simple, but I want that issue discussed first.
> > Wonder what's the reason for that collapsing in the first place.
> 
> 
> Forwarding this idea to the dev list for discussion.
> 
> jm wrote the original version of this code for this bug: 
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3712

Thanks Joe, good catch.

That bug still is limited to the Security group though. Given that the
bug has been fixed since 3.1 (and even 3.0.x), I'll open it to the
public. Due to limited severeness and occurring with bad third-party
rules only, this even has been mentioned back those days in comment 24.


> I see no issue with the change, given as this code is intended to trim 
> down thousands of newlines. I haven't tested it yet, though.

Yeah, reading the full bug discussion, that rewriting of consecutive
newlines was slightly exaggerated. In a nutshell:

* Bad regex patterns using unbound /.*/ matching and 10th of thousands
of newlines lead to huge memory usage. Bad REs are the real cause, which
needed to be fixed anyway.

* Limiting consecutive newlines to about 100 has been proposed in that
bug. Instead, a really low limit of 10 has been implemented.

* As seen in two recent threads on the users@ list, the actual total
collapsing is counter-productive. As David pointed out, the changed
empty line handling probably was not intended in the first place:
Instead of limiting the number of empty lines, it completely flames more
than 10 empty lines, shrinking it to 2.


I am in favor of changing that code as David proposed: Shrink lots of
consecutive newlines to an upper limit, removing only excess lines.

20 lines seems reasonable to me. 10 is definitely too low, whereas 100
feels slightly excessive.

The proposed code changes are sufficiently simple to not require a CLA.


> I am already seeing spammers adapt to this idea, and they are padding 
> the empty lines with single '/' and '\' characters.

Coincidence, I assume.

Padding empty lines makes it easy to detect again. The whole point is
that without padding, huge vertical space becomes very hard and costly
to detect in rules. So what you observe would be spammers abandoning a
working detection evasion technique, in favor of cheap regex rules
identifying them... Unlikely. ;)


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Reply via email to