Re: Evasion with Unicode format characters
On 30 Oct 2018, at 7:07, Cedric Knight wrote:

> I'd be grateful for advice as to whether there's merit in filing these
> concerns as one or more issues on Bugzilla, or for relevant background.

I do not believe the codebase is the place to address these issues, which are addressable in carefully created rules. Because your approach would hide useful data patterns from rules, it is exactly the wrong way to go about "solving" a problem with a novel flavor of spam.

As John & Kevin have noted, they have worked on the specific case of the extortion spams in publicly available rules. I also have an ancient bundle of rules that I've been adjusting for the modern world and for existence outside of my idiosyncratic environment (where severe FPs are evaded/mitigated); it is promising and will be public in some way soon.

Also, a change this substantial in the core behavior of SA would be almost certain NOT to get into 3.4.3, which will be out soon and is likely to be dominant in production systems for some time despite the (coming soon) 4.0 release. If this were done in code rather than in rules, it would never be usable for sites not ready or able to go to 4.0.

--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole
[Bug 7270] TxRep SQL duplicate entry errors in log
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7270

Giovanni Bechis changed:

           What|Removed |Added
         Status|NEW     |RESOLVED
     Resolution|---     |WORKSFORME

--- Comment #3 from Giovanni Bechis ---
In the database, the primary key is made up of "username,email,signedby,ip", so a duplicate key is possible only by submitting the same email to SA more than once.

--
You are receiving this mail because:
You are the assignee for the bug.
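As a rough illustration of the point in comment #3, here is a minimal Python sketch using an in-memory SQLite database. It is not the TxRep schema or any SpamAssassin code: the table name `txrep_demo`, the column types and the helper names are assumptions made up for the example. Only the four key columns come from the comment. It shows that a composite primary key rejects a row only when all four values repeat.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory table keyed on the four columns named in the bug."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE txrep_demo (
               username TEXT NOT NULL,
               email    TEXT NOT NULL,
               signedby TEXT NOT NULL,
               ip       TEXT NOT NULL,
               PRIMARY KEY (username, email, signedby, ip)
           )"""
    )
    return conn


def insert_entry(conn, username, email, signedby, ip):
    """Insert one row; return False if the composite key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO txrep_demo (username, email, signedby, ip) "
                "VALUES (?, ?, ?, ?)",
                (username, email, signedby, ip),
            )
        return True
    except sqlite3.IntegrityError:
        # Same (username, email, signedby, ip) tuple was inserted before.
        return False
```

Inserting the same tuple twice makes `insert_entry` return `False` the second time, while changing any one of the four values gives a distinct key that is accepted.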
Re: Tons of errors in todays masscheck
I'd say the minute we get 3.4.3 out the door. I sent you a private email to see if we can chat about the big blocker for that?

--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171

On Tue, Oct 30, 2018 at 2:06 PM Henrik K wrote:
> On that note, does anyone want to entertain a timeline when a separate
> "stable" 4.0.0 branch would be created? It's kind of hard to test things
> in trunk, when it's also used for masschecks daily.
Re: Tons of errors in todays masscheck
On that note, does anyone want to entertain a timeline when a separate "stable" 4.0.0 branch would be created? It's kind of hard to test things in trunk, when it's also used for masschecks daily.

On Tue, Oct 30, 2018 at 08:03:05PM +0200, Henrik K wrote:
> Sorry that was some of my trunk thingies, fixed it already today..
Re: Tons of errors in todays masscheck
Sorry that was some of my trunk thingies, fixed it already today..

On Tue, Oct 30, 2018 at 06:01:22PM +0100, Axb wrote:
> Seems he's missing:
>
> echo "bayes_auto_learn 0" > spamassassin/user_prefs
> echo "use_bayes 0" >> spamassassin/user_prefs
>
> in his masscheck script
Re: Tons of errors in todays masscheck
Seems he's missing:

echo "bayes_auto_learn 0" > spamassassin/user_prefs
echo "use_bayes 0" >> spamassassin/user_prefs

in his masscheck script

On 10/30/18 4:53 PM, Kevin A. McGrail wrote:
> Not good... emailing more lists...
Re: Tons of errors in todays masscheck
Not good... emailing more lists...

On 10/30/2018 11:15 AM, Jari Fredriksson wrote:
> I had a kilometre long feedback mail from masscheckworker which ended like this.
>
> locker: creating link /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock to /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock.sa-ruleqa.c.sa-ruleqa.internal.12698 failed: File exists at /home/jarif/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/Locker/UnixNFSSafe.pm line 91.
> locker: creating link /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock to /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock.sa-ruleqa.c.sa-ruleqa.internal.12694 failed: File exists at /home/jarif/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/Locker/UnixNFSSafe.pm line 91.
> locker: creating link /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock to /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock.sa-ruleqa.c.sa-ruleqa.internal.12697 failed: File exists at /home/jarif/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/Locker/UnixNFSSafe.pm line 91.
> locker: creating link /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock to /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock.sa-ruleqa.c.sa-ruleqa.internal.12696 failed: File exists at /home/jarif/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/Locker/UnixNFSSafe.pm line 91.
> 11:09:51 up 1:09, 0 users, load average: 14.76, 14.85, 14.35
> rsync -Pcqz ham-jarif.log spam-jarif.log *munged*/
> 11:10:06 up 1:09, 0 users, load average: 11.64, 14.16, 14.13

--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171
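For readers unfamiliar with the `locker:` lines quoted above, "creating link ... failed: File exists" is the kind of message that hard-link based locking produces when several workers compete for one lock file. The following Python sketch shows that general technique only. It is not the code in UnixNFSSafe.pm, and the function names and the per-owner suffix are assumptions chosen for the example.

```python
import os
import tempfile


def try_lock(lock_path: str, owner: str) -> bool:
    """Try to take lock_path by hard-linking a per-owner temp file to it."""
    tmp_path = f"{lock_path}.{owner}"
    with open(tmp_path, "w") as handle:
        handle.write(owner + "\n")
    try:
        # link() is atomic: it fails if lock_path already exists,
        # which is the "File exists" case seen in the log.
        os.link(tmp_path, lock_path)
        return True
    except FileExistsError:
        return False
    finally:
        # The temp name is no longer needed; a successful lock keeps
        # its own directory entry at lock_path.
        os.unlink(tmp_path)


def release_lock(lock_path: str) -> None:
    """Drop the lock so that another worker can take it."""
    os.unlink(lock_path)
```

With several masscheck workers sharing one Bayes database, all but one such attempt fail at any moment, which fits the suggestion in this thread to turn Bayes off in the masscheck `user_prefs`.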
[Bug 6728] DNSBLs need a way to turn off queries based on BLOCKED rules triggering
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6728

--- Comment #32 from Henrik Krohns ---
(In reply to Kevin A. McGrail from comment #31)
> Hah, I was just about to ask. Do you have it described in the UPGRADE and
> release notes?

It's mentioned in UPGRADE and of course the option is perldocced. It's more of an internal option anyway, doesn't make much sense to use beyond our own update channel.

Good luck creating 4.0.0 release notes from scratch, I think even UPGRADE is missing years of changes. :-D
[Bug 6728] DNSBLs need a way to turn off queries based on BLOCKED rules triggering
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6728

--- Comment #31 from Kevin A. McGrail ---
Hah, I was just about to ask. Do you have it described in the UPGRADE and release notes?
[Bug 6728] DNSBLs need a way to turn off queries based on BLOCKED rules triggering
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6728

Henrik Krohns changed:

              What|Removed   |Added
  Target Milestone|Undefined |4.0.0

--- Comment #30 from Henrik Krohns ---
Just to clarify, it will be in 4.0.0.
[Bug 6728] DNSBLs need a way to turn off queries based on BLOCKED rules triggering
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6728

Henrik Krohns changed:

           What|Removed |Added
         Status|NEW     |RESOLVED
     Resolution|---     |FIXED

--- Comment #29 from Henrik Krohns ---
After a few little fixes, I consider dns_block_rule now working. It's also committed to rules. Resolving.
Re: Evasion with Unicode format characters
I've been looking at Zero-Width chars and the evasion. Look at KAM.cf and search for the ZWNJ and KAM_CRIM rules and see if it helps.

--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171

On Tue, Oct 30, 2018 at 7:07 AM Cedric Knight wrote:
> BITCOIN_SPAM_05 is catching some of this spam, but on writing body
> regexes to catch the wave around 16 October, I noticed that my rules
> weren't matching because the source was liberally injected with
> invisible characters
Evasion with Unicode format characters
Hello

I thought of submitting a patch via Bugzilla, but then decided to first ask and check that I understood the general principles of body checks, and SpamAssassin's current approach to Unicode. Apologies for the length of this message. I hope the main points make sense.

A fair number of webcam bitcoin 'sextortion' scams have evaded detection and worried recipients because they include relevant credentials. (Incidentally, I assume the credentials and addresses are mostly from the 2012 LinkedIn breach, but someone on the RIPE abuse list reports Mailman passwords were also used.) BITCOIN_SPAM_05 is catching some of this spam, but on writing body regexes to catch the wave around 16 October, I noticed that my rules weren't matching because the source was liberally injected with invisible characters:

  Content preview: I am aware blabla is one of your pass. Lets get straight to point. Not one

These characters are encoded as decimal HTML entities and, in the text/plain part, as UTF-8 byte sequences.

Without working these characters into a body rule pattern, that pattern will not match, yet such Unicode 'format' characters barely affect display or legibility, if at all. This could be a more general concern about obfuscation. Invisible characters could be used to evade all the ADVANCE_FEE* rules, for example.
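The matching failure described above can be reproduced outside SpamAssassin. The Python sketch below is an illustration only, not SpamAssassin code and not the Perl patch discussed in this message. The pattern, the sample string and the function name are assumptions invented for the example. It removes every code point whose Unicode general category is Cf (Format) and leaves separators such as U+2028 and U+2029 alone.

```python
import re
import unicodedata


def strip_format_chars(text: str) -> str:
    """Remove every code point whose general category is Cf (Format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Hypothetical body pattern, loosely modelled on the content preview above.
PATTERN = re.compile(r"one of your pass", re.IGNORECASE)

# U+200C (zero width non-joiner), U+FEFF (BOM / zero width no-break space)
# and U+00AD (soft hyphen) injected between letters. All three are Cf.
obfuscated = "I am aware blabla is o\u200cne of yo\ufeffur pa\u00adss."

print(PATTERN.search(obfuscated))                      # no match
print(PATTERN.search(strip_format_chars(obfuscated)))  # match object
```

This works on a decoded character string. Applying the same idea to raw UTF-8 bytes would need different handling, because single bytes of a multi-byte sequence cannot be classified on their own.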
There are over 150 non-printing 'Format' characters in Unicode:
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Format:]

I find it counterintuitive that such non-printing characters match [:print:] and [:graph:] rather than [:cntrl:], but this is how the classes are defined at:
https://www.unicode.org/reports/tr18/#Compatibility_Properties

As minor points, 'Format' excludes a couple of separator characters in the same range that instead match [:space:]
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:subhead=Format%20character:]

Then there is the C1 [:cntrl:] set, which some MUAs may render silently, I think including the 0x9D matched by the recent __UNICODE_OBFU_ZW (what's the significance of UNICODE in the rule name?):
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Control:]

Finally, there may be a case for also including 'almost' invisible narrow blanks such as U+200A, U+202F and maybe U+205F. The Perl Unicode database may not be completely up-to-date here, and Perl 5.18 doesn't recognise U+061C, U+2066 and U+1BCA1 ranges as \p{Format}, although 5.24 does.

I've also seen many format characters in legitimate email, including in the middle of 7-bit ASCII text. Google uses 0xFEFF (BOM) as a zero-width word joiner (use deprecated since 2002), and U+200C apparently occurs in corporate sigs. So their mere presence isn't much evidence of obfuscation. I presume they may prevent legitimate patterns being matched, including by Bayes.

So my patch was going to be something to eliminate Format characters from get_rendered_body_text_array() like:

  --- lib/Mail/SpamAssassin/Message.pm (revision 1844922)
  +++ lib/Mail/SpamAssassin/Message.pm (working copy)
  @@ -1167,6 +1167,8 @@
     $text =~ s/\n+\s*\n+/\x00/gs;        # double newlines => null
   # $text =~ tr/ \t\n\r\x0b\xa0/ /s;     # whitespace (incl. VT, NBSP) => space
   # $text =~ tr/ \t\n\r\x0b/ /s;         # whitespace (incl. VT) => single space
  +  # do not render zero-width Unicode characters used as obfuscation:
  +  $text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//gs;
     $text =~ s/\s+/ /gs;                 # Unicode whitespace => single space
     $text =~ tr/\x00/\n/;                # null => newline

One problem here is that I'm not clear at this point if $text is intended to be a character string (UTF8 flag set) or a byte string, and the code immediately following tests this with `if utf8::is_utf8($text)`. \p{Format} includes U+00AD (soft hyphen), which is also a continuation byte in UTF-8 encoding such as in the letter 'í' (LATIN SMALL LETTER I WITH ACUTE), so might be incorrectly removed if $text is a byte string.

Prior to SA 3.4.1, it seems sometimes body rules would be matching against a character string, and sometimes against a binary string. This is mentioned in bug 7490, where a single '.' was matching 'á' until version SA 3.4.1. As a postscript to that bug, I suspect what was happening was 'normalize_charset 1' was set, and _normalize() was attempting utf8::downgrade() but failed, perhaps because the message contained some non-Latin-1 text.

On the other hand, will `s/\s+/ /gs` fail to normalise all Unicode [:blank:] characters correctly unless $text is marked as a character string?

What are the design decisions here? Can I find them on this list, the wiki or elsewhere? Also what is the approach to 7-bit characters [\x00-\x1f\x7f] ?

Here are some significant commits that seem to be work to make the process of decoding and rendering more reliable and more like email client display, but don't solve the format character issue: