Re: Evasion with Unicode format characters

2018-10-30 Thread Bill Cole

On 30 Oct 2018, at 7:07, Cedric Knight wrote:


> I'd be grateful for advice as to whether there's merit in filing these
> concerns as one or more issues on Bugzilla, or for relevant
> background.


I do not believe the codebase is the place to address these issues, 
which are addressable in carefully created rules. Because your approach 
would hide useful data patterns from rules, it is exactly the wrong way 
to go about "solving" a problem with a novel flavor of spam. As John & 
Kevin have noted, they have worked on the specific case of the extortion 
spams in publicly available rules. I also have an ancient bundle of 
rules that I've been adjusting for the modern world and existence 
outside of my idiosyncratic environment (where severe FPs are 
evaded/mitigated) which is promising and will be public in some way 
soon.


Also, a change this substantial in the core behavior of SA would be almost 
certain to NOT get into 3.4.3, which will be out soon and is likely to 
be dominant in production systems for some time despite the (coming 
soon) 4.0 release. If this were done in code rather than in rules, it 
would never be usable for sites not ready or able to go to 4.0.


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole


[Bug 7270] TxRep SQL duplicate entry errors in log

2018-10-30 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7270

Giovanni Bechis  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |WORKSFORME

--- Comment #3 from Giovanni Bechis  ---
In the database, the primary key consists of "username,email,signedby,ip", so a
duplicate key is possible only when the same email is submitted to SA more than
once.
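
A minimal sketch of how a composite primary key like this rejects a repeated row. It uses Python with an in-memory SQLite database purely for illustration; the table name `txrep`, the `msgcount` column and the sample values are assumptions, not the actual TxRep schema or data.

```python
import sqlite3

# Hypothetical table: only the four key columns come from the comment above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE txrep (
           username TEXT, email TEXT, signedby TEXT, ip TEXT,
           msgcount INTEGER,
           PRIMARY KEY (username, email, signedby, ip)
       )"""
)
insert = "INSERT INTO txrep VALUES (?, ?, ?, ?, ?)"
row = ("alice", "sender@example.com", "example.com", "192.0.2", 1)
conn.execute(insert, row)

# Inserting the same key tuple a second time violates the primary key.
try:
    conn.execute(insert, row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A row that differs in any one key column is accepted.
conn.execute(insert, ("alice", "sender@example.com", "example.com", "198.51.100", 1))
count = conn.execute("SELECT COUNT(*) FROM txrep").fetchone()[0]
print(duplicate_rejected, count)
```

Whether a given production backend reports this as a logged error or silently updates the row depends on the SQL the plugin issues, which this sketch does not try to reproduce.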

-- 
You are receiving this mail because:
You are the assignee for the bug.

Re: Tons of errors in todays masscheck

2018-10-30 Thread Kevin A. McGrail
I'd say the minute we get 3.4.3 out the door.  I sent you a private email
to see if we can chat about the big blocker for that?
--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


On Tue, Oct 30, 2018 at 2:06 PM Henrik K  wrote:

>
> On that note, does anyone want to entertain a timeline when a separate
> "stable" 4.0.0 branch would be created?  It's kind of hard to test things in
> trunk, when it's also used for masschecks daily.


Re: Tons of errors in todays masscheck

2018-10-30 Thread Henrik K


On that note, does anyone want to entertain a timeline when a separate
"stable" 4.0.0 branch would be created?  It's kind of hard to test things in
trunk, when it's also used for masschecks daily.

On Tue, Oct 30, 2018 at 08:03:05PM +0200, Henrik K wrote:
> 
> Sorry that was some of my trunk thingies, fixed it already today..


Re: Tons of errors in todays masscheck

2018-10-30 Thread Henrik K


Sorry that was some of my trunk thingies, fixed it already today..

On Tue, Oct 30, 2018 at 06:01:22PM +0100, Axb wrote:
> Seems he's missing:
> 
> echo "bayes_auto_learn 0" > spamassassin/user_prefs
> echo "use_bayes 0" >> spamassassin/user_prefs
> 
> in his masscheck script


Re: Tons of errors in todays masscheck

2018-10-30 Thread Axb

Seems he's missing:

echo "bayes_auto_learn 0" > spamassassin/user_prefs
echo "use_bayes 0" >> spamassassin/user_prefs

in his masscheck script

On 10/30/18 4:53 PM, Kevin A. McGrail wrote:

> Not good... emailing more lists...



Re: Tons of errors in todays masscheck

2018-10-30 Thread Kevin A. McGrail
Not good... emailing more lists...

On 10/30/2018 11:15 AM, Jari Fredriksson wrote:
> I had a kilometre long feedback mail from masscheckworker which ended like 
> this.
>
>
> locker: creating link 
> /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock 
> to 
> /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock.sa-ruleqa.c.sa-ruleqa.internal.12698
>  failed: File exists at 
> /home/jarif/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/Locker/UnixNFSSafe.pm
>  line 91.
> locker: creating link 
> /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock 
> to 
> /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock.sa-ruleqa.c.sa-ruleqa.internal.12694
>  failed: File exists at 
> /home/jarif/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/Locker/UnixNFSSafe.pm
>  line 91.
> locker: creating link 
> /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock 
> to 
> /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock.sa-ruleqa.c.sa-ruleqa.internal.12697
>  failed: File exists at 
> /home/jarif/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/Locker/UnixNFSSafe.pm
>  line 91.
> locker: creating link 
> /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock 
> to 
> /home/jarif/masscheckwork/nightly_mass_check/masses/spamassassin/bayes.lock.sa-ruleqa.c.sa-ruleqa.internal.12696
>  failed: File exists at 
> /home/jarif/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/Locker/UnixNFSSafe.pm
>  line 91.
> 11:09:51 up  1:09,  0 users,  load average: 14.76, 14.85, 14.35
> rsync -Pcqz  ham-jarif.log spam-jarif.log *munged*/
> 11:10:06 up  1:09,  0 users,  load average: 11.64, 14.16, 14.13
>

-- 
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171



[Bug 6728] DNSBLs need a way to turn off queries based on BLOCKED rules triggering

2018-10-30 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6728

--- Comment #32 from Henrik Krohns  ---
(In reply to Kevin A. McGrail from comment #31)
> Hah, I was just about to ask.  Do you have it described in the UPGRADE and
> release notes?

It's mentioned in UPGRADE and of course the option is perldocced. It's more of
an internal option anyway, doesn't make much sense to use beyond our own update
channel.

Good luck creating 4.0.0 release notes from scratch, I think even UPGRADE is
missing years of changes. :-D


[Bug 6728] DNSBLs need a way to turn off queries based on BLOCKED rules triggering

2018-10-30 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6728

--- Comment #31 from Kevin A. McGrail  ---
Hah, I was just about to ask.  Do you have it described in the UPGRADE and
release notes?


[Bug 6728] DNSBLs need a way to turn off queries based on BLOCKED rules triggering

2018-10-30 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6728

Henrik Krohns  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |4.0.0

--- Comment #30 from Henrik Krohns  ---
Just to clarify, it will be in 4.0.0.


[Bug 6728] DNSBLs need a way to turn off queries based on BLOCKED rules triggering

2018-10-30 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6728

Henrik Krohns  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #29 from Henrik Krohns  ---
After a few little fixes, I consider dns_block_rule to be working now. It's also
committed to rules. Resolving.


Re: Evasion with Unicode format characters

2018-10-30 Thread Kevin A. McGrail
I've been looking at zero-width chars and the evasion.  Look at KAM.cf and
search for the ZWNJ and KAM_CRIM rules to see if they help.
--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171



Evasion with Unicode format characters

2018-10-30 Thread Cedric Knight
Hello

I thought of submitting a patch via Bugzilla, but then decided to first
ask and check that I understood the general principles of body checks,
and SpamAssassin's current approach to Unicode. Apologies for the length
of this message. I hope the main points make sense.

A fair number of webcam bitcoin 'sextortion' scams have evaded detection
and worried recipients because of including relevant credentials.
(Incidentally, I assume the credentials and addresses are mostly from
the 2012 LinkedIn breach, but someone on the RIPE abuse list reports
Mailman passwords were also used). BITCOIN_SPAM_05 is catching some of
this spam, but on writing body regexes to catch the wave around 16
October, I noticed that my rules weren't matching because the source was
liberally injected with invisible characters:
Content preview:  I am aware blabla is one of
your pass. Lets   get straight
to point. Not one

These characters are encoded as decimal HTML entities  and in the
text/plain part as UTF-8 byte sequences.

Without working these characters into a body rule pattern, that pattern
will not match, yet such Unicode 'format' characters barely affect
display or legibility, if at all. This could be a more general concern
about obfuscation. Invisible characters could be used to evade all the
ADVANCE_FEE* rules for example. There are over 150 non-printing 'Format'
characters in Unicode:
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Format:]
I find it counterintuitive that such non-printing characters match
[:print:] and [:graph:] rather than [:cntrl:], but this is how the
classes are defined at:
https://www.unicode.org/reports/tr18/#Compatibility_Properties
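
A minimal sketch of the effect described above, written in Python rather than SpamAssassin's Perl. The pattern and sample text are hypothetical, and U+200C (ZERO WIDTH NON-JOINER) stands in for whatever invisible characters a given spam run injects. Python's `unicodedata.category()` reports General_Category, so `Cf` here corresponds to `\p{Format}`.

```python
import re
import unicodedata

# Hypothetical body-rule pattern and obfuscated sample text.
PATTERN = re.compile(r"one of your pass")
obfuscated = "one of y\u200cour pa\u200css"


def strip_format(text: str) -> str:
    """Drop every character whose General_Category is Cf (Format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# The invisible characters sit between the letters, so the plain pattern misses.
print(PATTERN.search(obfuscated) is not None)                # False
# After removing Format characters the same pattern matches.
print(PATTERN.search(strip_format(obfuscated)) is not None)  # True
```

The same stripping would also alter legitimate text that uses format characters on purpose, which is the trade-off raised elsewhere in this thread.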

As minor points, 'Format' excludes a couple of separator characters in
the same range that instead match [:space:]
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:subhead=Format%20character:]
Then there is the C1 [:cntrl:] set, which some MUAs may render
silently, I think including the 0x9D matched by the recent
__UNICODE_OBFU_ZW (what's the significance of UNICODE in the rule name?):
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Control:]
Finally, there may be a case for also including 'almost' invisible narrow
blanks like U+200A, U+202F and maybe U+205F. The Perl Unicode
database may not be completely up-to-date here, and Perl 5.18 doesn't
recognise the U+061C, U+2066 and U+1BCA1 ranges as \p{Format}, although 5.24
does.

I've also seen many format characters in legitimate email, including in
the middle of 7-bit ASCII text. Google uses 0xFEFF (BOM) as a zero-width
word joiner (use deprecated since 2002), and U+200C apparently occurs in
corporate sigs. So their mere presence isn't much evidence of
obfuscation. I presume they may prevent legitimate patterns being
matched, including by Bayes.

So my patch was going to be something to eliminate Format characters
from get_rendered_body_text_array() like:
--- lib/Mail/SpamAssassin/Message.pm    (revision 1844922)
+++ lib/Mail/SpamAssassin/Message.pm    (working copy)
@@ -1167,6 +1167,8 @@
   $text =~ s/\n+\s*\n+/\x00/gs;        # double newlines => null
 # $text =~ tr/ \t\n\r\x0b\xa0/ /s;     # whitespace (incl. VT, NBSP) => space
 # $text =~ tr/ \t\n\r\x0b/ /s;         # whitespace (incl. VT) => single space
+  # do not render zero-width Unicode characters used as obfuscation:
+  $text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//gs;
   $text =~ s/\s+/ /gs;                 # Unicode whitespace => single space
   $text =~ tr/\x00/\n/;                # null => newline

One problem here is that I'm not clear at this point whether $text is
intended to be a character string (UTF8 flag set) or a byte string, and
the code immediately following tests this with `if
utf8::is_utf8($text)`. \p{Format} includes U+00AD (soft hyphen), and 0xAD
is also a continuation byte in UTF-8 encodings such as that of the letter 'í'
(LATIN SMALL LETTER I WITH ACUTE), so it might be incorrectly removed if
$text is a byte string.
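
The byte-versus-character hazard can be sketched in Python, used here only as an illustration; Perl's handling of byte strings differs in detail, so this shows the underlying encoding fact rather than SpamAssassin's behaviour.

```python
i_acute = "\u00ed"                 # LATIN SMALL LETTER I WITH ACUTE
encoded = i_acute.encode("utf-8")  # two bytes: 0xC3 0xAD

# Character-level removal of U+00AD (soft hyphen) leaves the letter alone.
as_chars = i_acute.replace("\u00ad", "")

# Byte-level removal of 0xAD eats the continuation byte of the letter.
as_bytes = encoded.replace(b"\xad", b"")

print(encoded)               # b'\xc3\xad'
print(as_chars == i_acute)   # True
print(as_bytes)              # b'\xc3'

# What is left is no longer valid UTF-8.
try:
    as_bytes.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)           # False
```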

Prior to SA 3.4.1, it seems sometimes body rules would be matching
against a character string, and sometimes against a binary string. This
is mentioned in bug 7490, where a single '.' was matching 'á' until
version SA 3.4.1. As a postscript to that bug, I suspect what was
happening was 'normalize_charset 1' was set, and _normalize() was
attempting utf8::downgrade() but failed, perhaps because the message
contained some non-Latin-1 text.

On the other hand, will `s/\s+/ /gs` fail to normalise all Unicode
[:blank:] characters correctly unless $text is marked as a character
string? What are the design decisions here? Can I find them on this
list, the wiki or elsewhere? Also what is the approach to 7-bit
characters [\x00-\x1f\x7f] ?
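
For comparison, Python's `re` module draws a similar line between character and byte patterns. This is a sketch of the general question only, with hypothetical sample text, and it says nothing about what Perl or SpamAssassin actually do with `s/\s+/ /gs`.

```python
import re

# THIN SPACE (U+2009) and NO-BREAK SPACE (U+00A0) between ASCII words.
text = "get\u2009straight\u00a0to"
raw = text.encode("utf-8")

# str pattern: \s follows Unicode whitespace semantics.
collapsed_text = re.sub(r"\s+", " ", text)
# bytes pattern: \s only covers ASCII whitespace, so the UTF-8 encoded
# blanks are left untouched.
collapsed_raw = re.sub(rb"\s+", b" ", raw)

print(collapsed_text)        # get straight to
print(collapsed_raw == raw)  # True
```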

Here are some significant commits that seem to be work to make the process
of decoding and rendering more reliable and more like email client
display, but that don't solve the format character issue: