I've been looking at zero-width chars and the evasion. Look at KAM.cf,
search for ZWNJ and the KAM_CRIM rules, and see if that helps.

--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus, Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171
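One way that kind of evasion can be handled (a sketch of the general
idea only, not the actual KAM.cf rules) is a pattern that tolerates
optional zero-width characters between the letters it is trying to
match, e.g. in Perl:

  use strict;
  use warnings;

  # zero-width characters commonly used for obfuscation:
  # ZWSP, ZWNJ, ZWJ, WORD JOINER, ZWNBSP/BOM
  my $zw = qr/[\x{200B}\x{200C}\x{200D}\x{2060}\x{FEFF}]*/;

  my $obfuscated = "pa\x{200C}ss";    # "pass" with a ZWNJ injected
  print "caught\n" if $obfuscated =~ /p${zw}a${zw}s${zw}s/;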
On Tue, Oct 30, 2018 at 7:07 AM Cedric Knight <[email protected]> wrote:
> Hello
>
> I thought of submitting a patch via Bugzilla, but then decided to first
> ask and check that I understood the general principles of body checks,
> and SpamAssassin's current approach to Unicode. Apologies for the length
> of this message. I hope the main points make sense.
>
> A fair number of webcam bitcoin 'sextortion' scams have evaded detection
> and worried recipients because they include relevant credentials.
> (Incidentally, I assume the credentials and addresses are mostly from
> the 2012 LinkedIn breach, but someone on the RIPE abuse list reports
> that Mailman passwords were also used). BITCOIN_SPAM_05 is catching some
> of this spam, but on writing body regexes to catch the wave around 16
> October, I noticed that my rules weren't matching because the source was
> liberally injected with invisible characters:
>
>   Content preview: I a<U+200C>m a<U+200C>wa<U+200C>re blabla is one of
>   your pa<U+200C>ss. L<U+200C>ets g<U+200C>et strai<U+200C>ght
>   to<U+200C> po<U+200C>i<U+200C>nt. No<U+200C>t o<U+200C>n<U+200C>e
>
> These characters are encoded as decimal HTML entities (&#8204;) and, in
> the text/plain part, as UTF-8 byte sequences.
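For concreteness (an illustration, not part of the quoted message):
&#8204; is the decimal entity for U+200C ZERO WIDTH NON-JOINER, which is
the UTF-8 byte sequence E2 80 8C. A quick check in Perl, assuming the
HTML::Entities module is available:

  use strict;
  use warnings;
  use HTML::Entities qw(decode_entities);
  use Encode qw(encode);

  my $char = decode_entities('&#8204;');   # the single character U+200C
  printf "U+%04X => bytes %s\n", ord($char),
      join ' ', map { sprintf '%02X', ord } split //, encode('UTF-8', $char);
  # prints: U+200C => bytes E2 80 8C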
> Without working these characters into a body rule pattern, that pattern
> will not match, yet such Unicode 'format' characters barely affect
> display or legibility, if at all. This could be a more general concern
> about obfuscation. Invisible characters could be used to evade all the
> ADVANCE_FEE* rules, for example. There are over 150 non-printing
> 'Format' characters in Unicode:
>
> https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Format:]
>
> I find it counterintuitive that such non-printing characters match
> [:print:] and [:graph:] rather than [:cntrl:], but this is how the
> classes are defined at:
>
> https://www.unicode.org/reports/tr18/#Compatibility_Properties
>
> As minor points, 'Format' excludes a couple of separator characters in
> the same range that instead match [:space:]:
>
> https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:subhead=Format%20character:]
>
> Then there is the C1 [:cntrl:] set, which some MUAs may render silently,
> I think including the 0x9D matched by the recent __UNICODE_OBFU_ZW
> (what's the significance of UNICODE in the rule name?):
>
> https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Control:]
>
> Finally, there may be a case for also treating narrow blanks like
> U+200A, U+202F and maybe U+205F as 'almost' invisible. The Perl Unicode
> database may not be completely up to date here, and Perl 5.18 doesn't
> recognise the U+061C, U+2066 and U+1BCA1 ranges as \p{Format}, although
> 5.24 does.
>
> I've also seen many format characters in legitimate email, including in
> the middle of 7-bit ASCII text. Google uses U+FEFF (the BOM) as a
> zero-width word joiner (a use deprecated since 2002), and U+200C
> apparently occurs in corporate sigs. So their mere presence isn't much
> evidence of obfuscation. I presume they may prevent legitimate patterns
> being matched, including by Bayes.
>
> So my patch was going to be something that eliminates Format characters
> from get_rendered_body_text_array(), like:
>
> --- lib/Mail/SpamAssassin/Message.pm  (revision 1844922)
> +++ lib/Mail/SpamAssassin/Message.pm  (working copy)
> @@ -1167,6 +1167,8 @@
>      $text =~ s/\n+\s*\n+/\x00/gs;        # double newlines => null
>      # $text =~ tr/ \t\n\r\x0b\xa0/ /s;   # whitespace (incl. VT, NBSP) => space
>      # $text =~ tr/ \t\n\r\x0b/ /s;       # whitespace (incl. VT) => single space
> +    # do not render zero-width Unicode characters used as obfuscation:
> +    $text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//gs;
>      $text =~ s/\s+/ /gs;                 # Unicode whitespace => single space
>      $text =~ tr/\x00/\n/;                # null => newline
>
> One problem here is that I'm not clear at this point whether $text is
> intended to be a character string (UTF8 flag set) or a byte string, and
> the code immediately following tests this with `if utf8::is_utf8($text)`.
> \p{Format} includes U+00AD (soft hyphen), which is also a continuation
> byte in UTF-8 encoding, such as in the letter 'í' (LATIN SMALL LETTER I
> WITH ACUTE), so it might be incorrectly removed if $text is a byte
> string.
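To make that soft-hyphen hazard concrete (a standalone sketch, not SA
code): applied to undecoded UTF-8 bytes, the substitution eats the 0xAD
continuation byte and corrupts the text, while on a decoded character
string it is harmless:

  use strict;
  use warnings;
  use Encode qw(encode);

  my $chars = "v\x{ED}ctor";               # "victor" with an i-acute, as characters
  my $bytes = encode('UTF-8', $chars);     # the same text as bytes: 76 C3 AD 63 ...

  # On the byte string, \xAD (the continuation byte of the i-acute) is
  # treated as U+00AD SOFT HYPHEN, which is \p{Format}, and is removed:
  (my $broken = $bytes) =~ s/\p{Format}//g;
  print "byte string corrupted\n" if $broken ne $bytes;

  # On the character string the same substitution leaves the text intact:
  (my $clean = $chars) =~ s/\p{Format}//g;
  print "character string intact\n" if $clean eq $chars;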
> Prior to SA 3.4.1, it seems that body rules would sometimes match
> against a character string and sometimes against a binary string. This
> is mentioned in bug 7490, where a single '.' was matching 'á' until
> version SA 3.4.1. As a postscript to that bug, I suspect what was
> happening was that 'normalize_charset 1' was set and _normalize() was
> attempting utf8::downgrade() but failed, perhaps because the message
> contained some non-Latin-1 text.
>
> On the other hand, will `s/\s+/ /gs` fail to normalise all Unicode
> [:blank:] characters correctly unless $text is marked as a character
> string? What are the design decisions here? Can I find them on this
> list, the wiki or elsewhere? Also, what is the approach to 7-bit
> characters [\x00-\x1f\x7f]?
>
> Here are some significant commits that seem to work towards making the
> process of decoding and rendering more reliable and more like email
> client display, but they don't solve the format character issue:
>
> http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Message.pm?r1=1707582&r2=1707597
>
> http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Message/Node.pm?r1=1749286&r2=1749798
>
> IMHO it would be nice if it were possible to change related behaviour
> via a plugin, at the parsed_metadata() or start_rules() hook, but AFAICS
> there is no way for a plugin to alter the rendered message. You can use
> `replace_rules`/`replace_tag` to pre-process a rule (this fuzziness has
> the advantage that the same code point may obfuscate, say, both I and L,
> but it doesn't help much with invisible characters at the moment).
> However, there is nothing to pre-process and canonicalise the text being
> matched so as to simplify rule writing.
>
> I have often been unclear on what I need to do to get a body rule to
> match accented or Cyrillic characters, sometimes checking the byte
> stream in different encodings and transcribing to hex by hand. 'rawbody'
> rules should no doubt match the encoded 'raw' data, but I wonder if
> 'body' rules would work better if they concentrated on the meaning of
> the words without having to worry about multiple possible encodings and
> transmission systems. So, if I can venture a radical suggestion, should
> body rules actually match against a character string, as they have
> sometimes been doing, apparently unintentionally? Could this be a
> configuration setting, as a function of or in addition to
> normalize_charset?
>
> Very little cannot be represented in a character string, which seems to
> be Perl's preferred model since version 5.8. Although there may be some
> obscure encodings that could require some work to decode, is it better
> to decode and normalise what can be decoded reasonably reliably, and to
> represent the rest as Unicode code points with the same value as the
> bytes? (That should match \xNN for rare encodings.) Is there still a
> performance issue? To make such functionality (if enabled) as compatible
> as possible with existing rulesets, the Conf module might detect valid
> UTF-8 literals in body regexes and decode those, and, where there are
> \xNN escape sequences (up to 62 subrules in the main rules), decode
> those too if they form valid contiguous UTF-8. Where there are more
> complex sequences, as in __BENEFICIARY or \xef(?:\xbf[\xb9-\xbb]|\xbb\xbf),
> perhaps those should have been rawbody rules anyway, or should be
> rewritten to be encoding-independent and to eliminate any finesses of
> Unicode like the Format characters?
>
> I'd be grateful for advice as to whether there's merit in filing these
> concerns as one or more issues on Bugzilla, or for relevant background.
>
> CK
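On the idea of the Conf module decoding UTF-8 literals and contiguous
\xNN escapes found in body regexes, a rough sketch (a hypothetical
helper, not existing SpamAssassin code) might look like:

  use strict;
  use warnings;
  use Encode;

  # Decode a byte-oriented rule pattern to characters when (and only
  # when) its bytes form valid UTF-8; otherwise leave it untouched.
  sub decode_pattern_if_utf8 {
      my ($pat) = @_;
      # turn literal \xNN escapes into the bytes they denote
      (my $bytes = $pat) =~ s/\\x([0-9a-fA-F]{2})/chr hex $1/ge;
      my $chars = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK) };
      return defined $chars ? $chars : $pat;
  }

  # "\xd0\xb1\xd0\xb8\xd1\x82" is the UTF-8 encoding of Cyrillic "бит"
  my $rule_fragment = "\\xd0\\xb1\\xd0\\xb8\\xd1\\x82coin";
  printf "decoded to %d characters\n",
      length(decode_pattern_if_utf8($rule_fragment));   # 7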
