I've been looking at zero-width chars and this evasion.  Look at KAM.cf,
search for ZWNJ and the KAM_CRIM rules, and see if that helps.
--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


On Tue, Oct 30, 2018 at 7:07 AM Cedric Knight <[email protected]> wrote:

> Hello
>
> I thought of submitting a patch via Bugzilla, but then decided to first
> ask and check that I understood the general principles of body checks,
> and SpamAssassin's current approach to Unicode. Apologies for the length
> of this message. I hope the main points make sense.
>
> A fair number of webcam bitcoin 'sextortion' scams have evaded detection
> and worried recipients because they include relevant credentials.
> (Incidentally, I assume the credentials and addresses are mostly from
> the 2012 LinkedIn breach, but someone on the RIPE abuse list reports
> that Mailman passwords were also used.) BITCOIN_SPAM_05 is catching some
> of this spam, but when writing body regexes to catch the wave around 16
> October, I noticed that my rules weren't matching because the source was
> liberally injected with invisible characters:
> Content preview:  I a<U+200C>m a<U+200C>wa<U+200C>re blabla is one of
> your pa<U+200C>ss. L<U+200C>ets   g<U+200C>et strai<U+200C>ght
> to<U+200C> po<U+200C>i<U+200C>nt. No<U+200C>t o<U+200C>n<U+200C>e
>
> These characters are encoded as decimal HTML entities (&#8204;) in the
> HTML part, and as UTF-8 byte sequences in the text/plain part.
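>
> For reference, that is just U+200C ZERO WIDTH NON-JOINER in two guises;
> a quick sanity check of my own (not taken from the message source):
>
>   perl -MEncode=encode -E 'say "&#", ord("\x{200C}"), "; as UTF-8: ",
>     unpack("H*", encode("UTF-8", "\x{200C}"))'
>   # should print: &#8204; as UTF-8: e2808c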
>
> Without working these characters into a body rule pattern, that pattern
> will not match, yet such Unicode 'format' characters barely affect
> display or legibility, if at all. This could be a more general concern
> about obfuscation. Invisible characters could be used to evade all the
> ADVANCE_FEE* rules for example. There are over 150 non-printing 'Format'
> characters in Unicode:
>
> https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Format:]
>
> I find it counterintuitive that such non-printing characters match
> [:print:] and [:graph:] rather than [:cntrl:], but this is how the
> classes are defined at:
> https://www.unicode.org/reports/tr18/#Compatibility_Properties
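>
> A quick check of my own on a recent perl bears this out (the comments
> show the expected output per the TR18 definitions above):
>
>   perl -E 'my $zwnj = "\x{200C}";
>            say "print: ", ($zwnj =~ /[[:print:]]/ ? 1 : 0);  # 1
>            say "graph: ", ($zwnj =~ /[[:graph:]]/ ? 1 : 0);  # 1
>            say "cntrl: ", ($zwnj =~ /[[:cntrl:]]/ ? 1 : 0);  # 0
>            say "Cf:    ", ($zwnj =~ /\p{Format}/  ? 1 : 0)'  # 1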
>
> As minor points: 'Format' excludes a couple of separator characters in
> the same range, which instead match [:space:]:
>
> https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:subhead=Format%20character:]
>
> Then there is the C1 [:cntrl:] set, which some MUAs may render
> silently, I think including the 0x9D matched by the recent
> __UNICODE_OBFU_ZW rule (what's the significance of UNICODE in the rule
> name?):
>
> https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Control:]
>
> Finally, there may be a case for also including 'almost invisible'
> narrow blanks like U+200A (&hairsp;), U+202F and maybe U+205F. The Perl
> Unicode database may not be completely up to date here: Perl 5.18
> doesn't recognise the U+061C, U+2066 and U+1BCA1 ranges as \p{Format},
> although 5.24 does.
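>
> Anyone who wants to check their own perl can run something like this
> (one-off illustration only) to see what the local Unicode database
> thinks:
>
>   perl -E 'printf("U+%04X %s\n", $_,
>              chr($_) =~ /\p{Format}/ ? "Format" : "not Format")
>            for 0x61C, 0x200C, 0x2066, 0x1BCA1'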
>
> I've also seen many format characters in legitimate email, including in
> the middle of 7-bit ASCII text. Google uses 0xFEFF (BOM) as a zero-width
> word joiner (a use deprecated since 2002), and U+200C apparently occurs
> in corporate sigs. So their mere presence isn't much evidence of
> obfuscation. I presume they may also prevent legitimate patterns from
> being matched, including by Bayes.
>
> So my patch was going to be something to eliminate Format characters
> from get_rendered_body_text_array(), like:
> --- lib/Mail/SpamAssassin/Message.pm    (revision 1844922)
> +++ lib/Mail/SpamAssassin/Message.pm    (working copy)
> @@ -1167,6 +1167,8 @@
>    $text =~ s/\n+\s*\n+/\x00/gs;                # double newlines => null
>  # $text =~ tr/ \t\n\r\x0b\xa0/ /s;     # whitespace (incl. VT, NBSP) => space
>  # $text =~ tr/ \t\n\r\x0b/ /s;         # whitespace (incl. VT) => single space
> +  # do not render zero-width Unicode characters used as obfuscation:
> +  $text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//gs;
>    $text =~ s/\s+/ /gs;                 # Unicode whitespace => single space
>    $text =~ tr/\x00/\n/;                        # null => newline
>
> One problem here is that I'm not clear at this point whether $text is
> intended to be a character string (UTF8 flag set) or a byte string; the
> code immediately following tests this with `if utf8::is_utf8($text)`.
> \p{Format} includes U+00AD (SOFT HYPHEN), which is also a continuation
> byte in the UTF-8 encoding of letters such as 'í' (LATIN SMALL LETTER I
> WITH ACUTE), so it might be incorrectly removed if $text is a byte
> string.
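>
> A toy demonstration of my own (not part of the patch) of why the
> distinction matters; the comments show what I would expect:
>
>   use strict; use warnings;
>   use Encode qw(encode decode);
>
>   my $bytes = encode('UTF-8', "d\x{ED}a");     # "día" as bytes: 64 c3 ad 61
>   (my $mangled = $bytes) =~ s/\p{Format}//g;   # byte 0xAD looks like U+00AD SOFT HYPHEN (Cf)
>   print unpack("H*", $mangled), "\n";          # 64c361 -- no longer valid UTF-8
>
>   my $chars = decode('UTF-8', $bytes);         # the same text as a character string
>   (my $kept = $chars) =~ s/\p{Format}//g;      # U+00ED is a letter, so nothing is removed
>   printf "U+%04X\n", ord(substr($kept, 1, 1)); # U+00ED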
>
> Prior to SA 3.4.1, it seems body rules would sometimes match against a
> character string and sometimes against a byte string. This is mentioned
> in bug 7490, where a single '.' was matching 'á' until SA 3.4.1. As a
> postscript to that bug, I suspect what was happening was that
> 'normalize_charset 1' was set and _normalize() attempted
> utf8::downgrade() but failed, perhaps because the message contained some
> non-Latin-1 text.
>
> On the other hand, will `s/\s+/ /gs` fail to normalise all Unicode
> [:blank:] characters correctly unless $text is marked as a character
> string? What are the design decisions here? Can I find them on this
> list, the wiki or elsewhere? Also, what is the approach to the 7-bit
> control characters [\x00-\x1f\x7f]?
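>
> To make the \s question concrete, this is the kind of difference I
> mean (again a toy test of my own; comments show what I would expect):
>
>   use strict; use warnings;
>   use Encode qw(decode);
>
>   my $bytes = "a\xC2\xA0b";                  # 'a', NBSP (UTF-8 encoded), 'b'
>   (my $b = $bytes) =~ s/\s+/ /gs;
>   print unpack("H*", $b), "\n";              # 61c2a062 -- NBSP left alone
>
>   my $chars = decode('UTF-8', $bytes);       # character string, UTF8 flag set
>   (my $c = $chars) =~ s/\s+/ /gs;
>   printf "U+%04X\n", ord(substr($c, 1, 1));  # U+0020 -- NBSP collapsed to a space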
>
> Here are some significant commits that seem to be work to make the
> process of decoding and rendering more reliable and more like email
> client display, but which don't solve the format character issue:
>
> http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Message.pm?r1=1707582&r2=1707597
>
> http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Message/Node.pm?r1=1749286&r2=1749798
>
> IMHO it would be nice if it were possible to change related behaviour
> via a plugin, at the parsed_metadata() or start_rules() hook, but AFAICS
> there is no way for a plugin to alter the rendered message.  You can use
> `replace_rules`/`replace_tag` to pre-process a rule (this fuzziness has
> the advantage that the same code-point may obfuscate, say, both I and L,
> but doesn't help much with invisible characters at the moment). However,
> there is nothing to pre-process and canonicalise the text being matched
> to simplify rule writing.
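>
> For what it's worth, the closest I can get today is something like the
> following (rule and tag names invented for illustration, and it assumes
> the rendered body still contains the raw UTF-8 bytes, which is part of
> what I'm asking about):
>
>   loadplugin Mail::SpamAssassin::Plugin::ReplaceTags
>
>   replace_start <
>   replace_end   >
>
>   # <ZW> = zero or more zero-width characters (ZWSP/ZWNJ/ZWJ, UTF-8 encoded)
>   replace_tag   ZW   (?:\xe2\x80[\x8b-\x8d])*
>
>   body          MY_ZW_PASS   /o<ZW>n<ZW>e o<ZW>f yo<ZW>u<ZW>r pa<ZW>s<ZW>s/i
>   describe      MY_ZW_PASS   Zero-width obfuscation of "one of your pass"
>   replace_rules MY_ZW_PASS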
>
> I have often been unclear on what I need to do to get a body rule to
> match accented or Cyrillic characters, sometimes checking the byte
> stream in different encodings and transcribing to hex by hand. 'rawbody'
> rules should no doubt match the encoded 'raw' data, but I wonder if
> 'body' rules would work better if they concentrated on the meaning of
> the words without having to worry about multiple possible encodings and
> transmission systems. So if I can venture a radical suggestion, should
> body rules actually match against a character string, as they have
> sometimes been doing apparently unintentionally?  Could this be a
> configuration setting, as a function of or in addition to
> normalize_charset?
>
> Very little cannot be represented in a character string, which seems to
> be Perl's preferred model since version 5.8. Although there may be some
> obscure encodings that could require some work to decode, is it better
> to decode and normalise what can be decoded reasonably reliably, and
> represent the rest as Unicode code points with the same value as the
> bytes? (That should match \xNN for rare encodings.) Is there still a
> performance issue? To make such functionality (if enabled) as compatible
> as possible with existing rulesets, the Conf module might detect valid
> UTF-8 literals in body regexes and decode those, and where \xNN escape
> sequences (used by up to 62 subrules in the main rules) form valid
> contiguous UTF-8, decode them too. Where there are more complex
> sequences, such as __BENEFICIARY or \xef(?:\xbf[\xb9-\xbb]|\xbb\xbf),
> perhaps those should have been rawbody rules anyway, or should be
> rewritten to be encoding-independent and to avoid Unicode finesses like
> the Format characters?
>
> I'd be grateful for advice as to whether there's merit in filing these
> concerns as one or more issues on Bugzilla, or for relevant background.
>
> CK
>
