Re: [Assp-test] soft hyphen fooling Bayesian analysis

Thomas Eckardt Wed, 07 Sep 2022 06:13:57 -0700

If Unicode normalization (NFKC) doesn't meet your requirements, you can 
enable 'DoTransliterate', at the cost of some performance.

The "Unicode Technical Standard #39" (http://www.unicode.org/reports/tr39/) 
gives some more information, and 
https://www.unicode.org/Public/security/revision-05/intentional.txt shows 
a nice table for Cyrillic and Greek.
If someone expects an ASCII mail, those translations may help somewhat. But 
in all other cases (100% Cyrillic/Greek/...), such a character 
replacement is counterproductive (for example, not all Cyrillic letters 
have a valid Latin replacement).
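
To illustrate the difference between NFKC and look-alike replacement, here is a minimal Python sketch. It is not ASSP code and does not show what 'DoTransliterate' does internally. The LOOKALIKES table is a hypothetical, hand-picked subset for illustration only and is not read from the UTS #39 data files.

```python
import unicodedata

# Hypothetical, hand-picked subset of Cyrillic letters that resemble Latin
# ones. Illustration only; not taken from the UTS #39 data files.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}

def nfkc(text):
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)

def fold_lookalikes(text):
    """NFKC first, then map the look-alike letters listed above to Latin."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in nfkc(text))

fullwidth = "\uff33\uff50\uff41\uff4d"  # "Spam" written in fullwidth forms
mixed = "sp\u0430m"                     # "spam" with a Cyrillic "a"

print(nfkc(fullwidth))         # NFKC folds fullwidth forms to ASCII
print(nfkc(mixed) == mixed)    # NFKC leaves the Cyrillic letter untouched
print(fold_lookalikes(mixed))  # the extra mapping yields plain "spam"
```

The same mapping applied to genuinely Cyrillic text would rewrite real words into Latin letter sequences, which is the counterproductive case described above.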

> potentially treat look-alike characters as the latin character for 
> bayesian purposes

The HMM and Bayesian engines use heuristic mechanisms. Trying to 
treat single characters as Latin (or anything else) is not worth the 
effort. Within a short period of time, both engines will also have learned 
the obscured words (and word combinations).

Thomas

From:    "K Post" <nntp.p...@gmail.com>
To:      "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date:    06.09.2022 21:31
Subject: Re: [Assp-test] soft hyphen fooling Bayesian analysis

Eager to see what you come up with in terms of ignoring the soft hyphen.  

 Your <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>> regex is clear, and I 
understand using that for scoring purposes, but I'm looking for a way to 
potentially treat look-alike characters as the latin character for 
bayesian purposes and/or to catch commonly obscured words (like 
GeekSquad).  Is it okay if I reply further in my  August 1 post here to 
keep that in the same thread?

On Tue, Sep 6, 2022 at 2:06 PM Thomas Eckardt <thomas.ecka...@thockar.com> 
wrote:
>HTML::strip 

HTML parsing to extract the text parts has nothing to do with HTML decoding (or encoding) 

>iso-8859-1 
ASSP processes all content as UTF-8 

>&shy; 
ASSP is aware of this: it replaces soft hyphens with hard hyphens and 
collapses multiple consecutive hard hyphens into a single one. 
However, the option of removing the soft hyphens instead sounds somewhat 
better. Tests are still running. 
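
The two handling options described above can be sketched in Python as follows. This is an illustration of the idea only, not ASSP's actual implementation; the function names are made up for this example.

```python
import re

SOFT_HYPHEN = "\u00ad"

def replace_with_hard_hyphen(text):
    """Turn soft hyphens into '-' and collapse runs of '-' into one."""
    return re.sub(r"-{2,}", "-", text.replace(SOFT_HYPHEN, "-"))

def remove_soft_hyphen(text):
    """Drop soft hyphens so the obfuscated word is rejoined."""
    return text.replace(SOFT_HYPHEN, "")

sample = "sp\u00adammy p\u00ad\u00adhr\u00adase"

print(replace_with_hard_hyphen(sample))  # words stay split by '-'
print(remove_soft_hyphen(sample))        # words are rejoined
```

With replacement, the word fragments remain separated by a visible hyphen; with removal, the original words reappear intact.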

>My thinking is that if it doesn't display..... 
ASSP doesn't know whether something is displayed or not (and never will) 

>I suspect that other characters will be abused in the same way 
&nbsp; as well as several BIG5, numeric and other Unicode characters 
already get special handling in ASSP. Other control characters are ignored 
by ASSP. Everything is converted to UTF-8, Unicode normalized (including 
grapheme clusters), stemmed and simplified. 

>This kind of obfuscation goes hand in hand with my previous questions 
>about considering some non-Latin characters that look like Latin 
>characters as those Latin alphabet characters.  

With some Unicode knowledge, some help from the analyzer and some regex 
knowledge, such things are easy to find. 
For example: <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>> 
finds a sequence where Cyrillic letters (a p b ....) are used inside 
words, a trick commonly used by spammers 
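
For readers who want to try the pattern outside of Perl, here is a rough Python sketch. Python's built-in `re` module has no `\p{Cyrillic}` script property, so this version approximates it with the basic Cyrillic block U+0400 to U+04FF; that range is an assumption of this sketch and is narrower than the real script property.

```python
import re

# Approximation: basic Cyrillic block only, standing in for \p{Cyrillic}.
CYR = "\u0400-\u04FF"

# One non-Cyrillic character, one or more Cyrillic, one non-Cyrillic.
EMBEDDED_CYRILLIC = re.compile(f"[^{CYR}][{CYR}]+[^{CYR}]")

def find_embedded_cyrillic(text):
    """Return the first matching sequence, or None if there is none."""
    match = EMBEDDED_CYRILLIC.search(text)
    return match.group(0) if match else None

print(find_embedded_cyrillic("p\u0430ypal"))  # Cyrillic "a" inside a Latin word
print(find_embedded_cyrillic("paypal"))       # pure Latin, no match
```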

Thomas 

From:    "K Post" <nntp.p...@gmail.com> 
To:      "ASSP development mailing list" <assp-test@lists.sourceforge.net> 
Date:    06.09.2022 16:16 
Subject: [Assp-test] soft hyphen fooling Bayesian analysis 

Is there a way to improve the way that ASSP parses certain special, 
non-printing characters?  I'm having trouble with spam emails slipping 
through that have their body heavily obfuscated with "soft hyphens".  They 
all seem to have multipart bodies, first an iso-8859-1 text part with 
=AD interspersed in words and then an html part with &shy; all over the 
place.  These are the "soft hyphen," a hyphen that only prints if it is 
needed to break the word to the next line.  It's clever.  The user doesn't 
see the character, but ASSP thinks it's a word boundary.   

The first part 
Content-Type: text/plain; charset="iso-8859-1" 
Content-Transfer-Encoding: quoted-printable 
will be plain text, and have spammy words with =AD inserted in the 
middle of them. For example, "This is a sentence with spammy phrase." 
could be written something like  
This is a sentence with sp=ADammy p=ADhr=ADase. 
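
To see why =AD is a soft hyphen, the sample line can be decoded by hand. The following is a small Python sketch using only the standard library; it shows the decoding step itself, not how ASSP processes the MIME part.

```python
import quopri

raw = b"This is a sentence with sp=ADammy p=ADhr=ADase."

# Undo the quoted-printable transfer encoding, then interpret the bytes
# with the charset declared in the Content-Type header.
decoded = quopri.decodestring(raw).decode("iso-8859-1")

# Byte 0xAD in iso-8859-1 is U+00AD, the soft hyphen.
print(decoded.count("\u00ad"))
print(decoded.replace("\u00ad", ""))
```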

The next mime part is the html, which does the same thing, but uses &shy; 
(html for soft hyphen) mid-word.  So, something like: 
<p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p> 

The whole body of the message is filled with these soft hyphens anywhere 
that there's spammy words/phrases, and in many cases, there are soft 
hyphens every couple of letters across the entire body.  When I do an 
analysis, it appears that the soft hyphen tricks ASSP into thinking that 
each part of the word is a separate word, so for sp&shy;ammy 
p&shy;hr&shy;ase, it thinks the words are 
sp ammy p hr ase 
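
The splitting effect can be reproduced with a few lines of Python. The `\w+` tokenizer here is only a stand-in chosen for this sketch; ASSP's real word extraction is different and not shown in this thread.

```python
import html
import re

def tokens(text):
    """Naive tokenizer: runs of word characters (stand-in, not ASSP's)."""
    return re.findall(r"\w+", text)

fragment = "sp&shy;ammy p&shy;hr&shy;ase"

# &shy; decodes to U+00AD, which is not a word character, so it splits words.
decoded = html.unescape(fragment)

print(tokens(decoded))                       # fragments
print(tokens(decoded.replace("\u00ad", ""))) # whole words once U+00AD is removed
```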

I am using HTML::strip.  Would TreeBuilder work better?  I'm concerned 
about performance there. 

Is there a way (and is it a good idea) to somehow instruct ASSP to treat 
certain html special characters as ones to ignore, and others to be 
treated as a word separator?  My thinking is that if it doesn't display, 
then it should be ignored when doing bayesian / HMM evaluation. 

https://cs.stanford.edu/people/miles/iso8859.html has a bunch of Control 
Characters and Special Characters that don't print - or in the case of the 
soft hyphen, only print when the contained word is at the end of a line.  
I suspect that other characters will be abused in the same way. 

This kind of obfuscation goes hand in hand with my previous questions 
about considering some non-Latin characters that look like Latin 
characters as those Latin alphabet characters.  

Thanks 


_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test

