>HTML::Strip

HTML parsing to get the text parts has nothing to do with HTML en-/decoding.
>iso-8859-1

ASSP processes all content as UTF-8.

>&shy;

ASSP is aware of this - it replaces soft hyphens with hard hyphens, and multiple consecutive hard hyphens with a single one. However, the option to remove the soft hyphens instead sounds somewhat better. Tests are still running.

>My thinking is that if it doesn't display.....

ASSP doesn't know whether something is displayed or not (and will never know it).

>I suspect that other characters will be abused in the same way

Several BIG5, numerical and other Unicode characters are already handled specially by ASSP. Other control characters are ignored by ASSP. Everything is converted to UTF-8, Unicode-normalized (including grapheme clusters), stemmed and simplified.

>This kind of obfuscation goes hand in hand with my previous questions about considering some non-Latin characters that look like Latin characters as those Latin alphabet characters.

With some Unicode knowledge, some help from the analyzer and some regex knowledge, such things are easy to find. For example:

<<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>>

finds a sequence where Cyrillic characters (a p b ....) are used inside otherwise non-Cyrillic words - a trick commonly used by spammers.

Thomas

From: "K Post" <nntp.p...@gmail.com>
To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date: 06.09.2022 16:16
Subject: [Assp-test] soft hyphen fooling Bayesian analysis

Is there a way to improve the way that ASSP parses certain special, non-printing characters?

I'm having trouble with spam emails slipping through whose bodies are heavily obfuscated with "soft hyphens". They all seem to have multipart bodies: first an iso-8859-1 text part with =AD interspersed in words, and then an HTML part with &shy; all over the place. Both encode the "soft hyphen", a hyphen that only prints if it is needed to break the word to the next line. It's clever: the user doesn't see the character, but ASSP thinks it's a word boundary.
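[Editor's note] The word-boundary effect described above is easy to reproduce. A minimal Python sketch (ASSP itself is Perl, and its real tokenizer is far more elaborate; the letters-only tokenizer here is a simplified stand-in):

```python
import re

# The obfuscated sentence from the report, with real soft hyphens (U+00AD)
# inserted mid-word, as they appear after decoding =AD or &shy;.
text = "This is a sentence with sp\u00adammy p\u00adhr\u00adase."

# Naive tokenizer: split on anything that is not an ASCII letter.
# The soft hyphen is not a letter, so each spammy word is shredded
# into short, meaningless fragments.
naive_tokens = re.findall(r"[A-Za-z]+", text)
print(naive_tokens)  # ends with: 'sp', 'ammy', 'p', 'hr', 'ase'

# Removing the soft hyphens before tokenizing restores the real words.
clean_tokens = re.findall(r"[A-Za-z]+", text.replace("\u00ad", ""))
print(clean_tokens)  # ends with: 'spammy', 'phrase'
```

The fragments are exactly the tokens the Bayesian analyzer ends up scoring, which is why the obfuscation works.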
The first part

Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

will be plain text and have spammy words with =AD inserted in the middle of them. For example, "This is a sentence with spammy phrase." could be written something like

This is a sentence with sp=ADammy p=ADhr=ADase.

The next MIME part is the HTML, which does the same thing but uses &shy; (the HTML entity for the soft hyphen) mid-word. So, something like:

<p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>

The whole body of the message is filled with these soft hyphens anywhere there are spammy words/phrases, and in many cases there are soft hyphens every couple of letters across the entire body. When I do an analysis, it appears that the soft hyphen tricks ASSP into thinking that each part of the word is a separate word, so for sp&shy;ammy p&shy;hr&shy;ase it thinks the words are

sp ammy p hr ase

I am using HTML::Strip. Would TreeBuilder work better? I'm concerned about performance there. Is there a way (and is it a good idea) to somehow instruct ASSP to treat certain HTML special characters as ones to ignore, and others as word separators? My thinking is that if it doesn't display, then it should be ignored when doing Bayesian / HMM evaluation.

https://cs.stanford.edu/people/miles/iso8859.html has a bunch of control characters and special characters that don't print - or, in the case of the soft hyphen, only print when the containing word is at the end of a line. I suspect that other characters will be abused in the same way.

This kind of obfuscation goes hand in hand with my previous questions about considering some non-Latin characters that look like Latin characters as those Latin alphabet characters.

Thanks

[Attachment "attz351u.txt" deleted by Thomas Eckardt/eck]
[Attachment "att8gq15.txt" deleted by Thomas Eckardt/eck]
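[Editor's note] The two countermeasures discussed in this thread - removing soft hyphens from both MIME parts before tokenization, and flagging mixed Latin/Cyrillic words - can be sketched in Python. ASSP itself is Perl; `normalize_part` and the crude tag strip below are illustrative assumptions, not ASSP's code, and Thomas's `\p{Cyrillic}` pattern is rewritten with an explicit code-point range because Python's stdlib `re` has no `\p{...}` support:

```python
import html
import quopri
import re

SOFT_HYPHEN = "\u00ad"

def normalize_part(body: bytes, is_html: bool) -> str:
    """Decode quoted-printable, strip tags/entities, drop soft hyphens.

    Hypothetical helper for this demo only - not an ASSP function.
    """
    text = quopri.decodestring(body).decode("iso-8859-1")
    if is_html:
        text = re.sub(r"<[^>]+>", " ", text)  # crude tag strip, demo only
        text = html.unescape(text)            # &shy; -> U+00AD
    return text.replace(SOFT_HYPHEN, "")

# The two obfuscated MIME parts from the original report.
plain_part = b"This is a sentence with sp=ADammy p=ADhr=ADase."
html_part = b"<p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>"

print(normalize_part(plain_part, is_html=False))  # spammy words restored
print(normalize_part(html_part, is_html=True))

# Thomas's Perl pattern <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>>,
# approximated with the basic Cyrillic block U+0400..U+04FF:
mixed_script = re.compile(r"[A-Za-z][\u0400-\u04FF]+[A-Za-z]")
print(bool(mixed_script.search("p\u0430ypal")))  # Cyrillic U+0430 -> True
print(bool(mixed_script.search("paypal")))       # all Latin -> False
```

With the soft hyphens stripped before tokenization, the Bayesian/HMM stage sees "spammy" and "phrase" as whole tokens again instead of the fragments shown earlier in the thread.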
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test