>HTML::Strip

HTML parsing to get the text parts has nothing to do with HTML en-/decoding.
>iso-8859-1

ASSP processes all content as UTF-8.

>&shy;

ASSP is aware of this - it replaces soft hyphens with hard hyphens, and multiple consecutive hard hyphens with a single one. However, the option to remove the soft hyphens instead sounds somewhat better. Tests are still running.

>My thinking is that if it doesn't display.....

ASSP doesn't know whether something is displayed or not (and will never know it).

>I suspect that other characters will be abused in the same way

Several BIG5, numerical and other Unicode characters are already handled specially by ASSP. Other control characters are ignored by ASSP. Everything is converted to UTF-8, Unicode-normalized (including grapheme clusters), stemmed and simplified.

>This kind of obfuscation goes hand in hand with my previous questions about considering some non-Latin characters that look like Latin characters as those Latin alphabet characters.

With some Unicode knowledge, some help from the analyzer and some regex knowledge, such things are easy to find. For example:

<<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>>

finds a sequence where Cyrillic characters (a p b ....) are used inside otherwise non-Cyrillic words - a trick commonly used by spammers.

Thomas

From: "K Post" <nntp.p...@gmail.com>
To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date: 06.09.2022 16:16
Subject: [Assp-test] soft hyphen fooling Bayesian analysis

Is there a way to improve the way that ASSP parses certain special, non-printing characters?

I'm having trouble with spam emails slipping through whose bodies are heavily obfuscated with "soft hyphens". They all seem to have multipart bodies: first an iso-8859-1 text part with =AD interspersed in words, and then an HTML part with &shy; all over the place. Both encode the "soft hyphen", a hyphen that only prints if it is needed to break the word to the next line. It's clever: the user doesn't see the character, but ASSP thinks it's a word boundary.
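[Editor's note] The word-boundary effect described above is easy to reproduce. A minimal Python sketch (ASSP itself is Perl, and its real tokenizer is far more elaborate; the letters-only tokenizer here is a simplified stand-in):

```python
import re

# The obfuscated sentence from the report, with real soft hyphens (U+00AD)
# inserted mid-word, as they appear after decoding =AD or &shy;.
text = "This is a sentence with sp\u00adammy p\u00adhr\u00adase."

# Naive tokenizer: split on anything that is not an ASCII letter.
# The soft hyphen is not a letter, so each spammy word is shredded
# into short, meaningless fragments.
naive_tokens = re.findall(r"[A-Za-z]+", text)
print(naive_tokens)  # ends with: 'sp', 'ammy', 'p', 'hr', 'ase'

# Removing the soft hyphens before tokenizing restores the real words.
clean_tokens = re.findall(r"[A-Za-z]+", text.replace("\u00ad", ""))
print(clean_tokens)  # ends with: 'spammy', 'phrase'
```

The fragments are exactly the tokens the Bayesian analyzer ends up scoring, which is why the obfuscation works.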
The first part

Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

will be plain text and have spammy words with =AD inserted in the middle of them. For example, "This is a sentence with spammy phrase." could be written something like

This is a sentence with sp=ADammy p=ADhr=ADase.

The next MIME part is the HTML, which does the same thing but uses &shy; (the HTML entity for the soft hyphen) mid-word. So, something like:

<p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>

The whole body of the message is filled with these soft hyphens anywhere there are spammy words/phrases, and in many cases there are soft hyphens every couple of letters across the entire body. When I do an analysis, it appears that the soft hyphen tricks ASSP into thinking that each part of the word is a separate word, so for sp&shy;ammy p&shy;hr&shy;ase it thinks the words are

sp ammy p hr ase

I am using HTML::Strip. Would TreeBuilder work better? I'm concerned about performance there. Is there a way (and is it a good idea) to somehow instruct ASSP to treat certain HTML special characters as ones to ignore, and others as word separators? My thinking is that if it doesn't display, then it should be ignored when doing Bayesian / HMM evaluation.

https://cs.stanford.edu/people/miles/iso8859.html has a bunch of control characters and special characters that don't print - or, in the case of the soft hyphen, only print when the containing word is at the end of a line. I suspect that other characters will be abused in the same way.

This kind of obfuscation goes hand in hand with my previous questions about considering some non-Latin characters that look like Latin characters as those Latin alphabet characters.

Thanks

[Attachment "attz351u.txt" deleted by Thomas Eckardt/eck]
[Attachment "att8gq15.txt" deleted by Thomas Eckardt/eck]
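[Editor's note] The two countermeasures discussed in this thread - removing soft hyphens from both MIME parts before tokenization, and flagging mixed Latin/Cyrillic words - can be sketched in Python. ASSP itself is Perl; `normalize_part` and the crude tag strip below are illustrative assumptions, not ASSP's code, and Thomas's `\p{Cyrillic}` pattern is rewritten with an explicit code-point range because Python's stdlib `re` has no `\p{...}` support:

```python
import html
import quopri
import re

SOFT_HYPHEN = "\u00ad"

def normalize_part(body: bytes, is_html: bool) -> str:
    """Decode quoted-printable, strip tags/entities, drop soft hyphens.

    Hypothetical helper for this demo only - not an ASSP function.
    """
    text = quopri.decodestring(body).decode("iso-8859-1")
    if is_html:
        text = re.sub(r"<[^>]+>", " ", text)  # crude tag strip, demo only
        text = html.unescape(text)            # &shy; -> U+00AD
    return text.replace(SOFT_HYPHEN, "")

# The two obfuscated MIME parts from the original report.
plain_part = b"This is a sentence with sp=ADammy p=ADhr=ADase."
html_part = b"<p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>"

print(normalize_part(plain_part, is_html=False))  # spammy words restored
print(normalize_part(html_part, is_html=True))

# Thomas's Perl pattern <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>>,
# approximated with the basic Cyrillic block U+0400..U+04FF:
mixed_script = re.compile(r"[A-Za-z][\u0400-\u04FF]+[A-Za-z]")
print(bool(mixed_script.search("p\u0430ypal")))  # Cyrillic U+0430 -> True
print(bool(mixed_script.search("paypal")))       # all Latin -> False
```

With the soft hyphens stripped before tokenization, the Bayesian/HMM stage sees "spammy" and "phrase" as whole tokens again instead of the fragments shown earlier in the thread.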
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test