If Unicode normalization (NFKC) doesn't meet your requirements, you can enable 'DoTransliterate', at the cost of some performance.
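To illustrate why NFKC alone may not be enough: it folds compatibility forms (for example fullwidth Latin letters) back to plain Latin, but it leaves Cyrillic or Greek look-alikes untouched, because those are distinct letters and not compatibility variants. The sketch below is Python, not ASSP code, and it does not claim to show what 'DoTransliterate' does internally. The `LOOKALIKES` table and the function names are hypothetical and deliberately tiny; a real table would be derived from the Unicode confusables data.

```python
import unicodedata

# Hypothetical, deliberately tiny look-alike table (Cyrillic/Greek -> Latin).
# It only exists to illustrate the idea; it is not ASSP's table.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}
_TABLE = str.maketrans(LOOKALIKES)


def nfkc(text):
    """Apply Unicode NFKC normalization only."""
    return unicodedata.normalize("NFKC", text)


def transliterate_lookalikes(text):
    """NFKC first, then map the look-alikes listed above to Latin."""
    return nfkc(text).translate(_TABLE)


fullwidth = "\uff27\uff45\uff45\uff4b"  # "Geek" written with fullwidth letters
mixed = "G\u0435\u0435k"                # "Geek" with two Cyrillic IE letters

print(nfkc(fullwidth) == "Geek")                  # NFKC folds fullwidth forms
print(nfkc(mixed) == "Geek")                      # NFKC keeps the Cyrillic letters
print(transliterate_lookalikes(mixed) == "Geek")  # the extra mapping folds them
```

The extra mapping step is the kind of work that costs additional time per message, and it only makes sense for mail that is expected to be Latin text.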
The "Unicode Technical Standard #39" (http://www.unicode.org/reports/tr39/) gives more background, and https://www.unicode.org/Public/security/revision-05/intentional.txt has a useful table for Cyrillic and Greek. If someone expects ASCII mail, those transliterations may help to some degree. In all other cases (100% Cyrillic/Greek/...), such a character replacement is counterproductive (for example, not all Cyrillic letters have a valid Latin replacement).

> potentially treat look-alike characters as the latin character for bayesian purposes

The HMM and Bayesian engines use heuristic mechanisms. Trying to treat single characters as Latin (or anything else) is not worth the effort. Within a short period of time, both engines will also have learned the obscured words (and word combinations).

Thomas

From: "K Post" <nntp.p...@gmail.com>
To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date: 06.09.2022 21:31
Subject: Re: [Assp-test] soft hyphen fooling Bayesian analysis

Eager to see what you come up with in terms of ignoring the soft hyphen.

Your <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>> regex is clear, and I understand using that for scoring purposes, but I'm looking for a way to potentially treat look-alike characters as the Latin character for Bayesian purposes and/or to catch commonly obscured words (like GeekSquad). Is it okay if I reply further in my August 1 post here to keep that in the same thread?

On Tue, Sep 6, 2022 at 2:06 PM Thomas Eckardt <thomas.ecka...@thockar.com> wrote:

>HTML::strip
HTML parsing to get the text parts has nothing to do with HTML de(en)coding.

>iso-8859-1
ASSP processes all content as UTF-8.

>&shy;
ASSP is aware of this: it replaces soft hyphens with hard hyphens, and multiple consecutive hard hyphens with a single one. However, the option to remove the soft hyphens instead sounds better. Tests are still running.

>My thinking is that if it doesn't display.....
ASSP doesn't know whether something is displayed or not (and never will).

>I suspect that other characters will be abused in the same way
Likewise, several BIG5, numerical and other Unicode characters already get special handling in ASSP. Other control characters are ignored by ASSP. Everything is converted to UTF-8, Unicode normalized (including grapheme clusters), stemmed and simplified.

>This kind of obfuscation goes hand in hand with my previous questions about considering some non-Latin characters that look like Latin characters as those Latin alphabet characters.
With some Unicode knowledge, some help from the analyzer and some regex knowledge, such things are easy to find. For example:

    <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>>

finds a sequence where Cyrillic letters (a p b ....) are used inside words, which is commonly done by spammers.

Thomas

From: "K Post" <nntp.p...@gmail.com>
To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date: 06.09.2022 16:16
Subject: [Assp-test] soft hyphen fooling Bayesian analysis

Is there a way to improve the way that ASSP parses certain special, non-printing characters?

I'm having trouble with spam emails that have their body heavily obfuscated with "soft hyphens" slipping through. They all seem to have multipart bodies, first with an iso-8859-1 text part with =AD interspersed in words and then an HTML part with &shy; all over the place. These are the "soft hyphen," a hyphen that only prints if it is needed to break the word to the next line. It's clever. The user doesn't see the character, but ASSP thinks it's a word boundary.

The first part

    Content-Type: text/plain; charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

will be plain text and have spammy words with =AD inserted in the middle of them. For example, "This is a sentence with spammy phrase." could be written something like

    This is a sentence with sp=ADammy p=ADhr=ADase.
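The effect described above can be reproduced outside ASSP. The following is a minimal Python sketch (not ASSP code; the function names are made up for illustration) that decodes the quoted-printable iso-8859-1 line and compares two hypothetical tokenizers: one that treats the soft hyphen (U+00AD) as a word boundary and one that drops it before splitting.

```python
import quopri

SOFT_HYPHEN = "\u00ad"  # what =AD decodes to in iso-8859-1


def decode_qp_latin1(raw):
    """Decode a quoted-printable byte string declared as iso-8859-1."""
    return quopri.decodestring(raw).decode("iso-8859-1")


def tokens(text, drop_soft_hyphen):
    """Split on whitespace; the soft hyphen is either removed or
    turned into a word boundary, depending on the flag."""
    replacement = "" if drop_soft_hyphen else " "
    return text.replace(SOFT_HYPHEN, replacement).split()


raw = b"This is a sentence with sp=ADammy p=ADhr=ADase."
body = decode_qp_latin1(raw)

# Soft hyphen as separator: the spammy words fall apart into fragments.
print(tokens(body, drop_soft_hyphen=False))
# Soft hyphen removed: the original words are visible again.
print(tokens(body, drop_soft_hyphen=True))
```

With the separator behaviour, a word-based classifier only ever sees fragments such as "sp" and "ammy"; with removal it sees "spammy" and "phrase." again.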
The next MIME part is the HTML, which does the same thing but uses &shy; (HTML for soft hyphen) mid-word. So, something like:

    <p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>

The whole body of the message is filled with these soft hyphens anywhere that there are spammy words or phrases, and in many cases there are soft hyphens every couple of letters across the entire body.

When I do an analysis, it appears that the soft hyphen tricks ASSP into thinking that each part of the word is a separate word, so for sp&shy;ammy p&shy;hr&shy;ase, it thinks the words are

    sp ammy p hr ase

I am using HTML::strip. Would TreeBuilder work better? I'm concerned about performance there.

Is there a way (and is it a good idea) to somehow instruct ASSP to treat certain HTML special characters as ones to ignore, and others as word separators? My thinking is that if it doesn't display, then it should be ignored when doing Bayesian / HMM evaluation.

https://cs.stanford.edu/people/miles/iso8859.html has a bunch of control characters and special characters that don't print or, in the case of the soft hyphen, only print when the containing word is at the end of a line. I suspect that other characters will be abused in the same way.

This kind of obfuscation goes hand in hand with my previous questions about considering some non-Latin characters that look like Latin characters as those Latin alphabet characters.

Thanks

_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test
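For the HTML part of such a message, the "ignore what does not display" idea from the original post could look roughly like the Python sketch below. It is an illustration only and says nothing about how ASSP or HTML::strip work internally; `visible_text` is a hypothetical helper, and the regex tag stripper is deliberately crude.

```python
import html
import re

SOFT_HYPHEN = "\u00ad"


def visible_text(fragment):
    """Strip tags, decode entities, drop soft hyphens, squeeze whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", fragment)  # crude, for illustration only
    decoded = html.unescape(no_tags)             # turns &shy; into U+00AD
    without_shy = decoded.replace(SOFT_HYPHEN, "")
    return " ".join(without_shy.split())


sample = "<p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>"
print(visible_text(sample))
```

Decoding the entity before removing the character matters here: stripping U+00AD first would leave the literal `&shy;` text in place, and it would turn into a soft hyphen again at the unescape step.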