Eager to see what you come up with for ignoring the soft hyphen. Your <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>> regex is clear, and I understand using it for scoring purposes, but I'm looking for a way to treat look-alike characters as their Latin counterparts for Bayesian purposes, and/or to catch commonly obscured words (like GeekSquad). Is it okay if I reply further in my August 1 post to keep that in the same thread?
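To make the look-alike idea concrete, here is a rough sketch in Python (not ASSP's Perl, and not ASSP code). Everything in it is an assumption for illustration: the `LOOKALIKES` table is a small hand-picked sample, and the function names are made up. Python's built-in `re` has no `\p{Cyrillic}`, so instead of the regex above it uses a per-word check on Unicode character names, which is a different (cruder) test than the regex.

```python
import unicodedata

# Hypothetical, hand-picked sample of Cyrillic letters that render like
# Latin ones. A real table would need to be much more complete.
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0405": "S",  # CYRILLIC CAPITAL LETTER DZE
})


def is_cyrillic(ch):
    # Unicode character names for Cyrillic letters start with "CYRILLIC".
    return unicodedata.name(ch, "").startswith("CYRILLIC")


def is_latin(ch):
    return unicodedata.name(ch, "").startswith("LATIN")


def is_mixed_script(word):
    """True if the word contains both Cyrillic and Latin letters."""
    return any(is_cyrillic(c) for c in word) and any(is_latin(c) for c in word)


def fold_lookalikes(text):
    """Map look-alikes to Latin, but only inside mixed-script words,
    so that genuine Cyrillic text is left untouched."""
    folded = []
    for word in text.split(" "):
        if is_mixed_script(word):
            word = word.translate(LOOKALIKES)
        folded.append(word)
    return " ".join(folded)
```

With this, `fold_lookalikes("G\u0435\u0435kSqu\u0430d")` gives `"GeekSquad"`, while an all-Cyrillic word passes through unchanged. The obvious gap is a word spelled entirely in look-alikes, which this check would not touch.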
On Tue, Sep 6, 2022 at 2:06 PM Thomas Eckardt <thomas.ecka...@thockar.com> wrote:

> >HTML::strip
>
> HTML parsing to get the text parts has nothing to do with HTML de(en)coding.
>
> >iso-8859-1
>
> ASSP processes all content as UTF-8.
>
> >&shy;
>
> ASSP is aware of this - it replaces soft hyphens with hard hyphens, and
> multiple consecutive hard hyphens with a single one.
> However, the option to remove the soft hyphens instead sounds somehow
> better. Tests are still running.
>
> >My thinking is that if it doesn't display.....
>
> ASSP doesn't know whether something is displayed or not (and never will).
>
> >I suspect that other characters will be abused in the same way
>
> As well, several BIG5, numerical and other Unicode characters are
> already specially handled by ASSP. Other CTL chars are ignored by ASSP.
> Everything is converted to UTF-8, Unicode normalized (including grapheme
> clusters), stemmed and simplified.
>
> >This kind of obfuscation goes hand in hand with my previous questions
> >about considering some non-Latin characters that look like Latin characters
> >as those Latin alphabet characters.
>
> With some Unicode knowledge, some help from the analyzer and some regex
> knowledge, such things are easy to find.
> For example: <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>>
> finds a sequence where Cyrillic characters (a p b ....) are used in words,
> something commonly done by spammers.
>
> Thomas
>
> From: "K Post" <nntp.p...@gmail.com>
> To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Date: 06.09.2022 16:16
> Subject: [Assp-test] soft hyphen fooling Bayesian analysis
> ------------------------------
>
> Is there a way to improve the way that ASSP parses certain special,
> non-printing characters? I'm having trouble with spam emails that have
> their body heavily obfuscated with "soft hyphens" slipping through.
> They all seem to have multipart bodies, first with an iso-8859-1 text part
> with *=AD* interspersed in words and then an HTML part with *&shy;* all
> over the place. These are the "soft hyphen," a hyphen that only prints if
> it is needed to break the word to the next line. It's clever. The user
> doesn't see the character, but ASSP thinks it's a word boundary.
>
> The first part
> Content-Type: text/plain; charset="*iso-8859-1*"
> Content-Transfer-Encoding: quoted-printable
> will be plain text and have spammy words with *=AD* inserted in the
> middle of them. For example, "This is a sentence with spammy phrase." could
> be written something like
> This is a sentence with sp=ADammy p=ADhr=ADase.
>
> The next MIME part is the HTML, which does the same thing but uses &shy;
> (HTML for soft hyphen) mid-word. So, something like:
> <p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>
>
> The whole body of the message is filled with these soft hyphens anywhere
> that there are spammy words/phrases, and in many cases there are soft
> hyphens every couple of letters across the entire body. When I do an
> analysis, it appears that the soft hyphen tricks ASSP into thinking that
> each part of the word is a separate word, so for sp&shy;ammy
> p&shy;hr&shy;ase, it thinks the words are
> sp ammy p hr ase
>
> I am using HTML::strip. Would TreeBuilder work better? I'm concerned
> about performance there.
>
> Is there a way (and is it a good idea) to somehow instruct ASSP to treat
> certain HTML special characters as ones to ignore, and others to be treated
> as a word separator? My thinking is that if it doesn't display, then it
> should be ignored when doing Bayesian / HMM evaluation.
>
> https://cs.stanford.edu/people/miles/iso8859.html has a bunch of
> Control Characters and Special Characters that don't print - or in the case
> of the soft hyphen, only print when the contained word is at the end of a
> line.
> I suspect that other characters will be abused in the same way.
>
> This kind of obfuscation goes hand in hand with my previous questions
> about considering some non-Latin characters that look like Latin characters
> as those Latin alphabet characters.
>
> Thanks
>
> [Attachment "attz351u.txt" deleted by Thomas Eckardt/eck]
> [Attachment "att8gq15.txt" deleted by Thomas Eckardt/eck]
>
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test
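
The word splitting described in the quoted message can be reproduced outside ASSP with a short Python sketch. It is only an illustration, not ASSP's Perl code. The names `decode_qp_latin1` and `words` are made up here, and the letters-only tokenizer is a stand-in assumption for whatever ASSP really does. It compares the two options mentioned in the thread: turning the soft hyphen (U+00AD) into a hard hyphen versus removing it.

```python
import html
import quopri
import re

SOFT_HYPHEN = "\u00ad"  # what "=AD" (quoted-printable, iso-8859-1) and "&shy;" decode to


def decode_qp_latin1(raw):
    """Decode a quoted-printable, iso-8859-1 body part (bytes) to text."""
    return quopri.decodestring(raw).decode("iso-8859-1")


def words(text, drop_soft_hyphens=True):
    """Very rough tokenizer: lower-cased runs of ASCII letters.

    drop_soft_hyphens=True  removes U+00AD before splitting;
    drop_soft_hyphens=False turns it into "-", which then acts
    as a word boundary for this tokenizer.
    """
    text = html.unescape(text)  # "&shy;" becomes U+00AD
    replacement = "" if drop_soft_hyphens else "-"
    text = text.replace(SOFT_HYPHEN, replacement)
    return re.findall(r"[a-z]+", text.lower())


plain = decode_qp_latin1(b"sp=ADammy p=ADhr=ADase.")
print(words(plain, drop_soft_hyphens=False))  # fragments
print(words(plain, drop_soft_hyphens=True))   # whole words
```

For the example body above, keeping the hyphen yields the fragments `sp ammy p hr ase`, while dropping it yields `spammy phrase`. The same holds for the HTML form `sp&shy;ammy p&shy;hr&shy;ase`.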