Ted Zlatanov <[EMAIL PROTECTED]> writes:

> On Thu, 06 Nov 2008 10:34:25 +0100 Michal Nazarewicz <[EMAIL PROTECTED]>
> wrote:
>
> MN> "<<" and ">>" have codes U+00AB and U+00BB so that's why they match but
> MN> there are plenty of other characters which may show up in an English
> MN> text, like (I'll use a (sequence of) ASCII characters which resembles
> MN> the proper unicode character) "`" (U+2018), "'" (U+2019), "``" (U+201C)
> MN> , "''" (U+201D) or "..." (U+2026) which will cause the entry to be
> MN> filtered out.
>
> Agreed.  It's not an easy problem without Unicode properties, but for
> the *subject* of the message it's a passable heuristic.
>
> MN> Besides, I think what you really meant was:
>
> MN> (string-match "[^\\0-\\177]" "string")
>
> MN> since "1ff" is not a valid octal number.
>
> Yes.  Sorry.
>
> MN> I think that taking the title of the entry and checking if at least 90%
> MN> are ASCII characters would be sufficient to filter out Asian texts.  You
> MN> can also try taking first 100 (or so) characters of the body.  I think
> MN> you could use replace-regexp-in-string for that purpose:
>
> MN> (defun mn-non-english-p (string)
> MN>   (>
> MN>    (* (length (replace-regexp-in-string "[^\\0-\\77]" "" string)) 10)
> MN>    (* (length string) 9)))
>
> That might work, but for a score file a simple regular expression is
> better, and I understood the OP to need a score file.

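For reference, here is a minimal sketch of the ratio idea quoted above, so a handful of typographic characters does not disqualify a subject on its own.  The function name my-mostly-ascii-p and the optional THRESHOLD argument are made up for illustration; they are not part of Gnus or of the code quoted above.  It uses the [:nonascii:] character class instead of an octal range, and I have not wired it into a score file.

```elisp
;; Hypothetical helper, not part of Gnus: non-nil when STRING is mostly ASCII.
(defun my-mostly-ascii-p (string &optional threshold)
  "Return non-nil if at least THRESHOLD of STRING's characters are ASCII.
THRESHOLD is a fraction between 0 and 1 and defaults to 0.9.
An empty STRING is treated as ASCII."
  (let* ((threshold (or threshold 0.9))
         (total (length string))
         ;; Delete every non-ASCII character and count what is left.
         (ascii (length (replace-regexp-in-string "[[:nonascii:]]" "" string))))
    (or (zerop total)
        (>= ascii (* threshold total)))))

;; A plain ASCII subject passes; an all-Japanese subject does not.
(my-mostly-ascii-p "Re: score files")
(my-mostly-ascii-p "日本語のテキスト")
```

Short subjects are sensitive to the threshold: a twelve-character subject with one pair of curly quotes is only about 83% ASCII, so it fails at 0.9 and passes at 0.8.  The cutoff probably needs tuning against real subjects.
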
Score files are great.  Truth be told, I'm just looking for what works.
I like your solution, but it will exclude posts with Unicode characters,
which is something I would like to avoid if possible.

Thanks,
rdc
-- 
Robert D. Crawford                                     [EMAIL PROTECTED]

semper en excretus
_______________________________________________
info-gnus-english mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/info-gnus-english
