On Thursday, December 17, 2009, Sidney Markowitz <[email protected]> wrote:
> Henrik K wrote, On 17/12/09 7:37 PM:
> > Justin, are you the only one who knows about TextCat? Have you looked at it?
>
> I was involved with it when we first ported it to SpamAssassin, but it's been
> years since I looked at it. I think that I may be the person most familiar
> with it, though. I'm afraid that I didn't notice that bug in the database.
>
> Uppercase characters are a tricky problem that had not occurred to me. If
> textcat is going to recognize languages in multibyte charsets without trying
> to do any kind of charset decoding, then it can't lowercase all the
> characters as if it is assuming that they are Roman ASCII. Unless we train it
> on all-uppercase English as a separate language, it won't recognize it as the
> English that it trained on.
>
> I guess any more comments on the bug itself ought to be placed in Bugzilla.
> This mailing list is a fair place to discuss whether it should be considered
> a blocker for 3.3.0. Personally, I don't think it is. It may be the case that
> it is a deficiency in using TextCat in SpamAssassin, but it is one that it has
> always had, among others. It would be good if someone came up with a way for
> it to be smarter about charsets, but I don't think that can happen in the
> 3.3.0 time frame.

+1

>
> -- sidney
>

--
--j.
