Henrik K wrote, On 17/12/09 7:37 PM:
Justin, are you the only one who knows about TextCat? Have you looked at it?

I was involved with it when we first ported it to SpamAssassin, but it's been years since I looked at it. I think I may be the person most familiar with it, though. I'm afraid I didn't notice that bug in the database.

Uppercase characters are a tricky problem that had not occurred to me. If TextCat is going to recognize languages in multibyte charsets without doing any kind of charset decoding, then it can't lowercase every byte as if the text were plain ASCII Roman letters. But unless we train it on all-uppercase English as a separate language, it won't recognize uppercase text as the English it was trained on.
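
To make both halves of that concrete, here is a minimal Python sketch. It is not TextCat's or SpamAssassin's actual code: the char_ngrams helper and the sample strings are hypothetical, and Shift_JIS is just one multibyte charset I picked as an assumption for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in text (hypothetical helper)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# 1. A profile built from lowercase English shares no trigrams with the
#    same sentence in uppercase, so an n-gram matcher sees a different
#    "language" unless it case-folds or is trained on uppercase text too.
lower_profile = char_ngrams("the quick brown fox")
upper_sample = char_ngrams("THE QUICK BROWN FOX")
shared = lower_profile & upper_sample

# 2. Lowercasing raw bytes as if they were ASCII is not safe for multibyte
#    charsets. In Shift_JIS the second byte of a two-byte character can
#    fall in the ASCII uppercase range, so bytes.lower() rewrites it.
sjis = "ア".encode("shift_jis")   # two bytes, the second one is 0x41 ('A')
mangled = sjis.lower()            # second byte becomes 0x61 ('a')
decoded = mangled.decode("shift_jis", errors="replace")

print(len(shared))        # number of trigrams the two profiles share
print(sjis, mangled)      # byte strings before and after lowercasing
print(decoded == "ア")    # whether the character survived
```

So blind lowercasing trades one failure (all-caps English) for another (corrupted multibyte text), which is why I think the charset has to be known before any case folding can be done safely.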

I guess any further comments on the bug itself ought to go in Bugzilla. This mailing list is a fair place to discuss whether it should be considered a blocker for 3.3.0. Personally, I don't think it is. It may well be a deficiency in how TextCat is used in SpamAssassin, but it is one that it has always had, among others. It would be good if someone came up with a way for it to be smarter about charsets, but I don't think that can happen in the 3.3.0 time frame.

 -- sidney
