Hello, I'm separating the discussion about handling non-Western languages here.
One solution, which is what XML parsers use, for example, along with other kinds of software that want to do the right thing (tm) at all costs, is this:

Read in the message using its declared encoding. HTML and XML, but also email, have headers that specify what the encoding is. Then convert each character to its Unicode code point, stored as UTF-32. So every character you read, be it English or Chinese, will take up 4 bytes. (Yes, that has some CPU and memory impact.)

Then do what you used to do for Western languages: tokenize using spaces as separators. For languages that do not put spaces between words, split every 4 bytes, so that each character becomes its own token.

If you don't care about CPU speed or complexity, you could configure the tokenizer to split not only every 4 bytes, but also every 8, 12, 16, etc. bytes, which gives you tokens of 1, 2, 3, 4, etc. characters. If you do that, you don't even have to treat English and Chinese separately. But that's just theoretical; it has a big overhead penalty.

By converting to UTF-32 each time, you don't have to worry about the complexity of having an extra gazillion characters; they are just another number. For dspam it shouldn't matter: a character is just a 32-bit code point. You don't have to interpret it; let a UTF library do that when you need it.

Alexander

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
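
P.S. To make the idea concrete, here is a rough sketch in Python. It is not dspam code. The function names (to_codepoints, is_space_delimited, tokenize) and the max_n parameter are made up for illustration, and the check for "languages without spaces" is a crude assumption that only looks at the kana and main CJK ideograph ranges. Tokens are tuples of 32-bit code points, so nothing ever has to interpret the characters.

```python
import struct


def to_codepoints(raw: bytes, charset: str) -> list[int]:
    """Decode raw bytes with the declared charset; return one 32-bit number per character."""
    text = raw.decode(charset, errors="replace")
    utf32 = text.encode("utf-32-le")  # exactly 4 bytes per character, no BOM
    return list(struct.unpack(f"<{len(utf32) // 4}I", utf32))


def is_space_delimited(cp: int) -> bool:
    """Assumed heuristic: kana and CJK ideographs are NOT separated by spaces."""
    return not (0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF)


def tokenize(codepoints, max_n=1):
    """Split on whitespace; within CJK runs emit every 1..max_n character sequence."""
    tokens, word, run = [], [], []

    def flush():
        # Only one of word/run is non-empty at a time.
        if word:
            tokens.append(tuple(word))
            word.clear()
        for n in range(1, max_n + 1):          # n characters = 4*n bytes
            for i in range(len(run) - n + 1):
                tokens.append(tuple(run[i:i + n]))
        run.clear()

    for cp in codepoints:
        if chr(cp).isspace():
            flush()
        elif is_space_delimited(cp):
            if run:
                flush()
            word.append(cp)
        else:
            if word:
                flush()
            run.append(cp)
    flush()
    return tokens
```

With max_n=1 each Chinese character becomes its own token (the "split every 4 bytes" case); a larger max_n also emits the 8-, 12-, 16-byte sequences, at the cost of many more tokens.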