On Wed, Nov 18, 2009 at 09:20:58PM +0100, Alexander Prinsier wrote:
> Hello,
>
> I'm separating the discussion about handling non-Western languages here.
>
> One solution, which is what is used by for example xml parsers, and
> other kinds of software which want to do the right thing (tm) at all
> costs, is:
>
> Read in the message, using its encoding-type. Html, Xml, but also email
> have headers that specify what the encoding type is. Then convert each
> character into its UTF-32 codepoint. So each character you read, be it
> English or Chinese, will take up 4 bytes. (yeah that has some cpu and
> memory impact)
>
> Then do what you used to do for Western languages: tokenize using spaces
> as separators. For other languages split every 4 bytes.
>
> If you don't care about cpu speed or complexity, you could configure the
> tokenizer to not only split every 4 bytes, but every 4, 8, 12, 16, etc
> bytes. If you do that, you don't even have to treat English and Chinese
> separately. But that's just theoretical, it has a big overhead penalty.
>
> By doing a conversion to UTF32 each time you don't have to worry about
> the complexity of having an extra gazillion characters, they are just
> another number. For dspam, it shouldn't matter, a character is just a
> 32bit codepoint, you don't have to interpret it, let a utf library do
> that when you need it.
>
> Alexander
>

I thought that UTF-8, UTF-16 and UTF-32 can all represent every Unicode
character. In that case, why wouldn't you use the UTF-8 equivalent? At
the very least it would save space: 1 byte per ASCII character and 3 per
common CJK character, versus a fixed 4 bytes in UTF-32.

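To make the comparison concrete, here is a rough Python sketch of the
scheme described above. It is only an illustration, not dspam code: the
function names are made up, and the rule for deciding which characters
get split one by one (CJK ideographs, Hiragana, Katakana) is a crude
assumption of mine. The point is that once the message is decoded, the
tokenizer works on code points, so whether the text is stored as UTF-8 or
UTF-32 only changes the byte count.

```python
# Illustrative sketch only: none of these names come from dspam.
import unicodedata
from typing import Dict, List


def decode_message(raw: bytes, charset: str) -> str:
    """Decode the raw body using the charset declared in the headers."""
    return raw.decode(charset, errors="replace")


def is_unspaced_script(ch: str) -> bool:
    """Assumption: CJK ideographs and kana become one token per character."""
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA"))


def tokenize(text: str) -> List[str]:
    """Split on whitespace, but emit each unspaced-script character alone."""
    tokens: List[str] = []
    word: List[str] = []

    def flush() -> None:
        # Emit the word collected so far, if any.
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:  # iterating a str yields code points, not bytes
        if ch.isspace():
            flush()
        elif is_unspaced_script(ch):
            flush()
            tokens.append(ch)
        else:
            word.append(ch)
    flush()
    return tokens


def encoded_sizes(text: str) -> Dict[str, int]:
    """Byte counts for the same text in UTF-8 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-32": len(text.encode("utf-32-le")),
    }
```

For example, tokenize(decode_message(body, "gb2312")) gives the same
tokens no matter how the decoded text is stored afterwards, while
encoded_sizes() shows the storage difference for a given string.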
Ken

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel