On Wed, Nov 18, 2009 at 09:20:58PM +0100, Alexander Prinsier wrote:
> Hello,
> 
> I'm separating the discussion about handling non-Western languages here.
> 
> One solution, used for example by XML parsers and other kinds of 
> software that want to do the right thing (tm) at all costs, is:
> 
> Read in the message using its declared encoding. HTML, XML, and email 
> all have headers that specify the encoding. Then convert each 
> character into its UTF-32 code point, so that *each* character you 
> read, be it English or Chinese, takes up 4 bytes. (Yes, that has 
> some CPU and memory impact.)
> 
> Then do what you used to do for Western languages: tokenize using spaces 
> as separators. For other languages split every 4 bytes.
> 
> If you don't care about CPU speed or complexity, you could configure the 
> tokenizer to split not only every 4 bytes, but every 4, 8, 12, 16, etc. 
> bytes. If you do that, you don't even have to treat English and Chinese 
> separately. But that's just theoretical; it has a big overhead penalty.
> 
> By converting to UTF-32 each time, you don't have to worry about 
> the complexity of having an extra gazillion characters; they are just 
> more numbers. For dspam it shouldn't matter: a character is just a 
> 32-bit code point. You don't have to interpret it; let a UTF library do 
> that when you need it.
> 
> Alexander
> 
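For illustration only, the proposal quoted above could be sketched roughly
as follows. This is Python rather than dspam's C, and it is not dspam code:
the function names and the CJK_RANGES table are hypothetical, and the ranges
are a deliberately crude assumption about which scripts lack word spaces.
A Python str is already a sequence of code points, so it stands in here for
the UTF-32 array.

```python
# Hypothetical, incomplete list of code point ranges for scripts that are
# written without spaces between words (an assumption for this sketch).
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul syllables
)


def is_cjk(ch: str) -> bool:
    """Return True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def tokenize(raw: bytes, charset: str) -> list[str]:
    """Decode with the declared charset, then split on whitespace,
    emitting every CJK character as a token of its own."""
    text = raw.decode(charset, errors="replace")
    tokens: list[str] = []
    word: list[str] = []

    def flush() -> None:
        # Emit the pending space-delimited word, if any.
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if ch.isspace():
            flush()
        elif is_cjk(ch):
            flush()
            tokens.append(ch)  # one token per code point
        else:
            word.append(ch)
    flush()
    return tokens
```

With this sketch, the same message yields the same tokens whether it
arrives as UTF-8 or as GB2312, because tokenizing happens after decoding.
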
I thought that UTF-8, UTF-16 and UTF-32 can all represent every character.
In that case, why wouldn't you use the UTF-8 equivalent? At the very least it
would save space.
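
A quick way to check the space argument, as a minimal Python sketch (the
helper name encoded_sizes is made up for this example, and the "-le" codec
variants are used only so that no byte order mark is counted):

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes the text occupies in each encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def code_points(raw: bytes, encoding: str) -> list[int]:
    """Decode the bytes and return the underlying code points."""
    return [ord(ch) for ch in raw.decode(encoding)]


print(encoded_sizes("spam"))   # ASCII text: 1 byte per character in UTF-8
print(encoded_sizes("便宜"))   # Chinese text: 3 bytes per character in UTF-8
```

All three forms decode to the same code points, so the choice affects only
storage size and how easy it is to find character boundaries, not which
characters can be represented.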

Ken

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
