Hello,

I'm separating the discussion about handling non-Western languages here.

One solution, used for example by XML parsers and by other kinds of 
software that want to do the right thing (tm) at all costs, is:

Read in the message using its declared encoding. HTML, XML, and also 
email have headers that specify what the encoding is. Then convert each 
character into its UTF-32 code point. So each character you read, be it 
English or Chinese, will take up 4 bytes. (Yes, that has some CPU and 
memory impact.)
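
To make the conversion step concrete, here is a minimal sketch in Python. 
It is only an illustration, not dspam code: the function name 
to_utf32_codepoints is made up, and it assumes the charset has already 
been taken from the message headers.

```python
import struct

def to_utf32_codepoints(raw: bytes, charset: str) -> list[int]:
    """Decode raw bytes with the declared charset and return the
    UTF-32 code points, one 32-bit integer per character.
    Hypothetical helper for illustration only."""
    # Undecodable bytes become U+FFFD instead of raising an error.
    text = raw.decode(charset, errors="replace")
    # An explicit byte order ("-be") means no BOM is prepended.
    utf32 = text.encode("utf-32-be")
    count = len(utf32) // 4  # every character is exactly 4 bytes
    return list(struct.unpack(f">{count}I", utf32))
```

With this, an English letter and a Chinese character both end up as a 
single 32-bit number, which is the whole point of the approach.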

Then do what you used to do for Western languages: tokenize using spaces 
as separators. For other languages, split every 4 bytes, so that each 
character becomes its own token.
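
A rough sketch of that tokenizing rule, again in Python and again only 
as an illustration. Which code points count as "other languages" is my 
assumption here: the ranges in is_unspaced_script (a made-up name) cover 
the CJK Unified Ideographs and the Japanese kana blocks, and a real 
implementation would need a more complete table.

```python
WHITESPACE = {0x20, 0x09, 0x0A, 0x0D}  # space, tab, LF, CR

def is_unspaced_script(cp: int) -> bool:
    """Assumed ranges for scripts written without spaces:
    CJK Unified Ideographs, Hiragana and Katakana."""
    return 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF

def tokenize(codepoints: list[int]) -> list[tuple[int, ...]]:
    """Split on whitespace for Western text; emit every character
    of an unspaced script as a token of its own."""
    tokens = []
    word = []
    for cp in codepoints:
        if cp in WHITESPACE or is_unspaced_script(cp):
            if word:  # close the Western word collected so far
                tokens.append(tuple(word))
                word = []
            if is_unspaced_script(cp):
                tokens.append((cp,))  # one 4-byte unit per token
        else:
            word.append(cp)
    if word:
        tokens.append(tuple(word))
    return tokens
```

Tokens are tuples of code points, so the rest of the pipeline never has 
to look at the text as anything but numbers.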

If you don't care about CPU speed or complexity, you could configure the 
tokenizer to split not only every 4 bytes, but every 4, 8, 12, 16, etc. 
bytes. If you do that, you don't even have to treat English and Chinese 
separately. But that's just theoretical; it has a big overhead penalty.
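
One way to read "every 4, 8, 12, 16 bytes" is as overlapping runs of 1, 
2, 3 and 4 code points. The sketch below assumes that reading; the name 
codepoint_ngrams and the max_n parameter are made up for the example.

```python
def codepoint_ngrams(codepoints: list[int], max_n: int = 4) -> list[tuple[int, ...]]:
    """Return every run of 1..max_n consecutive code points,
    i.e. windows of 4, 8, ..., 4*max_n bytes in UTF-32."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(codepoints) - n + 1):
            grams.append(tuple(codepoints[start:start + n]))
    return grams
```

For a message of N characters this produces close to max_n * N tokens, 
which is where the overhead penalty comes from.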

By converting to UTF-32 each time, you don't have to worry about the 
complexity of having an extra gazillion characters; they are just 
another number. For dspam it shouldn't matter: a character is just a 
32-bit code point. You don't have to interpret it; let a UTF library do 
that when you need it.

Alexander

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
