Re: [Dspam-devel] Quick comment on non Western languages

Stevan Bajić Wed, 18 Nov 2009 12:56:27 -0800

On Wed, 18 Nov 2009 21:20:58 +0100
Alexander Prinsier <[email protected]> wrote:


> Hello,
> 
Hello Alexander,


> I'm separating the discussion about handling non-Western languages here.
> 
> One solution, which is what XML parsers use, for example, along with 
> other kinds of software that want to do the right thing (tm) at all 
> costs, is:
> 
> Read in the message, using its encoding type. HTML, XML, and also email 
> have headers that specify what the encoding type is. Then convert each 
> character into its UTF-32 code point.
>
Check.
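
A minimal Python sketch of that first step, assuming the charset is declared
correctly in the Content-Type header. The sample message and the helper name
body_codepoints are made up for illustration; this is not dspam code.

```python
import base64
import email

# Hypothetical sample message: a KOI8-R body sent base64-encoded.
RAW = (
    b"Content-Type: text/plain; charset=koi8-r\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode("Привет".encode("koi8-r"))
)

def body_codepoints(raw_message: bytes) -> list:
    """Decode the body with its declared charset and return code points."""
    msg = email.message_from_bytes(raw_message)
    # Fall back to us-ascii when no charset is declared (an assumption).
    charset = msg.get_content_charset() or "us-ascii"
    payload = msg.get_payload(decode=True)  # undoes base64 / quoted-printable
    text = payload.decode(charset, errors="replace")
    return [ord(ch) for ch in text]

if __name__ == "__main__":
    print([hex(cp) for cp in body_codepoints(RAW)])
```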


> So every character you read, be it English or Chinese, will take up 
> 4 bytes. (Yes, that has some CPU and memory impact.)
> 
Check.
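
To put a rough number on the memory impact, here is a small Python sketch
comparing UTF-8 and UTF-32 sizes. The helper name sizes and the sample strings
are only illustrative.

```python
def sizes(text: str) -> dict:
    """Return the character count and the encoded size in UTF-8 and UTF-32."""
    return {
        "chars": len(text),
        "utf8": len(text.encode("utf-8")),
        # The -le variant avoids the 4-byte byte order mark.
        "utf32": len(text.encode("utf-32-le")),
    }

if __name__ == "__main__":
    for sample in ("spam", "спам", "垃圾邮件"):
        print(sample, sizes(sample))
```

In UTF-32 every sample costs 4 bytes per character, while UTF-8 needs 1, 2 and
3 bytes per character for these Latin, Cyrillic and Chinese samples.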


> Then do what you used to do for Western languages: tokenize using spaces 
> as separators. For other languages split every 4 bytes.
> 
Not going to work. You see my name? It's Slavic. And I am able to write in 
other languages (9 of them) and in other scripts (2 of them). So besides Latin 
letters I am able to read and write Cyrillic. So here is my answer to the 
break at 4 bytes: Тхис ис јуст нот гоинг то wорк. :)
(That is "This is just not going to work." spelled with Cyrillic letters.)
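
A small Python sketch of the problem, with function names made up for
illustration: cutting the UTF-32 data every 4 bytes yields single letters,
while the words of a Cyrillic sentence are still separated by plain spaces.

```python
SENTENCE = "Тхис ис јуст нот гоинг то wорк."

def split_every_4_bytes(text: str) -> list:
    """Cut the UTF-32 representation into 4-byte pieces (one code point each)."""
    data = text.encode("utf-32-le")
    return [data[i:i + 4].decode("utf-32-le") for i in range(0, len(data), 4)]

def split_on_whitespace(text: str) -> list:
    """Split at whitespace, as is done for Western languages."""
    return text.split()

if __name__ == "__main__":
    print(split_every_4_bytes(SENTENCE))  # single letters, spaces included
    print(split_on_whitespace(SENTENCE))  # whole words
```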


> If you don't care about CPU speed or complexity, you could configure the 
> tokenizer to not only split every 4 bytes, but every 4, 8, 12, 16, etc. 
> bytes.
>
Check. Everything past 4 characters (4 x 4 bytes) is pointless (according to 
the documents I have read on how to break Asian languages). Every tokenizer 
other than SBPH and OSB would profit from producing more n-grams. SBPH and OSB 
would not profit much from more n-grams.
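
As a rough illustration of "split every 4, 8, 12, 16 bytes", here is a Python
sketch that produces character n-grams of 1 to 4 code points. The name
char_ngrams and the max_n parameter are assumptions for this example; this is
not how the SBPH or OSB tokenizers work.

```python
def char_ngrams(text: str, max_n: int = 4) -> list:
    """Return all character n-grams of length 1..max_n, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

if __name__ == "__main__":
    # Four ideographs give 4 + 3 + 2 + 1 = 10 n-grams.
    print(char_ngrams("垃圾邮件"))
```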


> If you do that, you don't even have to treat English and Chinese 
> separately.
>
You do. You should. Any language that has letters and not symbols is better 
broken into words at its word boundaries.
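
One possible way to treat the two cases separately, sketched in Python. The
classification rule (Unicode names starting with "CJK UNIFIED IDEOGRAPH") and
the names char_class and tokenize are assumptions for illustration only; a
real tokenizer would need a more careful notion of script and word boundary.

```python
import unicodedata
from itertools import groupby

def char_class(ch: str) -> str:
    """Classify a character as ideograph, letter or separator (simplified)."""
    if unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH"):
        return "ideograph"
    if ch.isalnum():
        return "letter"
    return "separator"

def tokenize(text: str, max_n: int = 2) -> list:
    """Keep alphabetic words whole; break ideograph runs into n-grams."""
    tokens = []
    for cls, group in groupby(text, key=char_class):
        run = "".join(group)
        if cls == "letter":
            tokens.append(run)
        elif cls == "ideograph":
            for n in range(1, max_n + 1):
                tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens

if __name__ == "__main__":
    print(tokenize("Hello мир 垃圾邮件!"))
```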


> But that's just theoretical, it has a big overhead penalty.
> 
> By doing a conversion to UTF-32 each time you don't have to worry about 
> the complexity of having an extra gazillion characters, they are just 
> another number.
>
Check. But you need to handle the different conversion methods. There is no 
universal conversion available. You need to know the character set used in 
order to do the conversion the proper way. Or you have a library that does 
that for you.
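
A short Python sketch of why the character set has to be known: the same
bytes decode without any error under two different Cyrillic code pages, but
only one result is the intended text. The helper name decode_with is made up
for this example.

```python
def decode_with(raw: bytes, charset: str) -> str:
    """Decode raw bytes under the given charset using Python's codecs."""
    return raw.decode(charset)

# "спам" as it would arrive in a KOI8-R encoded message.
raw = "спам".encode("koi8-r")

right = decode_with(raw, "koi8-r")  # charset matches the sender's
wrong = decode_with(raw, "cp1251")  # plausible guess, silently wrong

if __name__ == "__main__":
    print(right, wrong)
```

Both calls succeed, so a wrong guess cannot be detected from a decoding error
alone.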


> For dspam, it shouldn't matter, a character is just a 
> 32-bit code point, you don't have to interpret it, let a UTF library do 
> that when you need it.
> 
The character is not my problem. The separator is. What is separating 
characters/symbols from words? That is the point of the whole discussion.


> Alexander
> 
-- 
Kind Regards from Switzerland,

Stevan Bajić

