On 11/18/2009 09:53 PM, Stevan Bajić wrote:
>> Then do what you used to do for Western languages: tokenize using spaces
>> as separators. For other languages split every 4 bytes.
>>
> Not going to work. You see my name? It's Slavic. And I am able to write in
> other languages (9 of them) and in other alphabets (2 of them). So besides Latin
> letters I am able to read and write in Cyrillic. So here is my answer to the
> break at 4 bytes: Тхис ис јуст нот гоинг то wорк. :)

So you mean you can break Cyrillic/Slavic text at spaces too, like Western
languages? Then it will work: you break everything you know at spaces,
and what you don't know, like Chinese, at UTF-32 code points.
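To make that concrete, here is a minimal sketch in Python of such a hybrid tokenizer. It is only an illustration, not what dspam does: the names `is_cjk` and `tokenize` are made up for this mail, and treating just two CJK ideograph blocks as "what you don't know" is an assumption to keep the example short.

```python
def is_cjk(ch):
    # Assumption: only these two CJK ideograph blocks count as a script
    # without spaces. A real tokenizer would need a much fuller table.
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def tokenize(text):
    """Split at whitespace, but emit every CJK ideograph as its own token."""
    tokens = []
    word = []
    for ch in text:
        if is_cjk(ch):
            # Flush any pending space-delimited word, then emit the ideograph.
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isspace():
            if word:
                tokens.append("".join(word))
                word = []
        else:
            word.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens


if __name__ == "__main__":
    print(tokenize("Тхис ис spam 垃圾邮件"))
```

Latin and Cyrillic words come out whole, while each ideograph becomes a single token, which is the same as one 4-byte unit once the text is in UTF-32.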

>> If you do that, you don't even have to treat English and Chinese
>> separately.
>>
> You do. You should. Any language that has letters and not symbols is better 
> broken into words at word boundaries.

I was assuming you make the n-grams so big that they cover all the words
(breaking at the spaces) too. Yes, that's unrealistic, but see the note I
added right after that sentence :)
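For illustration, a small Python sketch of what code point n-grams look like. The function name `codepoint_ngrams` is hypothetical, and this only shows the idea: an n-gram can cover a whole word only when n is at least as large as the word.

```python
def codepoint_ngrams(text, n):
    """Return all contiguous runs of n code points in text."""
    # Python strings are sequences of code points, so slicing is enough.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(codepoint_ngrams("spam", 2))
    print(codepoint_ngrams("spam", 4))
```

With n = 2 the word "spam" is cut into three overlapping pieces, and only n = 4 keeps it whole, so covering every word would need very large n and many more tokens.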

>> But that's just theoretical, it has a big overhead penalty.
>>
>> By doing a conversion to UTF-32 each time you don't have to worry about
>> the complexity of having an extra gazillion characters, they are just
>> another number.
>>
> Check. But you need to handle the different conversion methods. There is no
> universal conversion available. You need to know the character set used in
> order to do the conversion properly. Or you have a library that
> does that for you.

Those libraries exist. iconv, for example, converts between 'anything'
and UTF-8.

There's also IBM's ICU (an open source library) if you need something heavier.
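As a rough sketch of the conversion step, here it is in Python, using Python's built-in codecs as a stand-in for iconv or ICU (so this is not the iconv API). The helper names `to_utf8` and `to_code_points` are made up for this example, and the charset is assumed to be known already, e.g. from the message headers.

```python
def to_utf8(raw, charset):
    """Re-encode bytes from a known charset into UTF-8."""
    return raw.decode(charset).encode("utf-8")


def to_code_points(raw, charset):
    """Decode bytes and return the code points as plain integers."""
    return [ord(ch) for ch in raw.decode(charset)]


if __name__ == "__main__":
    raw = "Спам".encode("cp1251")  # four bytes in Windows-1251

    # With the right charset, each character is just another number.
    print(to_code_points(raw, "cp1251"))

    # With the wrong charset the same bytes decode to different text.
    print(raw.decode("latin-1"))

    # In UTF-32 every character takes exactly 4 bytes.
    print(len(raw.decode("cp1251").encode("utf-32-be")))
```

The same four bytes give Cyrillic under cp1251 and accented Latin letters under latin-1, which is Stevan's point: the library does the conversion, but you still have to tell it the source character set.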

Alexander

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
