On 11/18/2009 09:53 PM, Stevan Bajić wrote:
>> Then do what you used to do for Western languages: tokenize using spaces
>> as separators. For other languages split every 4 bytes.
>>
> Not going to work. You see my name? It's Slavic. And I am able to write in
> other languages (9 of them) and in other letters (2 of them). So besides Latin
> letters I am able to read and write in Cyrillic. So here is my answer for the
> break at 4 bytes: Тхис ис јуст нот гоинг то wорк. :)

So you mean you can break Cyrillic/Slavic text at spaces too, like Western
languages? Then it will work: you break everything you know at spaces, and
what you don't know, like Chinese, at UTF-32 code points.

>> If you do that, you don't even have to treat English and Chinese
>> separately.
>>
> You do. You should. Any language that has letters and not symbols is better
> broken into words at their word boundary.

I was assuming you make the n-grams so big that they cover all the words
(breaking at the spaces) too. Yeah, unrealistic, but see my note one line
below this sentence :)

>> But that's just theoretical, it has a big overhead penalty.
>>
>> By doing a conversion to UTF-32 each time you don't have to worry about
>> the complexity of having an extra gazillion characters, they are just
>> another number.
>>
> Check. But you need to handle the different conversion methods. There is no
> universal conversion available. You need to know the used character set in
> order to make the conversion the proper way. Or you have a library that
> does that for you.

Those libraries exist: iconv, for example, converts from 'anything' to
UTF-8. There's also IBM's ICU (an open source library) if you need something
heavier.

Alexander
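
To make the idea discussed in this message concrete, here is a minimal Python sketch. It is not DSPAM code, only an illustration under a few assumptions: the character set of the message is already known (for example from the MIME headers), Python's built-in `bytes.decode` stands in for an iconv-style conversion, and the function names `is_unsegmented` and `tokenize` are made up for this example. The test for "scripts written without spaces" covers only a few CJK blocks and is deliberately rough.

```python
from typing import List


def is_unsegmented(ch: str) -> bool:
    """Rough, assumed test for scripts written without spaces.

    Only a handful of CJK blocks are checked; a real tokenizer
    would need a far more complete table.
    """
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
    )


def tokenize(raw: bytes, charset: str) -> List[str]:
    """Decode with the declared charset, then split at whitespace,
    emitting one token per code point for unsegmented scripts."""
    # After decoding, every character is just a code point (a number),
    # regardless of how many bytes it used in the original encoding.
    text = raw.decode(charset, errors="replace")

    tokens: List[str] = []
    word: List[str] = []

    def flush() -> None:
        # Emit the word collected so far, if any.
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if ch.isspace():
            flush()
        elif is_unsegmented(ch):
            flush()
            tokens.append(ch)   # one token per code point
        else:
            word.append(ch)     # Latin, Cyrillic, etc.: build up a word
    flush()
    return tokens


if __name__ == "__main__":
    print(tokenize("привет мир".encode("koi8-r"), "koi8-r"))
    print(tokenize("hello 中文 world".encode("utf-8"), "utf-8"))
```

Because the split happens on decoded code points and not on raw bytes, Cyrillic words stay whole whether they arrive as KOI8-R or UTF-8. Punctuation stays attached to the neighbouring word in this sketch, and the per-code-point tokens would still have to be combined into n-grams to be useful.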