On Wed, 18 Nov 2009 22:18:40 +0100 Alexander Prinsier <aphe...@mailhaven.com> wrote:
> On 11/18/2009 09:53 PM, Stevan Bajić wrote:
> >> Then do what you used to do for Western languages: tokenize using spaces
> >> as separators. For other languages split every 4 bytes.
> >>
> > Not going to work. You see my name? It's Slavic. And I am able to write in
> > other languages (9 of them) and in other letters (2 of them). So besides
> > Latin letters I am able to read and write in Cyrillic. So here is my answer
> > for the break at 4 bytes: Тхис ис јуст нот гоинг то wорк. :)
>
> So you mean, you can break cyrillic/slavic at spaces too like Western
> languages? So then it'll work? You just break everything you know at
> spaces, and what you don't know, like Chinese, at UTF32 code points.
>

No. I did not say that. You said:

  Western languages     -> spaces
  Non-Western languages -> bytes

And Cyrillic is a NON-WESTERN language. So the rule you mentioned is wrong.
Cyrillic languages should break at spaces as well. But let's go ahead and
think about Arabic: does it break at symbol level or at space level? (I
don't know.) How about Thai? Hangul? I don't know what else...

> >> If you do that, you don't even have to treat English and Chinese
> >> separately.
> >>
> > You do. You should. Any language that has letters and not symbols is better
> > broken into words at their word boundary.
>
> I was assuming you make such big n-grams that you cover all the words
> (breaking at the spaces) too. Yeah unrealistic, but see my note right
> one line under this sentence :)
>
> >> But that's just theoretical, it has a big overhead penalty.
> >>
> >> By doing a conversion to UTF32 each time you don't have to worry about
> >> the complexity of having an extra gazillion characters, they are just
> >> another number.
> >>
> > Check. But you need to handle the different conversion methods. There is
> > no universal conversion available. You need to know the used character
> > set in order to make the conversion the proper way. Or you have a library
> > that does that for you.
>
> Those libraries exist. iconv for example to convert between 'anything'
> to UTF8.
>

I know about iconv, but I don't know whether it is available everywhere
DSPAM is used.

> There's also IBM's ITU (open source library) if you need something heavier.
>

You are writing here to an IBM Business Partner. But ITU? I have never heard
of it in relation to an open source library. I know ICU. Did you mean that?

> Alexander
>

--
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
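
A minimal sketch of the mixed rule discussed in this thread, written in
Python purely for illustration (it is not DSPAM code). It assumes the
character set of the message is already known, decodes the bytes to code
points (the "just another number" view that UTF32 gives), breaks at spaces by
default, and emits one token per code point only for scripts that are
normally written without spaces. The function names and the list of
unspaced scripts (CJK ideographs, Hiragana, Katakana, Thai) are assumptions
made for this example, not a complete or authoritative classification.

```python
import unicodedata

# Assumption for this sketch: scripts normally written without spaces
# between words, matched by the prefix of the Unicode character name.
UNSPACED_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH",
    "HIRAGANA",
    "KATAKANA",
    "THAI CHARACTER",
)


def is_unspaced_script(ch):
    """Return True if ch belongs to one of the assumed unspaced scripts."""
    return unicodedata.name(ch, "").startswith(UNSPACED_PREFIXES)


def tokenize(raw, charset):
    """Decode raw bytes with a known charset and split them into tokens.

    Text is broken at whitespace; characters from unspaced scripts become
    single-character tokens. Punctuation stays attached to its word.
    """
    text = raw.decode(charset)  # the charset must be known, as noted above
    tokens = []
    word = []
    for ch in text:  # a Python str iterates code point by code point
        if ch.isspace():
            if word:
                tokens.append("".join(word))
                word = []
        elif is_unspaced_script(ch):
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        else:
            word.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens


if __name__ == "__main__":
    print(tokenize("нот гоинг".encode("koi8_r"), "koi8_r"))
    print(tokenize("hello 中文 world".encode("utf-8"), "utf-8"))
```

With this rule the Cyrillic sample splits into the words "нот" and "гоинг",
while "hello 中文 world" yields "hello", "中", "文", "world". Real word
segmentation for Thai or Chinese needs a dictionary-based segmenter; per
code point splitting is only the fallback proposed in the thread.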