On Wed, 18 Nov 2009 21:20:58 +0100 Alexander Prinsier <aphe...@mailhaven.com> wrote:
> Hello,
>
Hello Alexander,

> I'm separating the discussion about handling non-Western languages here.
>
> One solution, which is what is used by for example XML parsers, and
> other kinds of software which want to do the right thing (tm) at all
> costs, is:
>
> Read in the message, using its encoding type. HTML, XML, but also email
> have headers that specify what the encoding type is. Then convert each
> character into its UTF-32 code point.
>
Check.

> So each character you
> read, be it English or Chinese, will take up 4 bytes. (yeah that has
> some cpu and memory impact)
>
Check.

> Then do what you used to do for Western languages: tokenize using spaces
> as separators. For other languages split every 4 bytes.
>
Not going to work. You see my name? It's Slavic. And I am able to write in other languages (9 of them) and in other scripts (2 of them). So besides Latin letters I am able to read and write in Cyrillic. So here is my answer to the break at 4 bytes:

Тхис ис јуст нот гоинг то wорк. :)

> If you don't care about cpu speed or complexity, you could configure the
> tokenizer to not only split every 4 bytes, but every 4, 8, 12, 16, etc
> bytes.
>
Check. Everything past 4 characters (4 x 4 bytes) is pointless (according to the documents I have read on how to break Asian languages). Every tokenizer other than SBPH and OSB would profit from producing more n-grams. SBPH and OSB would not profit much from more n-grams.

> If you do that, you don't even have to treat English and Chinese
> separately.
>
You do. You should. Any language that is written with letters and not symbols is better broken into words at its word boundaries.

> But that's just theoretical, it has a big overhead penalty.
>
> By doing a conversion to UTF-32 each time you don't have to worry about
> the complexity of having an extra gazillion characters, they are just
> another number.
>
Check. But you need to handle the different conversion methods. There is no universal conversion available.
You need to know which character set was used in order to do the conversion properly. Or you have a library that does that for you.

> For dspam, it shouldn't matter, a character is just a
> 32-bit code point, you don't have to interpret it, let a UTF library do
> that when you need it.
>
The character is not my problem. The separator is. What separates characters/symbols from words? That is the point of the whole discussion.

> Alexander
>
--
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
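
To illustrate the separator point discussed in this message, here is a minimal sketch in Python (chosen for brevity; it is not dspam code). It decodes the raw bytes with the character set declared in the headers, then splits alphabetic text (Latin, Cyrillic) at word boundaries and emits character n-grams only for runs of ideographic characters. The helper names decode_message, is_ideographic and tokenize are hypothetical, and the script check is a deliberately rough assumption covering only CJK ideographs, Hiragana and Katakana.

```python
import unicodedata


def decode_message(raw: bytes, charset: str) -> str:
    """Decode raw bytes using the charset declared in the message headers.

    Hypothetical helper: the caller is assumed to have parsed the charset
    from the headers already. Undecodable bytes are replaced, not dropped.
    """
    return raw.decode(charset, errors="replace")


def is_ideographic(ch: str) -> bool:
    """Rough assumption: only these scripts are written without spaces."""
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA"))


def tokenize(text: str, max_n: int = 2) -> list:
    """Split alphabetic text on separators; n-gram ideographic runs."""
    tokens = []
    word = []  # characters of the current separator-delimited word
    run = []   # current run of ideographic characters

    def flush_word():
        if word:
            tokens.append("".join(word))
            word.clear()

    def flush_run():
        # Emit every character n-gram of length 1..max_n over the run.
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                tokens.append("".join(run[i:i + n]))
        run.clear()

    for ch in text:
        if is_ideographic(ch):      # must be tested before isalnum()
            flush_word()
            run.append(ch)
        elif ch.isalnum():
            flush_run()
            word.append(ch)
        else:                       # whitespace or punctuation separates
            flush_word()
            flush_run()
    flush_word()
    flush_run()
    return tokens
```

With this sketch, tokenize("Тхис ис јуст нот гоинг то wорк. :)") yields the seven words of the sentence, while tokenize("中文垃圾 mail") yields the four single characters, then the bigrams 中文, 文垃 and 垃圾, then "mail". Decoding with the wrong charset shows why the declared encoding matters: bytes produced by "нот".encode("koi8-r") only come back as "нот" when decoded as koi8-r.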