On Wed, 18 Nov 2009 22:18:40 +0100
Alexander Prinsier <aphe...@mailhaven.com> wrote:

> On 11/18/2009 09:53 PM, Stevan Bajić wrote:
> >> Then do what you used to do for Western languages: tokenize using spaces
> >> as separators. For other languages split every 4 bytes.
> >>
> > Not going to work. You see my name? It's Slavic. And I am able to write in 
> > other languages (9 of them) and in other scripts (2 of them). So besides 
> > Latin letters I am able to read and write in Cyrillic. So here is my answer 
> > for the break at 4 bytes: Тхис ис јуст нот гоинг то wорк. :)
> 
> So you mean, you can break Cyrillic/Slavic at spaces too, like Western 
> languages? So then it'll work? You just break everything you know at 
> spaces, and what you don't know, like Chinese, at UTF-32 code points.
> 
No. I did not say that. You said:
Western languages -> spaces
Non-Western languages -> bytes

And Cyrillic is a NON-WESTERN script. So the rule you mentioned is wrong. 
Languages written in Cyrillic should break at spaces as well.
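
To make the rule concrete, here is a minimal sketch in Python. It is not DSPAM code. The function names and the choice of code point ranges are my own assumptions for illustration: whitespace separates words in any script, and only characters in the CJK Unified Ideographs blocks become one token per code point.

```python
def is_ideograph(ch):
    """Assumption: only CJK Unified Ideographs (plus Extension A) count as
    'symbols' that are tokenized one code point at a time."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def tokenize(text):
    """Split at whitespace for any script; emit each ideograph on its own."""
    tokens = []
    word = []
    for ch in text:
        if ch.isspace():
            # Whitespace ends the current word, whatever the script.
            if word:
                tokens.append("".join(word))
                word = []
        elif is_ideograph(ch):
            # An ideograph ends the current word and is a token by itself.
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        else:
            word.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens


cyrillic_tokens = tokenize("Тхис ис јуст нот гоинг то wорк.")
mixed_tokens = tokenize("hello 中文 world")
```

With this sketch the Cyrillic sentence stays whole words, while the two Chinese characters become two tokens. Punctuation stays attached to the neighbouring word, and scripts such as Thai that do not put spaces between words are not handled at all.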

But let's go ahead and think about Arabic: Do they break at symbol level or at 
space level? (I don't know).

How about Thai? Hangul? I don't know what else...


> >> If you do that, you don't even have to treat English and Chinese
> >> separately.
> >>
> > You do. You should. Any language that has letters and not symbols is better 
> > broken into words at its word boundaries.
> 
> I was assuming you make such big n-grams that you cover all the words 
> (breaking at the spaces) too. Yeah, unrealistic, but see my note one 
> line below this sentence :)
> 
> >> But that's just theoretical, it has a big overhead penalty.
> >>
> >> By doing a conversion to UTF-32 each time you don't have to worry about
> >> the complexity of having an extra gazillion characters, they are just
> >> another number.
> >>
> > Check. But you need to handle the different conversion methods. There is 
> > no universal conversion available. You need to know the character set 
> > used in order to do the conversion properly. Or you have a library 
> > that does that for you.
> 
> Those libraries exist. iconv for example to convert from 'anything' 
> to UTF-8.
> 
I know about iconv, but I don't know whether it is available everywhere 
DSPAM is used.
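
To illustrate why the character set has to be known, here is a small sketch. It uses Python's built-in codecs instead of iconv, so it is only a stand-in for what an iconv-based conversion would do, and the function name is made up for this example. The same bytes give different code points depending on which charset is assumed.

```python
def to_code_points(raw, charset):
    """Decode raw bytes with the given charset and return the Unicode code
    points, i.e. the numbers a UTF-32 representation would hold."""
    return [ord(ch) for ch in raw.decode(charset)]


# The word "Тхис" as it would arrive in a KOI8-R encoded message.
raw = "Тхис".encode("koi8-r")

# Decoded with the correct charset: the Cyrillic code points come back.
right = to_code_points(raw, "koi8-r")

# Decoded with a wrong guess (Latin-1): no error is raised, but the
# resulting numbers are different, so the tokens would be wrong too.
wrong = to_code_points(raw, "latin-1")
```

Here `right` holds the four Cyrillic code points and `wrong` holds four unrelated Latin-1 values, without any error being raised. So a wrong charset guess silently produces different tokens.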


> There's also IBM's ITU (open source library) if you need something heavier.
> 
You are writing here to an IBM Business Partner. But ITU? I have never heard 
of it as an open source library. I know ICU. Did you mean that?


> Alexander
> 
-- 
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
