Kevin Atkinson wrote:
> Samphan Raruenrom wrote:
> > In Thai, we don't put spaces between words at all, so
> > the same situation happens naturally.
> > Typical Thai word-segmentation algorithms (which usually
> > do spell checking as well) use a maximal-match backtracking
> > algorithm with trie word list(s).
> > My implementation is at http://www.thai.net/libinthai/
> > IBM Classes for Unicode implementation is at
> > http://www.ibm.com/java/education/boundaries/boundaries.html
> OK, so how do you detect the boundaries of unknown or misspelled words?

The IBM ICU algorithm, as described at the above URL, is:
: If we exhausted our possibilities without finding 
: a valid sequence of words, it either means there's
: an error in the text, or the text includes a word 
: that isn't in the dictionary. In either case, we restore
: the set of break positions that matched the most 
: characters, advance one character past where the
: mismatch occurred in that sequence, and start over 
: from there. This works pretty well: usually only
: one or two boundary positions around the error 
: are in the wrong place.


---
Note: This message was originally posted to [EMAIL PROTECTED],
      not [EMAIL PROTECTED]


_______________________________________________
aspell-user mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/mailman/listinfo/aspell-user
