/**
 * CJKTokenizer was modified from StopTokenizer, which does a decent job for most
 * European languages. It uses a different tokenization method for double-byte
 * characters: a token is returned for every two characters, with overlapping matches.
 * Example: "java C1C2C3C4" will be segmented into: "java" "C1C2" "C2C3" "C3C4".
 * It also needs to filter out zero-length tokens ("").
 *
 * For more information on Asian-language (Chinese, Japanese, Korean) text segmentation:
 * http://www.google.com/search?q=overlap+match+chinese+segment
 *
 * For digits: a leading digit is split off as its own token: "3dmax" => "3" "dmax"; "U2" => "u2".
 * For punctuation: '_' is tokenized as a letter; '+' and '#' are tokenized as digits.
 *
 * @author Che, Dong [EMAIL PROTECTED]
 * @version $Id$
 */
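The overlap rule described in the comment can be illustrated with a minimal standalone sketch. This is not the attached CJKTokenizer and does not use the Lucene API: the class name `OverlapBigramSketch`, the helper `isCjk`, and the choice of Unicode blocks are assumptions made for illustration only. It covers just the two-character overlap and the lowercasing of Latin tokens, and leaves out the digit and punctuation rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class, not part of Lucene.
class OverlapBigramSketch {

    // Assumption: these Unicode blocks stand in for "double-byte characters".
    static boolean isCjk(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                // Collect the whole CJK run, then emit overlapping pairs.
                int start = i;
                while (i < n && isCjk(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    // Assumption: a lone CJK character becomes its own token.
                    tokens.add(text.substring(start, i));
                }
                for (int j = start; j + 2 <= i; j++) {
                    tokens.add(text.substring(j, j + 2));
                }
            } else if (Character.isLetterOrDigit(c)) {
                // Non-CJK letters and digits form one lowercased token.
                int start = i;
                while (i < n
                        && Character.isLetterOrDigit(text.charAt(i))
                        && !isCjk(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i).toLowerCase(Locale.ROOT));
            } else {
                // Whitespace and other separators produce no token,
                // so no zero-length tokens are ever emitted.
                i++;
            }
        }
        return tokens;
    }
}
```

Calling `OverlapBigramSketch.tokenize` on the string "java" followed by a space and four CJK characters C1C2C3C4 yields "java", "C1C2", "C2C3", "C3C4", matching the example in the comment. Because separators are skipped rather than emitted, the zero-length-token filter mentioned above is not needed in this sketch.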
CJKTokenizer.java