[ https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721489#comment-15721489 ]
peina commented on LUCENE-7509:
-------------------------------

Thanks. Makes sense to me.

> [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7509
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>         Environment: Mac OS X 10.10
>            Reporter: peina
>              Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when Chinese punctuation marks are appended.
> For example, 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
> But 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠.
> A similar case occurs when numbers are appended to the text, e.g.
> 生活报8月4号 --> 生活|报|8|月|4|号
> 生活报 --> 生活报
>
> Test Sample:
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> // The wrapper class name is illustrative; the original sample showed only the two methods.
> public class SmartcnTokenizationSample {
>   public static void main(String[] args) throws IOException {
>     Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */
>     System.out.println("Sample1=======");
>     String sentence = "生活报8月4号";
>     printTokens(analyzer, sentence);
>     sentence = "生活报";
>     printTokens(analyzer, sentence);
>     System.out.println("Sample2=======");
>
>     sentence = "碧绿的眼珠,";
>     printTokens(analyzer, sentence);
>     sentence = "碧绿的眼珠";
>     printTokens(analyzer, sentence);
>
>     analyzer.close();
>   }
>
>   // Prints each token the analyzer produces for the sentence, one per line.
>   private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
>     System.out.println("sentence:" + sentence);
>     TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
>     tokens.reset();
>     CharTermAttribute termAttr = tokens.getAttribute(CharTermAttribute.class);
>     while (tokens.incrementToken()) {
>       System.out.println(termAttr.toString());
>     }
>     tokens.end(); // finish the stream before closing, per the TokenStream contract
>     tokens.close();
>   }
> }
>
> Output:
> Sample1=======
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2=======
> sentence:碧绿的眼珠,
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)