[ https://issues.apache.org/jira/browse/LUCENE-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jim Ferenczi resolved LUCENE-8325.
----------------------------------
       Resolution: Fixed
    Fix Version/s: master (8.0)
                   7.4

I merged in master and backported to 7x. Thanks [~chengpohi] and [~rcmuir] for reviewing.

> smartcn analyzer can't handle SURROGATE char
> --------------------------------------------
>
>                 Key: LUCENE-8325
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8325
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: chengpohi
>            Priority: Minor
>              Labels: newbie, patch
>             Fix For: 7.4, master (8.0)
>
>         Attachments: handle_surrogate_char_for_smartcn_2018-05-23.patch
>
>
> This issue is from [https://github.com/elastic/elasticsearch/issues/30739]
> The smartcn analyzer can't handle surrogate chars. Example:
>
> {code:java}
> Analyzer ca = new SmartChineseAnalyzer();
> String sentence = "\uD862\uDE0F"; // 𨨏, a character encoded as a surrogate pair
> TokenStream tokenStream = ca.tokenStream("", sentence);
> CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while (tokenStream.incrementToken()) {
>     String term = charTermAttribute.toString();
>     System.out.println(term);
> }
> {code}
>
> The above code snippet outputs:
>
> {code:java}
> ?
> ?
> {code}
>
> I have created a *PATCH* to try to fix this; please help review (since
> *smartcn* only supports *GBK* chars, the patch simply handles the surrogate
> pair as a *single char*).
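For readers following the report: the two "?" lines are consistent with the surrogate pair being split into two separate Java chars, each of which is an unpaired surrogate that cannot be printed on its own. The sketch below is not the attached patch and uses no Lucene code; it only illustrates, with plain JDK calls, the general idea of walking a string by code point so that a surrogate pair stays together as one unit. The class and method names (CodePointSplitter, splitByCodePoint) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or of the attached patch:
// splits text into one string per Unicode code point.
class CodePointSplitter {

    static List<String> splitByCodePoint(String text) {
        List<String> units = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // codePointAt combines a high and low surrogate into one code point
            int codePoint = text.codePointAt(i);
            // charCount is 2 for supplementary code points, otherwise 1
            int width = Character.charCount(codePoint);
            units.add(text.substring(i, i + width));
            i += width;
        }
        return units;
    }

    public static void main(String[] args) {
        String sentence = "\uD862\uDE0F"; // the character from the report
        // Two UTF-16 code units, but a single code point
        System.out.println("chars: " + sentence.length());
        System.out.println("code points: " + sentence.codePointCount(0, sentence.length()));
        System.out.println("units: " + splitByCodePoint(sentence).size());
    }
}
```

Iterating with charAt(i) and i++ instead would yield the two halves separately, which matches the symptom shown in the report.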