chengpohi created LUCENE-8325:
---------------------------------

             Summary: smartcn analyzer can't handle SURROGATE char
                 Key: LUCENE-8325
                 URL: https://issues.apache.org/jira/browse/LUCENE-8325
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: chengpohi
         Attachments: handle-surrogate-char-for-smartcn.patch

This issue is from [smartcn_tokenizer 
...](https://github.com/elastic/elasticsearch/issues/30739)

The smartcn analyzer can't handle surrogate characters, i.e. supplementary code points that Java strings encode in UTF-16 as a surrogate pair. Example:

 

 
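{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Analyzer ca = new SmartChineseAnalyzer();
String sentence = "\uD862\uDE0F"; // 𨨏 (U+28A0F), encoded in UTF-16 as a surrogate pair
TokenStream tokenStream = ca.tokenStream("", sentence);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
    String term = charTermAttribute.toString();
    System.out.println(term); // print each emitted token on its own line
}
tokenStream.end();   // finish the stream, per the TokenStream workflow
tokenStream.close();
{code}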
 

The above code snippet outputs two tokens, neither of which is a valid character, instead of the single character 𨨏:

 
{code:java}
? 
? 
{code}
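
For illustration only, the following plain-JDK sketch shows how this kind of output can arise when a string is processed one UTF-16 code unit at a time, and how iterating by code point keeps the pair together. It is not the smartcn code and not the attached patch; the class and method names (SurrogateDemo, splitByChar, splitByCodePoint) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateDemo {
    // Naive split: one piece per UTF-16 code unit.
    // A surrogate pair is torn into two unpaired halves.
    public static List<String> splitByChar(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(String.valueOf(s.charAt(i)));
        }
        return out;
    }

    // Code-point-aware split: advance by Character.charCount(cp),
    // so a surrogate pair stays together as one piece.
    public static List<String> splitByCodePoint(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "\uD862\uDE0F"; // same input as in the report
        System.out.println(splitByChar(sentence).size());      // prints 2
        System.out.println(splitByCodePoint(sentence).size()); // prints 1
        System.out.println(Integer.toHexString(sentence.codePointAt(0))); // prints 28a0f
    }
}
```

An unpaired surrogate has no valid encoding in most output charsets, so printing each half separately typically shows a replacement character such as "?".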
 

 



