Wayne Xin created LUCENE-6736:
---------------------------------

             Summary: SmartChineseAnalyzer chops English tokens in a 
chinese-english mixed sentence.
                 Key: LUCENE-6736
                 URL: https://issues.apache.org/jira/browse/LUCENE-6736
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 5.1
         Environment: linux Java 1.7
            Reporter: Wayne Xin


I am new with Lucene Analyzer. The following code has predefined the sentence 
in "testStr":
                String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first 
seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese 
player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";

 The printed tokenized result is:

 女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林 first seed 同 处 1 4 区 3 号 种子 
李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 
奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul

As you can see some long English tokens such as Japanese, position and 
congratulations are cut short in the tokenization process. I hope I didn't use 
it wrong.

Test code:

        private static void testChineseTokenizer() {
                String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first 
seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese 
player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
                Analyzer analyzer = new SmartChineseAnalyzer();
        List<String> result = new ArrayList<String>();
        StringReader sr = new StringReader(testStr);

        try {
            TokenStream stream  = analyzer.tokenStream(null,sr);
            CharTermAttribute cattr = 
stream.addAttribute(CharTermAttribute.class);
            stream.reset();
           while (stream.incrementToken()) {
                String token = cattr.toString();
                result.add(token);
            }
            stream.end();
            stream.close();
            sr.close();
            analyzer.close();
            stream = null;
            for (String tok: result) {
                System.out.print(" " + tok);
            }
            System.out.println();
        }
        catch(IOException e) {
            // not thrown b/c we're using a string reader...
        }
        }




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to