Jerome Lanneluc created LUCENE-5564:
---------------------------------------
Summary: Currency characters are not tokenized
Key: LUCENE-5564
URL: https://issues.apache.org/jira/browse/LUCENE-5564
Project: Lucene - Core
Issue Type: Bug
Components: core/index
Affects Versions: 3.6.2
Reporter: Jerome Lanneluc
It is not possible to have the SmartChineseAnalyzer (or the StandardAnalyzer)
include currency characters (e.g. $ or €) in the token stream.

For example, the program below outputs "100 200". I would expect a way to
configure the analyzers so that it outputs "100$ 200€" instead.
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class Test {
    public static void main(String[] args) throws Exception {
        // Same behaviour with: new StandardAnalyzer(Version.LUCENE_36)
        Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_36);
        TokenStream stream = analyzer.tokenStream(null, new StringReader("100$ 200€"));
        while (stream.incrementToken()) {
            CharTermAttribute attr = stream.getAttribute(CharTermAttribute.class);
            // Prints "100 200 " -- the currency characters are dropped
            System.out.print(new String(attr.buffer(), 0, attr.length()));
            System.out.print(' ');
        }
    }
}
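
A possible workaround until the analyzers can be configured is to tokenize on
whitespace only, so the currency character stays attached to the number. This is
just a sketch assuming Lucene 3.6 (the class name CurrencyWorkaround is made up),
and it obviously loses the Chinese word segmentation that SmartChineseAnalyzer
provides:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CurrencyWorkaround {
    public static void main(String[] args) throws Exception {
        // WhitespaceAnalyzer splits only on whitespace, so "100$" and "200€"
        // come through as whole tokens (no Chinese segmentation, though).
        Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
        TokenStream stream = analyzer.tokenStream(null, new StringReader("100$ 200€"));
        CharTermAttribute attr = stream.getAttribute(CharTermAttribute.class);
        while (stream.incrementToken()) {
            // Prints "100$ 200€ "
            System.out.print(new String(attr.buffer(), 0, attr.length()));
            System.out.print(' ');
        }
    }
}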