Jerome Lanneluc created LUCENE-5564:
---------------------------------------
Summary: Currency characters are not tokenized
Key: LUCENE-5564
URL: https://issues.apache.org/jira/browse/LUCENE-5564
Project: Lucene - Core
Issue Type: Bug
Components: core/index
Affects Versions: 3.6.2
Reporter: Jerome Lanneluc
It is not possible to have the SmartChineseAnalyzer (or the StandardAnalyzer)
include currency characters (e.g. $ or €) in the token stream.

For example, the program below outputs "100 200". I would expect a way to
configure the analyzers so that it outputs "100$ 200€" instead.
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class Test {
    public static void main(String[] args) throws Exception {
        // Same behaviour with: new StandardAnalyzer(Version.LUCENE_36)
        Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_36);
        TokenStream stream = analyzer.tokenStream(null, new StringReader("100$ 200€"));
        while (stream.incrementToken()) {
            CharTermAttribute attr = stream.getAttribute(CharTermAttribute.class);
            // Prints "100 200 " -- the currency characters are dropped
            System.out.print(new String(attr.buffer(), 0, attr.length()));
            System.out.print(' ');
        }
    }
}
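
A possible workaround until the analyzers can be configured is to tokenize on
whitespace only, so the currency character stays attached to the number. This is
just a sketch assuming Lucene 3.6 (the class name CurrencyWorkaround is made up),
and it obviously loses the Chinese word segmentation that SmartChineseAnalyzer
provides:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CurrencyWorkaround {
    public static void main(String[] args) throws Exception {
        // WhitespaceAnalyzer splits only on whitespace, so "100$" and "200€"
        // come through as whole tokens (no Chinese segmentation, though).
        Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
        TokenStream stream = analyzer.tokenStream(null, new StringReader("100$ 200€"));
        CharTermAttribute attr = stream.getAttribute(CharTermAttribute.class);
        while (stream.incrementToken()) {
            // Prints "100$ 200€ "
            System.out.print(new String(attr.buffer(), 0, attr.length()));
            System.out.print(' ');
        }
    }
}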