Is there a good way to handle the following scenario?

Some of my terms contain embedded periods, and I want them left intact
rather than split at the periods. For example, in my application a
particular skill might be SAP.FIN (SAP financial), and it should not be
split into SAP and FIN. Is there a way to specify a list of such terms
that should not be split, or a typical way to handle this situation?

I am currently using my own "SynonymAnalyzer", where engine is a custom
SynonymEngine through which I provide the synonyms. Its token stream is
pretty standard, I think, and looks like this:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new SnowballFilter(
            new SynonymFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader))),
                    StandardAnalyzer.STOP_WORDS),
                engine),
            "English");
        return result;
    }

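
To make the desired behavior concrete, here is a minimal sketch in plain
Java. It is only an illustration under assumptions: the class name
ProtectedTermSplitter and its split method are hypothetical, it uses no
Lucene classes, and it does not reproduce StandardTokenizer's grammar. It
only shows the idea of checking each candidate term against a list of
protected terms before splitting it at punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: splits text into lower-cased
// tokens but keeps any term found in the protected set in one piece.
final class ProtectedTermSplitter {

    private ProtectedTermSplitter() {}

    // protectedTerms is assumed to hold lower-cased entries, e.g. "sap.fin".
    static List<String> split(String text, Set<String> protectedTerms) {
        List<String> tokens = new ArrayList<String>();
        for (String chunk : text.trim().split("\\s+")) {
            // Strip punctuation at either end, so "SAP.FIN," becomes "sap.fin".
            String core = chunk.toLowerCase(Locale.ROOT)
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (core.isEmpty()) {
                continue;
            }
            if (protectedTerms.contains(core)) {
                // Protected term: emit it whole, embedded periods included.
                tokens.add(core);
            } else {
                // Anything else is split at every non-alphanumeric character.
                for (String part : core.split("[^\\p{L}\\p{N}]+")) {
                    if (!part.isEmpty()) {
                        tokens.add(part);
                    }
                }
            }
        }
        return tokens;
    }
}
```

With a protected set containing only "sap.fin", the input
"Experience with SAP.FIN, SAP.HR and Java." would keep sap.fin as a
single token while sap.hr is still split into sap and hr. In a real
analyzer the same lookup would presumably have to live in a custom
Tokenizer or TokenFilter placed ahead of the synonym and stemming
filters.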
Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh