Some possibilities...

> write your own tokenizer and/or filter. If you alter your BNF,
  you'll have to maintain it in later releases.
> use some simple transformations for the input *before* tokenizing.
> there's been some discussion that StandardAnalyzer (and, I assume,
  the Standard* beasts) are slower than the other analyzers, so you
  may be better off eschewing them.
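
To make the "transform the input before tokenizing" option concrete, here is a minimal sketch in plain Java. Nothing in it is Lucene API: the class name ProtectedTermMasker, the mask method and the "xdotx" placeholder are all made up for illustration. It assumes the tokenizer leaves a run of letters alone, which you should check against whatever tokenizer you end up with.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: rewrites the periods inside a
// fixed list of protected terms so that a downstream tokenizer has no
// period to split on.
final class ProtectedTermMasker {
    // Assumed placeholder; choose something your tokenizer keeps inside a
    // single token and that cannot occur in real text.
    static final String PLACEHOLDER = "xdotx";

    private final Pattern pattern;

    ProtectedTermMasker(List<String> protectedTerms) {
        StringBuilder alternation = new StringBuilder();
        for (String term : protectedTerms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Quote each term so its period is matched literally.
            alternation.append(Pattern.quote(term));
        }
        // Match whole terms only: not preceded by a letter, digit or period,
        // and not followed by a letter or digit. Case is ignored.
        pattern = Pattern.compile(
                "(?<![A-Za-z0-9.])(?:" + alternation + ")(?![A-Za-z0-9])",
                Pattern.CASE_INSENSITIVE);
    }

    // Returns the text with the periods of every protected term replaced
    // by PLACEHOLDER; all other text is copied through unchanged.
    String mask(String text) {
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String masked = m.group().replace(".", PLACEHOLDER);
            m.appendReplacement(out, Matcher.quoteReplacement(masked));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With the single protected term "SAP.FIN", mask("Needs SAP.FIN skills") gives "Needs SAPxdotxFIN skills", while "SAP.FINANCE" is left alone. One way to wire this in would be to read the Reader fully inside tokenStream, mask the text, and hand a StringReader over the result to the tokenizer. The same masking has to be applied to query text as well, and a stemmer further down the chain may still alter the masked token, so treat this as a starting point only.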

Best
Erick

On 8/9/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
>
> Is there a good way to handle the following scenario:
>
> I have certain terms with embedded periods for which I want to leave them
> intact (not split at the periods). For
> example in my application a particular skill might be SAP.FIN (SAP
> financial), and it should not be split into
> SAP and FIN. Is there a way to specify a list of terms such as these which
> should not be split? I am
> currently using my own "SynonymAnalyzer" for which the token stream looks
> like below
> (pretty standard I think) and where engine is a custom SynonymEngine
> where I provide the synonyms.
> Is there a typical way to handle this situation?
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>
>     TokenStream result = new SnowballFilter(
>         new SynonymFilter(
>             new StopFilter(
>                 new LowerCaseFilter(
>                     new StandardFilter(
>                         new StandardTokenizer(reader))),
>                 StandardAnalyzer.STOP_WORDS),
>             engine), "English"
>     );
>     return result;
> }
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> [EMAIL PROTECTED]
>