Thanks. In this case it actually looks like I was trying to solve a problem that doesn't exist (not an unusual occurrence in my experience), since the StandardAnalyzer does not appear to split the terms when the period is not followed by white space. I was a bit misled by the additional complication that I am using the MoreLikeThis class to construct the query, and it seemed to be dropping the SAP.FIN term, apparently because it never actually appears in the index being searched, only in my input queries. In fact I may decide to do some acronym expansion of this to allow it to match things that *do* appear in my index.
But your point about the StandardAnalyzer being slow is well taken, and I'll keep that in mind. Also, the straightforward substitution before indexing and searching is a reasonable approach to keep in mind.

Thanks-
Donna

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
[EMAIL PROTECTED]

"Erick Erickson" <[EMAIL PROTECTED]>
08/09/2007 12:09 PM
Please respond to java-user@lucene.apache.org
To java-user@lucene.apache.org
cc
Subject Re: special handling of certain terms with embedded periods

Some possibilities...

> write your own tokenizer and/or filter. If you alter your BNF, you'll have to maintain it in later releases.
> use some simple transformations for the input *before* tokenizing.
> there's been some discussion that StandardAnalyzer (and, I assume, the Standard* beasts) are slower than the other analyzers, so you may be better off eschewing them.

Best
Erick

On 8/9/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
>
> Is there a good way to handle the following scenario:
>
> I have certain terms with embedded periods for which I want to leave them
> intact (not split at the periods). For example in my application a
> particular skill might be SAP.FIN (SAP financial), and it should not be
> split into SAP and FIN. Is there a way to specify a list of terms such as
> these which should not be split? I am currently using my own
> "SynonymAnalyzer" for which the token stream looks like below (pretty
> standard I think) and where engine is a custom SynonymEngine where I
> provide the synonyms.
> Is there a typical way to handle this situation?
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>
>     TokenStream result = new SnowballFilter(
>         new SynonymFilter(
>             new StopFilter(
>                 new LowerCaseFilter(
>                     new StandardFilter(
>                         new StandardTokenizer(reader))),
>                 StandardAnalyzer.STOP_WORDS),
>             engine),
>         "English");
>     return result;
> }
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> [EMAIL PROTECTED]
>
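
For reference, here is a minimal sketch of the "simple transformations for the input *before* tokenizing" idea from this thread, combined with the acronym expansion mentioned in the reply. It is plain Java with no Lucene dependency. TermExpander and its add/expand methods are hypothetical names made up for illustration, not part of Lucene, and the only mapping shown (SAP.FIN to "SAP financial") is taken from the example in the original question. Whether case-insensitive matching is wanted is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene class): rewrites listed terms in raw text
// before that text is handed to an analyzer for indexing or querying.
class TermExpander {
    // Insertion-ordered so expansions are applied in the order they were added.
    private final Map<String, String> expansions = new LinkedHashMap<>();

    // Registers one term and the text it should be replaced with.
    public TermExpander add(String term, String expansion) {
        expansions.put(term, expansion);
        return this;
    }

    // Returns the text with every registered term replaced by its expansion.
    public String expand(String text) {
        String result = text;
        for (Map.Entry<String, String> e : expansions.entrySet()) {
            // Pattern.quote keeps the embedded period literal; the lookarounds
            // reject matches that sit inside a longer run of letters or digits.
            Pattern p = Pattern.compile(
                "(?<![\\p{L}\\p{N}])" + Pattern.quote(e.getKey()) + "(?![\\p{L}\\p{N}])",
                Pattern.CASE_INSENSITIVE);
            result = p.matcher(result)
                      .replaceAll(Matcher.quoteReplacement(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // The mapping below is the example from the question, not real data.
        TermExpander expander = new TermExpander().add("SAP.FIN", "SAP financial");
        System.out.println(expander.expand("Skills: SAP.FIN, Java"));
    }
}
```

The same expand call would need to be applied both to the text being indexed and to the text fed to the query side (MoreLikeThis in this thread), so that both see the same terms.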