Thanks.
In this case it actually looks like I was trying to solve a problem
that doesn't exist (not an unusual occurrence in my experience):
the StandardAnalyzer does not appear to split terms at a period
unless the period is followed by white space. I was a bit misled by
an additional complication: I am using the MoreLikeThis class to
construct the query, and it seemed to be dropping the SAP.FIN term,
apparently because that term never appears in the index being
searched, only in my input queries. In fact I may decide to do some
acronym expansion of this to allow it to match things that *do*
appear in my index.
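If I do go the expansion route, it could be as simple as rewriting each acronym into its expanded terms before handing the text to MoreLikeThis. A minimal sketch (the class name, the map, and the expansions for SAP.FIN are made up for illustration; a real list would be domain-specific):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AcronymExpander {
    // Hypothetical expansion table; real entries would come from
    // whatever vocabulary actually appears in the index.
    private static final Map<String, List<String>> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("SAP.FIN", Arrays.asList("SAP", "financial", "finance"));
    }

    // Return the expanded terms for an acronym, or the term itself
    // unchanged when no expansion is known.
    public static List<String> expand(String term) {
        return EXPANSIONS.getOrDefault(term, Arrays.asList(term));
    }

    public static void main(String[] args) {
        System.out.println(expand("SAP.FIN"));
    }
}
```

The point is just to do the rewriting on the query side only, since the expanded words (unlike the raw acronym) do occur in the indexed documents.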

But your point about the StandardAnalyzer being slow is
well-taken, and I'll keep that in mind. The straightforward
substitution before indexing and searching also sounds like a
reasonable approach.
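For the record, that substitution might look something like the sketch below: map each protected term to a single-token placeholder before analysis, and apply the same mapping to queries so the placeholders match at search time. (The class name, the term map, and the "sapfin" placeholder are all illustrative, not anything from my actual index.)

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermProtector {
    // Illustrative mapping of protected terms to single-token
    // placeholders; a real version would load these from config
    // and probably handle case and word boundaries.
    private static final Map<String, String> PROTECTED = new LinkedHashMap<>();
    static {
        PROTECTED.put("SAP.FIN", "sapfin");
        PROTECTED.put("SAP.HR", "saphr");
    }

    // Apply the same substitution to document text at index time
    // and to query text at search time, so the tokens line up.
    public static String protect(String text) {
        for (Map.Entry<String, String> e : PROTECTED.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(protect("experience with SAP.FIN modules"));
    }
}
```

Since the placeholder contains no period, any tokenizer will keep it as one term, which sidesteps the whole question of how the analyzer treats embedded periods.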

Thanks-
Donna

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
[EMAIL PROTECTED]


"Erick Erickson" <[EMAIL PROTECTED]> 
08/09/2007 12:09 PM
Please respond to
java-user@lucene.apache.org


To
java-user@lucene.apache.org
cc

Subject
Re: special handling of certain terms with embedded periods

Some possibilities...
> write your own tokenizer and/or filter. If you alter your BNF,
     you'll have to maintain it in later releases.
> use some simple transformations for the input *before* tokenizing.
> there's been some discussion that StandardAnalyzer (and, I assume,
   the Standard* beasts) are slower than the other analyzers, so you
   may be better off eschewing them.

Best
Erick


On 8/9/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
>
> Is there a good way to handle the following scenario:
>
> I have certain terms with embedded periods for which I want to leave them
> intact (not split at the periods). For example in my application a
> particular skill might be SAP.FIN (SAP financial), and it should not be
> split into SAP and FIN. Is there a way to specify a list of terms such as
> these which should not be split? I am currently using my own
> "SynonymAnalyzer" for which the token stream looks like below (pretty
> standard I think) and where engine is a custom SynonymEngine where I
> provide the synonyms.
> Is there a typical way to handle this situation?
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = new SnowballFilter(
>         new SynonymFilter(
>             new StopFilter(
>                 new LowerCaseFilter(
>                     new StandardFilter(
>                         new StandardTokenizer(reader))),
>                 StandardAnalyzer.STOP_WORDS),
>             engine),
>         "English");
>     return result;
> }
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> [EMAIL PROTECTED]
>
