Re: special handling of certain terms with embedded periods

Erick Erickson Thu, 09 Aug 2007 09:09:42 -0700

Some possibilities...
> Write your own tokenizer and/or filter. If you alter the
     StandardTokenizer grammar (the BNF), you'll have to maintain
     your changes in later releases.
> Apply some simple transformations to the input *before* tokenizing,
     e.g. rewrite SAP.FIN into a form the tokenizer won't split.
> There's been some discussion that StandardAnalyzer (and, I assume,
   the other Standard* classes) is slower than the other analyzers,
   so you may be better off avoiding them.
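
To make the second option concrete, here is a minimal sketch in plain Java
with no Lucene dependency. The class name ProtectedTermPreprocessor and the
"DOT" joiner are hypothetical, made up for illustration. The sketch assumes
the tokenizer keeps an unbroken run of letters together as one token, so
check that against the analyzer you actually use. Terms are matched
case-insensitively and only as whole words.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Rewrites protected dotted terms (e.g. SAP.FIN) before the text reaches a tokenizer. */
final class ProtectedTermPreprocessor {
    // Hypothetical joiner: letters only, on the assumption that the
    // tokenizer does not split inside a run of letters.
    private static final String JOINER = "DOT";

    // Lower-cased protected term -> its period-free replacement.
    private final Map<String, String> replacements = new LinkedHashMap<>();
    // Matches any protected term as a whole word; null when the list is empty.
    private final Pattern pattern;

    ProtectedTermPreprocessor(List<String> protectedTerms) {
        StringBuilder alternation = new StringBuilder();
        for (String term : protectedTerms) {
            replacements.put(term.toLowerCase(Locale.ROOT), term.replace(".", JOINER));
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(term)); // treat '.' literally
        }
        // The lookarounds reject matches embedded in a longer alphanumeric word.
        pattern = protectedTerms.isEmpty() ? null : Pattern.compile(
                "(?<![\\p{L}\\p{N}])(?:" + alternation + ")(?![\\p{L}\\p{N}])",
                Pattern.CASE_INSENSITIVE);
    }

    /** Returns the input with every protected term replaced by its period-free form. */
    String transform(String input) {
        if (pattern == null) {
            return input;
        }
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = replacements.get(m.group().toLowerCase(Locale.ROOT));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

You would run transform() over the text and wrap the result in a
StringReader before handing it to the analyzer. The same rewrite has to be
applied to query text as well, otherwise a search for SAP.FIN will not match
the indexed form. Later filters in the chain (the stemmer, for instance) may
still change the joined token, so that needs checking too.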


Best
Erick


On 8/9/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
>
> Is there a good way to handle the following scenario?
>
> I have certain terms with embedded periods that I want to leave intact
> (not split at the periods). For example, in my application a particular
> skill might be SAP.FIN (SAP financial), and it should not be split into
> SAP and FIN. Is there a way to specify a list of such terms that should
> not be split? I am currently using my own "SynonymAnalyzer", whose token
> stream looks like the one below (pretty standard, I think), where engine
> is a custom SynonymEngine that supplies the synonyms.
> Is there a typical way to handle this situation?
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = new SnowballFilter(
>         new SynonymFilter(
>             new StopFilter(
>                 new LowerCaseFilter(
>                     new StandardFilter(
>                         new StandardTokenizer(reader))),
>                 StandardAnalyzer.STOP_WORDS),
>             engine),
>         "English");
>     return result;
> }
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> [EMAIL PROTECTED]
>
