RE: What is the proper use of stop words in Lucene?

Uwe Schindler Mon, 28 Apr 2014 13:37:30 -0700

Hi,

> > What you intend to do is not a "stopword" use case. You want to "ignore"
> > some words - Lucene has no support for this, because in native language
> > processing this makes no sense.
> 
> Thank you for the information. I was unaware that ignoring some words
> "makes no sense". I thought I gave a reasonable example of exactly this
> situation in the native processing of Tibetan. Perhaps I am still not
> understanding.


Elisions are a bit different from stopwords (although I don't know how they 
work in Tibetan). The Tokenizer should *not* split elisions from the terms 
(initially, the term is the full word including the elision). In most languages 
those are separated by (for example) an apostrophe (e.g. French: le + arbre → 
l’arbre). The Tokenizer would keep those parts together (l’arbre). A later 
TokenFilter would then edit the token and remove the elision (if needed), 
leaving arbre. This is how the French Analyzer in Lucene works.
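
To make the two stages concrete, here is a rough sketch in plain Java. It does 
not use the Lucene API; the class ElisionSketch, its methods, and the short 
article list are names and data I made up for illustration, and the real French 
Analyzer's list and tokenization rules may differ. Stage 1 keeps l’arbre as one 
token, stage 2 strips the elided article from it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ElisionSketch {
    // Illustrative subset of elided French articles (an assumption, not Lucene's list).
    static final Set<String> ARTICLES =
            new HashSet<>(Arrays.asList("l", "d", "j", "m", "n", "qu"));

    // Stage 1 ("Tokenizer"): split on whitespace only, so "l'arbre" stays one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stage 2 ("TokenFilter"): if the text before the first apostrophe is a
    // known article, drop the article and the apostrophe; otherwise keep the token.
    static String stripElision(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '\'' || c == '\u2019') {
                String prefix = token.substring(0, i).toLowerCase(Locale.ROOT);
                return ARTICLES.contains(prefix) ? token.substring(i + 1) : token;
            }
        }
        return token;
    }

    // Full chain: tokenize first, then edit each token.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String t : tokenize(text)) {
            out.add(stripElision(t));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("l\u2019arbre est grand"));
    }
}
```

The point of the split is that the tokenizer never has to know which prefixes 
are removable; that knowledge lives only in the filter stage, so a word such as 
aujourd'hui passes through unchanged.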

Lucene currently does not have a Tibetan Analyzer, so you have to write your 
own (I think this is what you tried to do). You should choose the Tokenizer 
carefully and add something like a TibetanElisionFilter that removes the 
unwanted parts from the tokens.

Uwe

