Hi,

> > What you intend to do is not a "stopword" use case. You want to "ignore"
> > some words - Lucene has no support for this, because in native language
> > processing this makes no sense.
>
> Thank you for the information. I was unaware that ignoring some words
> "makes no sense". I thought I gave a reasonable example of exactly this
> situation in the native processing of Tibetan. Perhaps I am still not
> understanding.

Elisions are a bit different from stopwords (although I don't know how they work in Tibetan). The Tokenizer should *not* split elisions from the terms: initially the term is the full word, including the elision. In most languages the two parts are separated by, for example, an apostrophe (French: le + arbre → l’arbre). The Tokenizer keeps those parts together (l’arbre). A later TokenFilter then edits the token and removes the elision if needed, leaving arbre. This is how the French analyzer in Lucene works.

Lucene currently does not have a Tibetan analyzer, so you have to build your own (I think this is what you tried to do). You should choose the Tokenizer carefully and add something like a TibetanElisionFilter that removes the unwanted parts from the tokens.

Uwe
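
P.S. To make the two-stage idea concrete, here is a rough sketch in plain Java. It deliberately does not use the Lucene API: in a real analyzer the first stage would be a Tokenizer and the second a TokenFilter subclass (Lucene ships an ElisionFilter for French). The class name ElisionSketch, its method names, and the article set are made up for illustration, and it uses the French example because I do not know the Tibetan rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustration only: plain Java, not Lucene classes. All names are hypothetical.
class ElisionSketch {
    // Straight and typographic apostrophes that may follow an elided article.
    private static final String APOSTROPHES = "'\u2019";

    // Stage 1 (the "Tokenizer"): split on whitespace only,
    // so a token such as l'arbre stays in one piece.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stage 2 (the "TokenFilter"): if the part before the first apostrophe
    // is a known elided article, drop it together with the apostrophe.
    static String stripElision(String token, Set<String> articles) {
        for (int i = 0; i < token.length(); i++) {
            if (APOSTROPHES.indexOf(token.charAt(i)) >= 0) {
                String prefix = token.substring(0, i).toLowerCase();
                boolean hasRemainder = i + 1 < token.length();
                if (articles.contains(prefix) && hasRemainder) {
                    return token.substring(i + 1);
                }
                return token; // unknown prefix: leave the token untouched
            }
        }
        return token; // no apostrophe at all
    }

    // Chain both stages, the way an Analyzer chains a Tokenizer and filters.
    static List<String> analyze(String text, Set<String> articles) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            out.add(stripElision(token, articles));
        }
        return out;
    }
}
```

With the article set {"l", "d", "qu"}, analyze("l’arbre est grand", ...) gives [arbre, est, grand]. The point is only the order of the steps: the tokenizer never sees "l" and "arbre" as separate terms, so there is nothing for a stopword list to do, and the filter decides per token what to strip. A Tibetan version would replace the apostrophe test with whatever marks the part you want to drop.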