: a) index the documents by wrapping the whitespace analyzer with
: NGramAnalyzerWrapper and then retrieving only the words which have 3 or
: more characters and start with a capital, filtering the "garbage" manually.
:
: b) creating my own analyzer which will only index ngrams that start with
: capital letters and then retrieving the indexed words.
: how would i go about creating my own analyzer? (i've read Lucene in Action
: and it wasn't much help :s)

Start by writing yourself a "NamedEntityTokenFilter" ... look at the
StopFilter to give yourself an idea of what it should look like ... whenever
someone calls next() on your filter, keep calling next() on whatever
TokenStream you've got until you get something you consider a "named
entity", and then return it.

An Analyzer is any class which takes in a Reader and outputs Tokens ...
typically they are really simple and just delegate the hard work to a
Tokenizer and 0 or more TokenFilters ... if you look at the source code for
the tokenStream method of most Analyzers in Lucene, you'll see it can be
really easy to write one by reusing an existing Tokenizer (it sounds like
you want to tokenize on whitespace) and your new TokenFilter.

-Hoss
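To make the pattern concrete, here is a minimal, dependency-free sketch of the StopFilter-style loop described above. The TokenStream interface, WhitespaceTokenizer, and NamedEntityTokenFilter below are stand-ins I've defined for illustration (the real classes live in org.apache.lucene.analysis and deal in Token objects rather than plain Strings); the "named entity" rule (3+ characters, starts with a capital) is taken from option a) in the question.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;

// Stand-in for the classic Lucene TokenStream contract: next() returns
// the next token, or null when the stream is exhausted.
interface TokenStream {
    String next() throws IOException;
}

// Splits input on whitespace, in the spirit of Lucene's WhitespaceTokenizer.
class WhitespaceTokenizer implements TokenStream {
    private final Iterator<String> it;
    WhitespaceTokenizer(String text) {
        it = Arrays.asList(text.trim().split("\\s+")).iterator();
    }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// The filter Hoss describes: whenever someone calls next() on us, keep
// calling next() on the wrapped stream until we see something that looks
// like a "named entity", then return it.
class NamedEntityTokenFilter implements TokenStream {
    private final TokenStream input;
    NamedEntityTokenFilter(TokenStream input) { this.input = input; }
    public String next() throws IOException {
        for (String t = input.next(); t != null; t = input.next()) {
            if (t.length() >= 3 && Character.isUpperCase(t.charAt(0))) {
                return t;   // candidate named entity
            }
        }
        return null;        // wrapped stream exhausted
    }
}

public class NamedEntityDemo {
    public static void main(String[] args) throws IOException {
        TokenStream ts = new NamedEntityTokenFilter(
            new WhitespaceTokenizer("the Quick brown Fox met Mr Smith"));
        for (String t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t);  // prints Quick, Fox, Smith
        }
    }
}
```

In real Lucene, the matching Analyzer would just be the delegation Hoss mentions: its tokenStream(field, reader) method returns new NamedEntityTokenFilter(new WhitespaceTokenizer(reader)).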