On Wednesday 13 September 2006 15:41, Miles Efron wrote: > This question surely shows how new I am to Lucene... but I'm interested > in removing terms from a lucene index. In particular, I'd like to be > able to delete all terms that appear in fewer than x documents (say > x=3). This is in efforts to reduce the feature set for some research > I'm doing. > > I found a post to this effect on the list from a while back: > http://www.gossamer-threads.com/lists/lucene/java-user/9538#9538 > but I couldn't find any responses to it. > > The only thing I can think of is to re-index the collection, using the > undesired words as a sort of stoplist. But surely there's a better way > to do it (the inverted index structure seems like this should be > natural). Any pointers would be most helpful.
You only need to reindex the documents containing the terms to be removed. That may not be really straightforward to program, but it could be worthwhile depending how often you expect this to happen. In case you already have an incremental indexer that adds documents that are not already in the index, you can use that after deleting all docs containing your terms, iirc IndexReader has a method to delete all docs containing a term. Regards, Paul Elschot --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]