On Wednesday 13 September 2006 15:41, Miles Efron wrote:
> This question surely shows how new I am to Lucene... but I'm interested 
> in removing terms from a lucene index.  In particular, I'd like to be 
> able to delete all terms that appear in fewer than x documents (say 
> x=3).  This is in efforts to reduce the feature set for some research 
> I'm doing.
> 
> I found a post to this effect on the list from a while back:
>     http://www.gossamer-threads.com/lists/lucene/java-user/9538#9538
> but I couldn't find any responses to it.
> 
> The only thing I can think of is to re-index the collection, using the 
> undesired words as a sort of stoplist.  But surely there's a better way 
> to do it (the inverted index structure seems like this should be 
> natural).  Any pointers would be most helpful.

You only need to reindex the documents containing the terms to
be removed. That may not be really straightforward to program,
but it could be worthwhile depending how often you expect this to
happen.
In case you already have an incremental indexer that adds
documents that are not already in the index, you can use that after
deleting all docs containing your terms, iirc IndexReader has a
method to delete all docs containing a term.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to