An easier way to do this (after some digging) is to use:

bin/nutch org.apache.nutch.tools.PruneIndexTool

You would first need to stop the DistributedSearch$Server, run the tool (which also has a dry-run mode), and then restart the server. The prune tool needs to be run on each part-xxxxx within a single shard. Another, more brute-force way, if your indexes are in the form part-00000, is to delete an entire part-xxxxx.
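As a rough sketch of the "run it on each part-xxxxx" step, something like the following could work. The function name prune_shard, the NUTCH variable, and the shard directory layout are my own placeholders, not anything Nutch defines; any extra arguments are passed straight through to the tool.

```shell
#!/bin/sh
# Sketch only: run PruneIndexTool once per part-xxxxx directory in a shard.
# NUTCH is a hypothetical override so the launcher path can be changed.
NUTCH="${NUTCH:-bin/nutch}"

prune_shard() {
    # $1: directory that holds part-00000, part-00001, ... (assumed layout)
    shard_indexes="$1"
    shift
    for part in "$shard_indexes"/part-*; do
        # Skip anything that is not a directory (including an unmatched glob).
        [ -d "$part" ] || continue
        # Remaining arguments are handed to the tool unchanged.
        $NUTCH org.apache.nutch.tools.PruneIndexTool "$part" "$@" || return 1
    done
}
```

With the search server stopped, this would be called as prune_shard /path/to/shard/indexes, optionally followed by the tool's dry-run option first. I believe that option is spelled -dryrun, but treat that as an assumption and check the usage message the tool prints when run with no arguments.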

Be aware that this will not stop URLs from coming back when content is reindexed; it only removes them from the current index.

Dennis

John Mendenhall wrote:
Anybody know how to delete an index document in a distributed search server? Is that even possible?

I will assume by index document, you are
referring to a document that has been indexed.
If not, delete and forget.

When we need to remove a document, we filter it
out using the following procedure:

1. build temporary nutch configuration directory
     build special filter files based on document(s) to be filtered out
     point NUTCH_CONF_DIR env var to temporary nutch configuration directory
2. run bin/nutch mergedb $NEWCRAWLDBDIR $CRAWLDBDIR -filter
3. run bin/nutch mergesegs $NEWSEGMENTSDIR -dir $SEGMENTSDIR -filter
4. run bin/nutch mergelinkdb $NEWLINKDBDIR $LINKDBDIR -filter
5. run the standard set of commands to rebuild the index:
     bin/nutch index $NEWINDEXESDIR $CRAWLDBDIR $LINKDBDIR $NEWSEGLIST
     bin/nutch dedup $NEWINDEXESDIR
     bin/nutch merge -workingdir $NUTCHTMPDIR $NEWINDEXDIR $NEWINDEXESDIR

The variable names should be self-explanatory.  If not,
just let me know.
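
The steps above could be strung together roughly as follows. This is only a sketch: it assumes the temporary configuration directory and its filter files from step 1 already exist, TMPCONFDIR and the NUTCH override are names made up for this example, and the commands themselves are copied as posted.

```shell
#!/bin/sh
# Sketch only: steps 1-5 as one function. All *DIR variables must be set
# by the caller; TMPCONFDIR is a hypothetical name for the step 1 directory.
NUTCH="${NUTCH:-bin/nutch}"

rebuild_filtered() {
    # Step 1 (last part): point Nutch at the temporary configuration.
    NUTCH_CONF_DIR="$TMPCONFDIR"
    export NUTCH_CONF_DIR

    # Steps 2-4: rewrite crawldb, segments and linkdb through the filters.
    $NUTCH mergedb "$NEWCRAWLDBDIR" "$CRAWLDBDIR" -filter || return 1
    $NUTCH mergesegs "$NEWSEGMENTSDIR" -dir "$SEGMENTSDIR" -filter || return 1
    $NUTCH mergelinkdb "$NEWLINKDBDIR" "$LINKDBDIR" -filter || return 1

    # Step 5: rebuild the index. NEWSEGLIST is left unquoted on purpose
    # so that a space-separated list of segments expands to several arguments.
    $NUTCH index "$NEWINDEXESDIR" "$CRAWLDBDIR" "$LINKDBDIR" $NEWSEGLIST || return 1
    $NUTCH dedup "$NEWINDEXESDIR" || return 1
    $NUTCH merge -workingdir "$NUTCHTMPDIR" "$NEWINDEXDIR" "$NEWINDEXESDIR"
}
```

Each command stops the run on failure, so a broken filter file does not lead to an index being built from half-merged data.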

JohnM
