Re: Deleting an index document in nutch

Dennis Kubes Thu, 07 Feb 2008 19:50:50 -0800

An easier way to do this (after some digging) is to use:


bin/nutch org.apache.nutch.tools.PruneIndexTool

You would first need to stop the DistributedSearch$Server, run the tool, which has a dryrun mode as well, then restart the server. Another more brute force way to do this, if your indexes are in the form part-00000, is to delete an entire part-xxxxx. The prune tool will need to be run on each part-xxxxx within a single shard.
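As a rough sketch of the per-part run, the helper below prints one PruneIndexTool command for each part-xxxxx directory under an indexes directory, so the list can be reviewed before anything is executed. The function name prune_parts, the NUTCH variable, and the directory layout are assumptions for illustration. The -dryrun spelling in the usage comment is also an assumption; check the tool's own usage message for the exact flag.

```shell
#!/bin/sh
# Sketch only: print the PruneIndexTool command for every part-xxxxx
# index directory in one shard. Nothing is executed here.
# Assumptions (not from the original post): the helper name, the NUTCH
# variable, and that the tool takes the index directory as its first argument.

NUTCH="${NUTCH:-bin/nutch}"

# prune_parts INDEXES_DIR [extra tool arguments...]
prune_parts() {
    indexes_dir="$1"
    shift
    for part in "$indexes_dir"/part-*; do
        # Skip anything that is not a directory (including an unmatched glob).
        [ -d "$part" ] || continue
        echo "$NUTCH" org.apache.nutch.tools.PruneIndexTool "$part" "$@"
    done
}

# Example (hypothetical path; flag spelling assumed):
#   prune_parts /data/shard1/indexes -dryrun
```

Once the printed commands look right, and with the search server stopped, the output could be piped to sh to run them one part at a time.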

Be aware that this will not stop URLs from coming back when content is reindexed; it will only remove them from the current index.


Dennis

John Mendenhall wrote:

Anybody know how to delete an index document in a distributed search server? Is that even possible?


I will assume by index document, you are
referring to a document that has been indexed.
If not, delete and forget.

When we need to remove a document, we go through
the process of filtering out the document by
using the following procedure:

1. build temporary nutch configuration directory
     build special filter files based on document(s) to be filtered out
     point NUTCH_CONF_DIR env var to temporary nutch configuration directory
2. run bin/nutch mergedb $NEWCRAWLDBDIR $CRAWLDBDIR -filter
3. run bin/nutch mergesegs $NEWSEGMENTSDIR -dir $SEGMENTSDIR -filter
4. run bin/nutch mergelinkdb $NEWLINKDBDIR $LINKDBDIR -filter
5. run standard set to rebuild index:
     bin/nutch index $NEWINDEXESDIR $CRAWLDBDIR $LINKDBDIR $NEWSEGLIST
     bin/nutch dedup $NEWINDEXESDIR
     bin/nutch merge -workingdir $NUTCHTMPDIR $NEWINDEXDIR $NEWINDEXESDIR

The variable names should be self-explanatory.  If not,
just let me know.
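Step 1 is the least obvious part of the procedure, so here is a minimal sketch of one way it might look. It assumes the regex URL filter plugin is the active filter and that it reads regex-urlfilter.txt from the configuration directory; the post does not say which filter files were actually used. The function name make_filter_conf and the example paths are hypothetical. The sketch replaces the filter file with one exclusion rule per URL followed by a catch-all accept rule, rather than keeping the existing rules.

```shell
#!/bin/sh
# Sketch only: build a temporary Nutch conf directory whose
# regex-urlfilter.txt rejects the given URLs and accepts everything else.
# Assumptions (not from the original post): regex-urlfilter.txt is the
# filter file in use, and replacing its contents is acceptable.

# make_filter_conf SRC_CONF_DIR DEST_CONF_DIR URL...
make_filter_conf() {
    src_conf="$1"
    dest_conf="$2"
    shift 2
    mkdir -p "$dest_conf" || return 1
    # Copy the existing configuration so all other settings stay the same.
    cp -R "$src_conf"/. "$dest_conf"/ || return 1
    filter_file="$dest_conf/regex-urlfilter.txt"
    : > "$filter_file"
    for url in "$@"; do
        # Escape regex metacharacters so the URL is matched literally.
        escaped=$(printf '%s\n' "$url" | sed 's/[][\\.^$*+?(){}|]/\\&/g')
        # A leading '-' marks an exclusion rule; anchor both ends.
        printf '%s\n' "-^${escaped}\$" >> "$filter_file"
    done
    # Accept every URL not excluded above.
    printf '%s\n' '+.' >> "$filter_file"
}

# Example (hypothetical paths and URL):
#   make_filter_conf conf /tmp/nutch-filter-conf 'http://example.com/old.html'
#   export NUTCH_CONF_DIR=/tmp/nutch-filter-conf
```

With NUTCH_CONF_DIR pointing at the new directory, steps 2 through 5 would then be run as listed above.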

JohnM
