An easier way to do this (after some digging) is to use:
bin/nutch org.apache.nutch.tools.PruneIndexTool
You would first need to stop the DistributedSearch$Server, run the tool
(it has a dryrun mode as well), and then restart the server. The prune
tool will need to be run on each part-xxxxx within a single shard.
Another, more brute-force way to do this, if your indexes are in the
form part-00000, is to delete an entire part-xxxxx, which removes every
document in that part.
Be aware that this will not stop URLs from coming back when content is
reindexed; it only removes them from the current index.
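As a rough sketch of the per-part loop, the shell function below prints
one PruneIndexTool command for each part-xxxxx directory under a shard,
so the list can be reviewed before anything is run. The function name
prune_commands and the shard layout are hypothetical, and it assumes the
tool takes the index directory as its first argument. Any options (dry
run, query file and so on) are passed through untouched and should be
taken from the tool's own usage output rather than from this sketch.

```shell
#!/bin/sh
# Sketch only: prints the prune commands, it does not run them.
# Stop the DistributedSearch$Server before running the printed commands
# and restart it afterwards.

# prune_commands SHARD_DIR [EXTRA_ARGS...]
# SHARD_DIR is assumed to contain part-xxxxx index directories.
# EXTRA_ARGS are appended verbatim to every command.
prune_commands() {
  shard_dir=$1
  shift
  extra="$*"
  for part in "$shard_dir"/part-*; do
    # Skip the unexpanded glob and anything that is not a directory.
    [ -d "$part" ] || continue
    echo "bin/nutch org.apache.nutch.tools.PruneIndexTool $part${extra:+ $extra}"
  done
}
```

Calling prune_commands /path/to/shard prints one line per part
directory; piping that output to sh would run the commands in order.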
Dennis
John Mendenhall wrote:
Anybody know how to delete an index document in a distributed search
server? Is that even possible?
I will assume that by index document you mean
a document that has already been indexed.
If not, delete and forget.
When we need to remove a document, we filter it
out using the following procedure:
1. build temporary nutch configuration directory
   build special filter files based on document(s) to be filtered out
   point NUTCH_CONF_DIR env var to temporary nutch configuration directory
2. run bin/nutch mergedb $NEWCRAWLDBDIR $CRAWLDBDIR -filter
3. run bin/nutch mergesegs $NEWSEGMENTSDIR -dir $SEGMENTSDIR -filter
4. run bin/nutch mergelinkdb $NEWLINKDBDIR $LINKDBDIR -filter
5. run standard set to rebuild index:
   bin/nutch index $NEWINDEXESDIR $CRAWLDBDIR $LINKDBDIR $NEWSEGLIST
   bin/nutch dedup $NEWINDEXESDIR
   bin/nutch merge -workingdir $NUTCHTMPDIR $NEWINDEXDIR $NEWINDEXESDIR
The variable names should be self-explanatory. If not,
just let me know.
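The numbered steps above could be scripted roughly as follows. This is
only a sketch: the function names write_exclude_filter and
filtered_rebuild are hypothetical, and the filter file format assumes
the regex URL filter plugin is enabled and reads regex-urlfilter.txt
from NUTCH_CONF_DIR (lines starting with - exclude, + accept). The
bin/nutch commands and variable names are copied from the steps as
written.

```shell
#!/bin/sh
# Sketch only. Command used to invoke Nutch; set NUTCH=echo to preview
# the commands instead of running them.
NUTCH=${NUTCH:-bin/nutch}

# write_exclude_filter FILE URL...
# Step 1: write a filter file that rejects the given URLs and accepts
# everything else (assumed regex-urlfilter.txt format).
write_exclude_filter() {
  filter_file=$1
  shift
  : > "$filter_file"
  for url in "$@"; do
    # Escape regex metacharacters so the URL is matched literally.
    escaped=$(printf '%s\n' "$url" | sed 's/[][\.*^$+?(){}|]/\\&/g')
    printf '%s\n' "-^${escaped}\$" >> "$filter_file"
  done
  # Accept every URL that was not excluded above.
  printf '%s\n' '+.' >> "$filter_file"
}

# filtered_rebuild
# Steps 2-5, using the variables named in the steps. NUTCH_CONF_DIR must
# already point at the temporary configuration directory.
filtered_rebuild() {
  $NUTCH mergedb "$NEWCRAWLDBDIR" "$CRAWLDBDIR" -filter || return 1
  $NUTCH mergesegs "$NEWSEGMENTSDIR" -dir "$SEGMENTSDIR" -filter || return 1
  $NUTCH mergelinkdb "$NEWLINKDBDIR" "$LINKDBDIR" -filter || return 1
  # NEWSEGLIST is left unquoted on purpose: it may name several segments.
  $NUTCH index "$NEWINDEXESDIR" "$CRAWLDBDIR" "$LINKDBDIR" $NEWSEGLIST || return 1
  $NUTCH dedup "$NEWINDEXESDIR" || return 1
  $NUTCH merge -workingdir "$NUTCHTMPDIR" "$NEWINDEXDIR" "$NEWINDEXESDIR"
}
```

Running it once with NUTCH=echo is a cheap way to check that every
variable expands to the directory you expect before touching real data.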
JohnM