It appears that I also somehow picked up a bunch of extra documents in my
original crawl or subsequent recrawl.
Can anyone give me an example of the prune command used in two ways:
1. delete all entries that contain a certain term;
2. delete all entries from a certain URL (a sketch of both follows below).
Thanks for any help anyone can offer.
Matt
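A minimal sketch of both cases, assuming the rules file passed to the
prune command (see Dima's message quoted below) holds one Lucene query
per line and that every index entry matching a query is deleted; the
field names and values here are illustrative, not confirmed against any
particular Nutch version.

For case 1 (entries containing a certain term), the rules file would
hold something like:

  content:someterm

For case 2 (entries from a certain URL):

  url:"http://www.example.com/unwanted-page.html"

Running bin/nutch prune with no arguments should print the exact usage
for your version.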
----- Original Message -----
From: "TDLN" <[EMAIL PROTECTED]>
To: <[email protected]>; "Dima Mazmanov" <[EMAIL PROTECTED]>
Sent: Friday, June 23, 2006 10:52 AM
Subject: Re: Deleting documents
Prune is OK to remove the docs from the index, but it will not prevent
the pages from being refetched, so you might also want to change the
regex-urlfilter (or crawl-urlfilter if you are using the crawl tool)
for that purpose.
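For instance, an entry like the following near the top of
conf/regex-urlfilter.txt should stop an unwanted host from being
refetched (the host is illustrative; rules are applied top to bottom
and the first match wins, so the '-' line must come before the stock
'+.' catch-all at the end of the file):

  # reject everything from the unwanted host
  -^http://www\.example\.com/
  # stock catch-all that accepts everything else
  +.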
Rgrds, Thomas
On 6/22/06, Dima Mazmanov <[EMAIL PROTECTED]> wrote:
Hi, Rajesh.
Use "prune" tool.
./nutch prune /path/to/segments/dir /path/to/file/with/rules
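The second argument appears to be a plain-text file of Lucene queries,
one per line, with every matching document pruned from the index; for
example (assuming the index carries the usual site field), a file
containing just

  site:example.com

should drop everything indexed from that host.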
You wrote on 21 June 2006 at 20:35:34:
> I would like to delete certain documents from the crawled documents
> depending on certain criteria. Is there a way to achieve this? My
> guess is that Nutch downloads all the files before parsing them.
--
Regards,
Dima mailto:[EMAIL PROTECTED]