It appears that I also somehow picked up a bunch of extra documents in my original crawl or a subsequent recrawl.
Can anyone give me an example of the prune command used in two ways:

1. delete all entries that contain a certain term
2. delete all entries from a certain URL

Thanks for any help anyone can offer.

Matt

----- Original Message -----
From: "TDLN" <[EMAIL PROTECTED]>
To: <[email protected]>; "Dima Mazmanov" <[EMAIL PROTECTED]>
Sent: Friday, June 23, 2006 10:52 AM
Subject: Re: Deleting documents

Prune is fine for removing the docs from the index, but it will not prevent the pages from being refetched, so you might also want to change the regex-urlfilter (or crawl-urlfilter if you are using the crawl tool) for that purpose.

Rgrds,
Thomas

On 6/22/06, Dima Mazmanov <[EMAIL PROTECTED]> wrote:
> Hi, Rajesh.
>
> Use the "prune" tool:
> ./nutch prune /path/to/segments/dir /path/to/file/with/rules
>
> You wrote on 21 June 2006, 20:35:34:
>
> > I would like to delete certain documents from the crawled documents
> > depending on a certain criteria. Is there a way to achieve this? My
> > guess is, nutch downloads all the files before parsing it.
>
> --
> Regards,
> Dima mailto:[EMAIL PROTECTED]

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
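
[A sketch of how Matt's two cases might look. This assumes the prune tool's rules file takes one Lucene query per line, and that `content` and `url` are the indexed field names in your schema; the term "widget", the example URL, and the file paths are all illustrative placeholders, so verify them against your Nutch version before running anything.]

```shell
# Write a rules file with one Lucene query per line.
# "content"/"url" field names and the values are assumptions -- adjust to your index.
cat > prune-rules.txt <<'EOF'
content:widget
url:"http://example.com/page.html"
EOF

# Run the prune tool against the segments directory, as in Dima's message:
./nutch prune /path/to/segments/dir prune-rules.txt

# As Thomas notes, pruning does not stop refetching. To keep the pages out of
# future crawls, also add an exclusion to conf/regex-urlfilter.txt
# (or crawl-urlfilter.txt when using the crawl tool), e.g.:
# -^http://example\.com/
```

The first query would match case 1 (all documents containing a term), the second case 2 (documents from one exact URL); a prefix-style query on `url` could cover a whole site, if your version's query parser supports it.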
