It appears that I also somehow picked up a bunch of extra documents in my
original crawl or a subsequent recrawl.

Can anyone give me an example of the prune command used in two ways (my guess at what the rules file might look like is below):

1.  delete all entries that contain a certain term;
2.  delete all entries from a certain URL.
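
I'm guessing the rules file holds Lucene queries, one per line, against the
usual index fields (content, url), but I'm not sure of the exact syntax. The
term and host below are only placeholders:

  # prune-queries.txt -- one Lucene query per line, '#' starts a comment
  # 1. drop every entry whose content contains a certain term (my guess)
  content:someterm
  # 2. drop every entry fetched from a certain URL (placeholder URL)
  url:"http://www.example.com/unwanted/page.html"

which I would then run the way Dima showed below, e.g.
./nutch prune /path/to/segments/dir prune-queries.txt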

Thanks for any help anyone can offer.

Matt

----- Original Message ----- 
From: "TDLN" <[EMAIL PROTECTED]>
To: <[email protected]>; "Dima Mazmanov" <[EMAIL PROTECTED]>
Sent: Friday, June 23, 2006 10:52 AM
Subject: Re: Deleting documents


Prune is OK to remove the docs from the index, but it will not prevent
the pages from being refetched, so you might also want to change the
regex-urlfilter (or crawl-urlfilter if you are using the crawl tool)
for that purpose.
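
For example, an exclude pattern near the top of conf/regex-urlfilter.txt
(or conf/crawl-urlfilter.txt for the crawl tool) keeps the host from being
fetched again; www.example.com below is just a placeholder for the site you
want gone:

  # skip everything from the unwanted host (placeholder host)
  -^http://www\.example\.com/
  # leave the existing accept rules below this, e.g. the final catch-all
  +.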

Rgrds, Thomas

On 6/22/06, Dima Mazmanov <[EMAIL PROTECTED]> wrote:
> Hi, Rajesh.
>
> Use the "prune" tool.
> ./nutch prune /path/to/segments/dir /path/to/file/with/rules
>
> You wrote on 21 June 2006, 20:35:34:
>
> > I would like to delete certain documents from the crawled documents
> > depending on certain criteria. Is there a way to achieve this? My guess
> > is that Nutch downloads all the files before parsing them.
>
> --
> Regards,
>  Dima                          mailto:[EMAIL PROTECTED]
>
>



