Re: [Nutch-general] crawl-urlfilter subpages of domains

Dennis Kubes Sat, 12 Aug 2006 08:10:39 -0700

You can use a suffix filter if there are no query strings.

Dennis


Jens Martin Schubert wrote:
> Hello,
>
> Is it possible to crawl e.g. http://www.domain.com,
> but skip all URLs under
> http://www.domain.com/subpage/ ?
>
> I tried to achieve this with crawl-urlfilter.txt/regex-urlfilter.txt,
> but it doesn't work:
>
> -ftp.tu-clausthal.de
> -^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/
> +^http://([a-z0-9]*\.)asta.tu-clausthal.de
> +^http://([a-z0-9]*\.)*tu-clausthal.de/
> # skip everything else
> -.
>
> Skipping ftp.tu-clausthal.de works perfectly,
> but http://www.asta.tu-clausthal.de/de/mobil/ is still crawled and
> indexed, which takes a long time.
>
> regards,
> Jens Martin Schubert
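
One way to sanity-check rules like the ones quoted above is to replay them
outside the crawler. The following is a minimal Python sketch, not Nutch code.
It assumes that the regex URL filter reads the rules top to bottom, that the
first pattern found anywhere in the URL decides the outcome (+ keeps the URL,
- skips it), and that a URL matching no rule is skipped. The function name
`accepts` and the test URLs are made up for illustration.

```python
import re

# Rules copied from the quoted message, in file order: (sign, pattern).
RULES = [
    ("-", r"ftp.tu-clausthal.de"),
    ("-", r"^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/"),
    ("+", r"^http://([a-z0-9]*\.)asta.tu-clausthal.de"),
    ("+", r"^http://([a-z0-9]*\.)*tu-clausthal.de/"),
    ("-", r"."),  # skip everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: first match wins, and a URL that matches no rule is skipped.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Hypothetical test URLs built from the hosts named in the question.
    for url in (
        "http://www.asta.tu-clausthal.de/de/mobil/",
        "http://asta.tu-clausthal.de/de/mobil/",
        "http://www.tu-clausthal.de/",
        "http://ftp.tu-clausthal.de/pub/",
    ):
        print("keep" if accepts(url) else "skip", url)
```

Under these assumptions, http://www.asta.tu-clausthal.de/de/mobil/ is skipped
by the second rule. The same path without a host prefix,
http://asta.tu-clausthal.de/de/mobil/, is kept: ([a-z0-9]*\.) has no trailing
*, so rules two and three require exactly one prefix label, and the URL falls
through to the fourth rule. Whether that accounts for the behaviour described
in the question cannot be told from the message alone.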
