Re: [Nutch-general] crawl-urlfilter subpages of domains

Lukas Vlcek Mon, 14 Aug 2006 09:31:12 -0700

Hi,

I have had good experience with regex-urlfilter. I am doing the same
thing you are trying to do, and it works fine for me.


Could it be that your page is accepted by some other regular expression
that comes before the excerpt you showed us? Are you using the crawl
tool or the step-by-step crawl cycle?
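
To check which rule decides a given URL, a small simulation can help.
The following is a minimal Python sketch, not Nutch code. It assumes the
regex filter reads the rules top to bottom, that the first pattern found
anywhere in the URL decides (+ accepts, - rejects), and that a URL
matching no rule is rejected. The rule list is copied from the quoted
excerpt; the names first_match and accepts are made up for illustration,
and Python's re module stands in for Java's regex engine.

```python
import re

# Rules copied from the quoted excerpt, in file order.
# Leading '+' means accept, leading '-' means reject.
RULES = [
    r"-ftp.tu-clausthal.de",
    r"-^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/",
    r"+^http://([a-z0-9]*\.)asta.tu-clausthal.de",
    r"+^http://([a-z0-9]*\.)*tu-clausthal.de/",
    r"-.",
]


def first_match(url, rules=RULES):
    """Return (line_number, rule) for the first rule whose pattern is
    found in the URL, or None if no rule matches (assumed behaviour)."""
    for line_no, rule in enumerate(rules, start=1):
        pattern = rule[1:]  # strip the leading '+' or '-'
        if re.search(pattern, url):
            return line_no, rule
    return None


def accepts(url, rules=RULES):
    """True if the first matching rule starts with '+'; a URL that
    matches no rule is treated as rejected (assumption)."""
    hit = first_match(url, rules)
    return hit is not None and hit[1].startswith("+")


if __name__ == "__main__":
    for url in (
        "http://www.asta.tu-clausthal.de/de/mobil/",
        "http://asta.tu-clausthal.de/de/mobil/",
        "http://www.tu-clausthal.de/",
    ):
        print(url, accepts(url), first_match(url))
```

Under these assumptions, http://www.asta.tu-clausthal.de/de/mobil/ is
rejected by rule 2, so if it is still fetched, an earlier rule or a
different filter file is probably involved. The same page without a host
prefix, http://asta.tu-clausthal.de/de/mobil/, matches neither rule 2 nor
rule 3, because ([a-z0-9]*\.) requires one label before "asta"; it falls
through to rule 4 and is accepted.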

On the other hand, it seems that the Nutch wizards prefer the pre/post
(prefix/suffix) URL filters over regex, probably for performance reasons.

Regards,
Lukas

On 8/12/06, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> You can use a suffix filter if there are no query strings.
>
> Dennis
>
> Jens Martin Schubert wrote:
> > Hello,
> >
> > is it possible to crawl e.g. http://www.domain.com,
> > but to skip crawling all URLs matching
> > (http://www.domain.com/subpage/)?
> >
> > I tried to achieve this with crawl-urlfilter.txt/regex-urlfilter.txt,
> > but it doesn't work:
> >
> > -ftp.tu-clausthal.de
> > -^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/
> > +^http://([a-z0-9]*\.)asta.tu-clausthal.de
> > +^http://([a-z0-9]*\.)*tu-clausthal.de/
> > # skip everything else
> > -.
> >
> > Skipping ftp.tu-clausthal.de works perfectly,
> > but http://www.asta.tu-clausthal.de/de/mobil/ is still indexed, which
> > takes a long time to crawl.
> >
> > regards,
> > Jens Martin Schubert
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
