Re: crawl-urlfilter subpages of domains

Lukas Vlcek Mon, 14 Aug 2006 09:30:54 -0700

Hi,

I have had good experiences with regex-urlfilter. I am doing the same
thing you are trying to do, and it works fine for me.


Could it be that your page is accepted by some other regular expression
that comes before the excerpt you showed us? Also, are you using the
crawl tool or the step-by-step crawl cycle?
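
To illustrate why rule order matters, here is a minimal Python sketch of
first-match-wins filtering. It is not Nutch code: the names RULES,
parse_rules and accepts are made up for illustration, and it assumes that
a pattern may match anywhere in the URL unless anchored with ^, and that
a URL matching no rule is rejected. The rules are copied from the quoted
excerpt.

```python
import re

# Rules copied from the quoted excerpt: '+' accepts, '-' rejects.
RULES = r"""
-ftp.tu-clausthal.de
-^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/
+^http://([a-z0-9]*\.)asta.tu-clausthal.de
+^http://([a-z0-9]*\.)*tu-clausthal.de/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter lines into an ordered list of (accept, compiled_regex)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides the outcome."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is skipped


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.asta.tu-clausthal.de/de/mobil/",
        "http://asta.tu-clausthal.de/de/mobil/",
        "http://www.asta.tu-clausthal.de/",
        "http://ftp.tu-clausthal.de/pub/",
    ):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

Under these assumptions the excerpt on its own does reject the www.
variant of the /de/mobil/ URL, so a broader + rule placed earlier in the
file would be one way for it to get through. Note also that the
([a-z0-9]*\.) group without a trailing * requires exactly one host label
before "asta", so in this sketch the same path without the "www." prefix
falls through to the last + rule and is accepted.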

On the other hand, it seems that the Nutch wizards prefer the
prefix/suffix URL filters over regex (probably for performance reasons).

Regards,
Lukas

On 8/12/06, Dennis Kubes <[EMAIL PROTECTED]> wrote:

You can use a suffix filter if there are no query strings.

Dennis

Jens Martin Schubert wrote:
> Hello,
>
> is it possible to crawl e.g. http://www.domain.com,
> but to skip crawling all URLs matching
> (http://www.domain.com/subpage/)?
>
> I tried to achieve this with crawl-urlfilter.txt/regex-urlfilter.txt.
> but it doesn't work:
>
> -ftp.tu-clausthal.de
> -^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/
> +^http://([a-z0-9]*\.)asta.tu-clausthal.de
> +^http://([a-z0-9]*\.)*tu-clausthal.de/
> # skip everything else
> -.
>
> skipping ftp.tu-clausthal.de works perfectly,
> but http://www.asta.tu-clausthal.de/de/mobil/ is still indexed, which
> takes a long time to crawl.
>
> regards,
> Jens Martin Schubert
