Hi,

I have had good experience with regex-urlfilter. I am doing the same thing you are trying to do, and it works fine for me.

Could it be that your page is accepted by some other regular expression that comes before the excerpt you showed us? Are you using the crawl tool or the step-by-step crawl cycle?

On the other hand, it seems that the Nutch wizards prefer the prefix/suffix URL filters over the regex filter (probably for performance reasons).

Regards,
Lukas

On 8/12/06, Dennis Kubes <[EMAIL PROTECTED]> wrote:
You can use a suffix filter if there are no query strings.

Dennis

Jens Martin Schubert wrote:
> Hello,
>
> is it possible to crawl e.g. http://www.domain.com,
> but to skip crawling all urls matching to
> (http://www.domain.com/subpage/)
>
> I tried to achieve this with crawl-urlfilter.txt/regex-urlfilter.txt.
> but it doesn't work:
>
> -ftp.tu-clausthal.de
> -^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/
> +^http://([a-z0-9]*\.)asta.tu-clausthal.de
> +^http://([a-z0-9]*\.)*tu-clausthal.de/
> # skip everything else
> -.
>
> skipping ftp.tu-clausthal.de works perfect,
> but http://www.asta.tu-clausthal.de/de/mobil/ is still indexed, which
> takes a long time to crawl.
>
> regards,
> Jens Martin Schubert
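
A quick way to check whether some other rule accepts a URL before the exclusion is reached is to replay the rules outside Nutch. The following is only a rough Python sketch, not Nutch code. It assumes the regex filter walks the rules top to bottom, lets the first rule whose pattern is found anywhere in the URL decide, and rejects the URL if nothing matches. The names RULES, first_match and accepts are made up for this illustration, and Python's re module stands in for Java's regex engine.

```python
import re

# Rules from the quoted message, in file order (the comment line is omitted).
RULES = [
    r"-ftp.tu-clausthal.de",
    r"-^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/",
    r"+^http://([a-z0-9]*\.)asta.tu-clausthal.de",
    r"+^http://([a-z0-9]*\.)*tu-clausthal.de/",
    r"-.",
]


def first_match(url, rules=RULES):
    """Return (accepted, rule) for the first rule whose pattern is found in url."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Assumption: partial-match semantics, like Matcher.find() in Java.
        if re.search(pattern, url):
            return sign == "+", rule
    # Assumption: a URL that matches no rule is rejected.
    return False, None


def accepts(url, rules=RULES):
    """Convenience wrapper that only reports the accept/reject decision."""
    return first_match(url, rules)[0]


if __name__ == "__main__":
    for url in (
        "http://www.asta.tu-clausthal.de/de/mobil/",
        "http://asta.tu-clausthal.de/de/mobil/",
        "http://www.asta.tu-clausthal.de/",
    ):
        accepted, rule = first_match(url)
        print("ACCEPT" if accepted else "REJECT", url, "<-", rule)
```

Under these assumptions the www.asta URL is rejected by the second rule, but the same path without a host prefix (http://asta.tu-clausthal.de/de/mobil/) is not covered by either asta rule, because ([a-z0-9]*\.) has no trailing * there, and it ends up accepted by the broader tu-clausthal.de rule. Moving an accept rule above the exclusion has the same effect, so the order in the actual file is worth checking.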
