Hi, I have had good experience with regex-urlfilter. I am doing the same thing you are trying to do, and it works fine for me.
Could it be that your page is accepted by some other regex rule that comes before the excerpt you showed us? Are you using the crawl tool or the step-by-step crawl cycle?

On the other hand, it seems that the Nutch wizards prefer the prefix/suffix URL filters over the regex one (probably for performance reasons).

Regards,
Lukas

On 8/12/06, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> You can use a suffix filter if there are no query strings.
>
> Dennis
>
> Jens Martin Schubert wrote:
> > Hello,
> >
> > is it possible to crawl e.g. http://www.domain.com,
> > but to skip crawling all urls matching to
> > (http://www.domain.com/subpage/)
> >
> > I tried to achieve this with crawl-urlfilter.txt/regex-urlfilter.txt.
> > but it doesn't work:
> >
> > -ftp.tu-clausthal.de
> > -^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/
> > +^http://([a-z0-9]*\.)asta.tu-clausthal.de
> > +^http://([a-z0-9]*\.)*tu-clausthal.de/
> > # skip everything else
> > -.
> >
> > skipping ftp.tu-clausthal.de works perfect,
> > but http://www.asta.tu-clausthal.de/de/mobil/ is still indexed, which
> > takes a long time to crawl.
> >
> > regards,
> > Jens Martin Schubert
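
One way to check whether an earlier rule accepts a URL before the intended exclusion is reached is to replay the rule list outside of Nutch. The following is only a minimal Python sketch, not Nutch code. It assumes that rules are tried top to bottom, that the first matching rule decides, and that a pattern may match anywhere in the URL unless it is anchored with `^`. The names `parse_rules` and `first_match` are made up for this illustration; the rules are copied from the quoted filter file.

```python
import re

# Rules copied from the quoted filter file, in their original order.
RULES_TEXT = r"""
-ftp.tu-clausthal.de
-^http://([a-z0-9]*\.)asta.tu-clausthal.de/de/mobil/
+^http://([a-z0-9]*\.)asta.tu-clausthal.de
+^http://([a-z0-9]*\.)*tu-clausthal.de/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def first_match(rules, url):
    """Return (accepted, rule_index) for the first rule that matches url.

    Assumption: first match wins; a URL that matches no rule is rejected.
    """
    for index, (accept, pattern) in enumerate(rules):
        if pattern.search(url):  # unanchored search unless the rule uses '^'
            return accept, index
    return False, None


rules = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://www.asta.tu-clausthal.de/de/mobil/",
        "http://asta.tu-clausthal.de/de/mobil/",
        "ftp://ftp.tu-clausthal.de/pub/",
    ):
        print(url, first_match(rules, url))
```

Under these assumptions the `www.asta` URL is rejected by the second rule, so the excerpt itself would not explain why that page is still fetched. The same path without the `www.` prefix is accepted by the fourth rule, because the second and third rules lack the `*` after the subdomain group and therefore require at least one subdomain label.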
