On Mon, Sep 29, 2008 at 9:17 PM, sangeet <[EMAIL PROTECTED]> wrote:
>
> I'm having a hard time trying to avoid crawling a particular url.
> In regex-urlfilter.txt I added the following to ignore it.
> -^http://([a-z0-9]*\.)*bhejacry.com/forums/
>
> This url is not in the list in my urls directory. I also have
> 'db.ignore.external.links' set to 'true'.
>
> However, I still see the following during the crawl
>
> fetching http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=2774
> fetching http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=96
>
> How do I ignore these urls?

Try running

  bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

Then type your URL and press Enter. If the URL is filtered out, it is echoed back with a "-" at the beginning. (You will need the patch from NUTCH-654, or wait a couple of hours and I will commit it.)

--
Doğacan Güney
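
For a quick check of the rule itself without a Nutch install, the filter's behavior can be approximated in a few lines of Python. This is only a sketch, not Nutch's code. It assumes that the first matching rule decides, that a URL matching no rule is rejected (as the comments in the stock regex-urlfilter.txt describe), and that a rule matches if its pattern is found anywhere in the URL. The names `check` and `RULES` and the catch-all `+.` rule are illustrative and not taken from the original message.

```python
import re

# First rule is copied from the question; the leading "-" means "exclude".
RULES = [
    r"-^http://([a-z0-9]*\.)*bhejacry.com/forums/",
    r"+.",  # hypothetical catch-all accept rule, added for illustration
]


def check(url, rules=RULES):
    """Return the URL prefixed with "-" if rejected or "+" if accepted."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Assumption: a rule applies if its pattern is found in the URL,
        # and the first rule that applies decides the outcome.
        if re.search(pattern, url):
            return sign + url
    # Assumption: a URL that matches no rule is rejected.
    return "-" + url


if __name__ == "__main__":
    for url in [
        "http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=2774",
        "http://www.bhejacry.com/forums/memberlist.php?mode=viewprofile&u=96",
    ]:
        print(check(url))
```

In this sketch both forum URLs come back with a leading "-", so the pattern itself does match them. If the catch-all rule were listed before the exclude rule, the same URLs would come back with "+", which makes rule order one thing worth checking in the real file.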
