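Hi Sriram,

In regex, . matches any single character, and following . with a * repeats it zero or more times. That is, .* in combination is a wildcard match. In your rules, Special:* means "Special" followed by zero or more ':' characters; the * applies to the colon, not to whatever comes after it.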
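To see the difference concretely, here is a small sketch using Python's re module rather than Nutch itself (the two engines agree on these constructs). The url_tail value is just an example taken from one of your URLs.

```python
import re

# Example tail of one of the URLs from the fetcher log.
url_tail = "Special:Watchlist"

# "Special:*" is "Special" followed by zero or more ':' characters.
colon_star = re.match(r"Special:*", url_tail)

# "Special:.*" is "Special:" followed by zero or more of any character.
dot_star = re.match(r"Special:.*", url_tail)

print(colon_star.group(0))  # Special:
print(dot_star.group(0))    # Special:Watchlist
```

The first pattern stops right after the colon, while the second consumes the rest of the string.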
So modifying your regex to:

-^https://wiki.mydomain.com/index.php/Special:.*

should fix the problem. Note the https in that rule: your Special: rules start with http://, but the URLs in your fetcher log are https://, so those rules cannot match them.

- Ravi Chintakunta

On 3/22/07, SriramG <[EMAIL PROTECTED]> wrote:
>
> I am trying to crawl a wikipedia site.
>
> I want to skip any url which has the term Special:
>
> Eg:
> https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
> https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
> https://wiki.mydomain.com/index.php/Special:Watchlist
> https://wiki.mydomain.com/index.php/Special:Contributions/SName
> https://wiki.mydomain.com/index.php/Special:Recentchanges
>
> This is my crawl-urlfilter.txt
> -^http://wiki.mydomain.com/index.php/Special:
> -^http://wiki.mydomain.com/index.php/Special:*
> -^http://wiki.mydomain.com/index.php/Special:*/
> -^http://wiki.mydomain.com/index.php/Special:*/*
> -^https://wiki.mydomain.com/index.php/Special:Upload
> +^https://wiki.mydomain.com/index.php
> -.
>
> But I still see these URLs in the fetcher logs.
>
> 2007-03-22 12:52:15,387 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php
> 2007-03-22 12:52:32,128 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Telecom
> 2007-03-22 12:52:32,159 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Contributions/SName
> 2007-03-22 12:52:32,159 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Watchlist
> 2007-03-22 12:52:32,179 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Preferences
> 2007-03-22 12:52:32,198 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchanges
> 2007-03-22 12:52:32,322 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Talk:Main_Page
> 2007-03-22 12:52:32,323 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
> 2007-03-22 12:52:32,326 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/BCP
> 2007-03-22 12:52:32,339 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
> 2007-03-22 12:52:32,343 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Network_Engineering
>
> Not sure what's wrong in my regular expression.
>
> Any help please.
>
> --
> View this message in context:
> http://www.nabble.com/Need-Help-with-crawl-urlfilter.txt-tf3450339.html#a9623983
> Sent from the Nutch - User mailing list archive at Nabble.com.
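P.S. If you want to try rules against the logged URLs before re-crawling, something like the following Python sketch may help. It is not Nutch code: the accepts function is a hypothetical helper, and it assumes the filter takes the first rule whose pattern is found anywhere in the URL, with + meaning include and - meaning skip. The rule lists are shortened from the crawl-urlfilter.txt quoted above.

```python
import re

def accepts(rules, url):
    """Hypothetical helper: apply the first rule whose pattern is found in url."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is skipped

# Shortened version of the rules from the original message (http:// scheme).
original = [
    "-^http://wiki.mydomain.com/index.php/Special:",
    "+^https://wiki.mydomain.com/index.php",
    "-.",
]

# Same rules with the scheme changed to https:// and .* as the wildcard.
fixed = [
    "-^https://wiki.mydomain.com/index.php/Special:.*",
    "+^https://wiki.mydomain.com/index.php",
    "-.",
]

special = "https://wiki.mydomain.com/index.php/Special:Watchlist"
normal = "https://wiki.mydomain.com/index.php/Telecom"

print(accepts(original, special))  # True: the http:// rule never matches an https:// URL
print(accepts(fixed, special))     # False: the Special: page is now skipped
print(accepts(fixed, normal))      # True: ordinary pages are still fetched
```

With the original list, the Special: URL falls through to the +^https rule and gets fetched, which is consistent with what the fetcher log shows.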
