I am trying to crawl a Wikipedia-style site.

I want to skip any URL that contains the term "Special:".

For example:
https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
https://wiki.mydomain.com/index.php/Special:Watchlist
https://wiki.mydomain.com/index.php/Special:Contributions/SName
https://wiki.mydomain.com/index.php/Special:Recentchanges

This is my crawl-urlfilter.txt:
-^http://wiki.mydomain.com/index.php/Special:
-^http://wiki.mydomain.com/index.php/Special:*
-^http://wiki.mydomain.com/index.php/Special:*/
-^http://wiki.mydomain.com/index.php/Special:*/*
-^https://wiki.mydomain.com/index.php/Special:Upload
+^https://wiki.mydomain.com/index.php
-.
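
For reference, here is a minimal Python sketch of how I understand these rules to be evaluated. It assumes the filter walks the file top to bottom, lets the first matching rule decide, and searches for each regex anywhere in the URL (the `^` anchors it to the start). The helper name `first_match` is hypothetical and not part of Nutch, and Python's `re` only approximates Java's regex engine, although these patterns behave the same in both.

```python
import re

# Rules copied verbatim from the crawl-urlfilter.txt above, in file order.
# "-" means reject, "+" means accept.
RULES = [
    ("-", r"^http://wiki.mydomain.com/index.php/Special:"),
    ("-", r"^http://wiki.mydomain.com/index.php/Special:*"),
    ("-", r"^http://wiki.mydomain.com/index.php/Special:*/"),
    ("-", r"^http://wiki.mydomain.com/index.php/Special:*/*"),
    ("-", r"^https://wiki.mydomain.com/index.php/Special:Upload"),
    ("+", r"^https://wiki.mydomain.com/index.php"),
    ("-", r"."),
]


def first_match(url, rules=RULES):
    """Return (sign, regex) for the first rule found in url, or None.

    Assumption: first-match-wins, unanchored search semantics.
    """
    for sign, regex in rules:
        if re.search(regex, url):
            return sign, regex
    return None


if __name__ == "__main__":
    for url in (
        "https://wiki.mydomain.com/index.php/Special:Watchlist",
        "https://wiki.mydomain.com/index.php/Special:Upload",
        "https://wiki.mydomain.com/index.php/Telecom",
    ):
        print(url, "->", first_match(url))
```

Under these assumptions, the four `http://` rules cannot match an `https://` URL, so for `https://wiki.mydomain.com/index.php/Special:Watchlist` the first rule that matches is the `+^https://wiki.mydomain.com/index.php` line.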

But I still see the Special: pages being fetched in the fetcher logs:

2007-03-22 12:52:15,387 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php
2007-03-22 12:52:32,128 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Telecom
2007-03-22 12:52:32,159 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Contributions/SName
2007-03-22 12:52:32,159 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Watchlist
2007-03-22 12:52:32,179 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Preferences
2007-03-22 12:52:32,198 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchanges
2007-03-22 12:52:32,322 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Talk:Main_Page
2007-03-22 12:52:32,323 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
2007-03-22 12:52:32,326 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/BCP
2007-03-22 12:52:32,339 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
2007-03-22 12:52:32,343 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Network_Engineering


I'm not sure what's wrong with my regular expressions.

Any help would be appreciated.


-- 
View this message in context: 
http://www.nabble.com/Need-Help-with-crawl-urlfilter.txt-tf3450339.html#a9623983
Sent from the Nutch - User mailing list archive at Nabble.com.

