bin/nutch crawl urls -dir crawl -depth 3 -topN 50

where the urls directory contains urls.txt, which contains:
http://www.fmforums.com/

where crawl-urlfilter.txt contains:
+^http://([a-z0-9]*\.)*fmforums.com/
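
Just to rule out the regex itself, here's the quick sanity check I ran (a minimal sketch with plain java.util.regex, which I believe is what Nutch's regex URL filter is built on; the pattern and seed are copied from above, the class name is just mine):

import java.util.regex.Pattern;

public class FilterCheck {
    public static void main(String[] args) {
        // pattern copied from crawl-urlfilter.txt, minus the leading '+'
        Pattern p = Pattern.compile("^http://([a-z0-9]*\\.)*fmforums.com/");
        // seed URL copied from urls.txt
        String seed = "http://www.fmforums.com/";
        // the filter only needs a match at the start of the URL, hence find()
        System.out.println(p.matcher(seed).find());  // prints: true
    }
}

That prints true, so the pattern itself looks fine to me.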

Note: my Nutch setup indexes other sites fine. For example,

where the urls directory contains urls.txt, which contains:
http://dispatch.neocodesoftware.com

where crawl-urlfilter.txt contains:
+^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/

this generates a good crawl.
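
One thing I noticed while comparing the two setups: the dispatch seed has no trailing slash, yet the filter pattern ends in '/'. A raw regex check (sketch below, same java.util.regex assumption as above) only passes once the slash is appended, so I'm assuming Nutch's injector normalizes host-only URLs by adding the trailing slash before filtering, which would explain why this case still works:

import java.util.regex.Pattern;

public class NormalizeCheck {
    public static void main(String[] args) {
        // pattern copied from crawl-urlfilter.txt, minus the leading '+'
        Pattern p = Pattern.compile(
                "^http://([a-z0-9]*\\.)*dispatch.neocodesoftware.com/");
        // raw seed from urls.txt - no trailing slash, so the raw regex fails
        System.out.println(
                p.matcher("http://dispatch.neocodesoftware.com").find());   // false
        // with the trailing slash appended (which I assume the URL
        // normalizer does for host-only URLs), the filter passes
        System.out.println(
                p.matcher("http://dispatch.neocodesoftware.com/").find());  // true
    }
}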

I know I have a known-good install.

So why does Nutch say "No URLs to fetch - check your seed list and URL
filters" when trying to index fmforums.com?

Also, fmforums.com/robots.txt looks OK:

###############################
#
# sample robots.txt file for this website 
#
# addresses all robots by using wild card *
User-agent: *
#
# list folders robots are not allowed to index
#Disallow: /tutorials/404redirect/
Disallow:
#
# list specific files robots are not allowed to index
#Disallow: /tutorials/custom_error_page.html
Disallow: 
#
# list the location of any sitemaps
Sitemap: http://www.yourdomain.com/site_index.xml
#
# End of robots.txt file
#
###############################
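
For anyone who wants to double-check what the crawler actually receives, rather than what's pasted above, this is the quick fetch I used (plain java.net, nothing Nutch-specific):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RobotsFetch {
    public static void main(String[] args) throws Exception {
        // print robots.txt exactly as the server returns it
        URL url = new URL("http://www.fmforums.com/robots.txt");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
        String line;
        while ((line = in.readLine()) != null)
            System.out.println(line);
        in.close();
    }
}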