After getting this email, I tried commenting out this line in
regex-urlfilter.txt =
#-[...@=]
but it didn't help... I still get the same message - no URLs to fetch
regex-urlfilter.txt =
# skip URLs containing certain characters as probable queries, etc.
-[...@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
crawl-urlfilter.txt =
# skip URLs containing certain characters as probable queries, etc.
# we don't want to skip
#-[...@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^http://([a-z0-9]*\.)*fmforums.com/
# skip everything else
-.
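
One way to rule the filter file in or out is to run the seed URL through the rules by hand. The sketch below is only an approximation in Python, not Nutch code. It assumes the regex filter tries the rules top to bottom, lets the first rule whose pattern is found anywhere in the URL decide, and rejects a URL that no rule matches. The names load_rules and check are made up for this example, http://www.example.org/ is just an illustrative off-site URL, and Python's re module may differ from Java regexes in corner cases.

```python
import re


def load_rules(text):
    """Parse filter lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def check(url, rules):
    """Return (accepted, pattern of the deciding rule); first match wins."""
    for accept, regex in rules:
        if regex.search(url):
            return accept, regex.pattern
    return False, None  # assumption: a URL that matches no rule is rejected


# Active rules copied from the crawl-urlfilter.txt shown above
# (the commented-out character-class rule is left out).
CRAWL_URLFILTER = r"""
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^http://([a-z0-9]*\.)*fmforums.com/
# skip everything else
-.
"""

rules = load_rules(CRAWL_URLFILTER)
for url in ["http://www.fmforums.com/", "http://www.example.org/"]:
    print(url, check(url, rules))
```

Under these assumptions the seed URL http://www.fmforums.com/ is accepted by the +^http://... rule and the off-site URL is rejected by the final -. rule, so these three rules on their own would not explain the empty fetch list.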
arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:
What is in your regex-urlfilter.txt?
-----Original Message-----
From: joshua paul [mailto:jos...@neocodesoftware.com]
Sent: Wednesday, 21 April 2010 9:44 AM
To: nutch-user@lucene.apache.org
Subject: nutch says No URLs to fetch - check your seed list and URL
filters when trying to index fmforums.com
nutch says No URLs to fetch - check your seed list and URL filters when
trying to index fmforums.com.
I am using this command:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
- urls directory contains urls.txt which contains
http://www.fmforums.com/
- crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/
Note - my nutch setup indexes other sites fine.
For example I am using this command:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
- urls directory contains urls.txt which contains
http://dispatch.neocodesoftware.com
- crawl-urlfilter.txt contains
+^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/
And nutch generates a good crawl.
How can I troubleshoot why nutch says "No URLs to fetch"?