Did you check robots.txt?

On Wed, Apr 21, 2010 at 7:57 AM, joshua paul <jos...@neocodesoftware.com> wrote:
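One quick way to see what a robots.txt would do to the crawl is Python's standard-library robot parser. The robots.txt content below is purely hypothetical, just to show how a blanket `Disallow: /` would leave Nutch's fetcher with nothing to do:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration -- fetch the real one
# from http://www.fmforums.com/robots.txt and compare.
sample_robots = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# A blanket Disallow for all agents blocks every URL on the site,
# which would produce exactly the "No URLs to fetch" symptom.
print(parser.can_fetch("*", "http://www.fmforums.com/"))  # False
```

If the real robots.txt disallows the crawler's user agent, no amount of URL-filter tuning will help.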
> after getting this email, I tried commenting out this line in
> regex-urlfilter.txt:
>
> #-[...@=]
>
> but it didn't help... I still get the same message - no URLs to fetch.
>
> regex-urlfilter.txt:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> +.
>
> crawl-urlfilter.txt:
>
> # skip URLs containing certain characters as probable queries, etc.
> # we don't want to skip
> #-[...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> +^http://([a-z0-9]*\.)*fmforums.com/
>
> # skip everything else
> -.
>
> arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:
>
>> What is in your regex-urlfilter.txt?
>>
>>> -----Original Message-----
>>> From: joshua paul [mailto:jos...@neocodesoftware.com]
>>> Sent: Wednesday, 21 April 2010 9:44 AM
>>> To: nutch-user@lucene.apache.org
>>> Subject: nutch says No URLs to fetch - check your seed list and URL
>>> filters when trying to index fmforums.com
>>>
>>> nutch says No URLs to fetch - check your seed list and URL filters when
>>> trying to index fmforums.com.
>>>
>>> I am using this command:
>>>
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>
>>> - urls directory contains urls.txt, which contains
>>>   http://www.fmforums.com/
>>> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/
>>>
>>> Note - my nutch setup indexes other sites fine.
>>>
>>> For example, I am using this command:
>>>
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>
>>> - urls directory contains urls.txt, which contains
>>>   http://dispatch.neocodesoftware.com
>>> - crawl-urlfilter.txt contains
>>>   +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/
>>>
>>> And nutch generates a good crawl.
>>>
>>> How can I troubleshoot why nutch says "No URLs to fetch"?
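As a troubleshooting aid, the filter expressions quoted above can be sanity-checked outside Nutch with any regex engine. A minimal Python sketch (the example URLs are made up for illustration; Nutch applies these patterns through its urlfilter plugins, this just exercises the same expressions):

```python
import re

# The loop-breaking rule shared by both filter files: reject URLs where a
# slash-delimited segment repeats three or more times.
loop_filter = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")

# The accept rule from crawl-urlfilter.txt, copied verbatim.
accept_fmforums = re.compile(r"^http://([a-z0-9]*\.)*fmforums.com/")

# Example URLs are assumptions for illustration.
print(bool(loop_filter.search("http://example.com/a/b/a/b/a/b/")))   # True  -> would be rejected
print(bool(loop_filter.search("http://example.com/forum/topic/1")))  # False -> would be kept
print(bool(accept_fmforums.match("http://www.fmforums.com/")))       # True  -> seed passes the accept rule
```

Since the seed URL passes the accept rule here, the filters are probably not the culprit, which points back at robots.txt or the fetcher logs.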