Did you check robots.txt?
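
In case it helps, here is a rough way to test a robots.txt against a crawler name offline, using Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt text and the agent name `MyNutchCrawler` are invented for illustration (they are not taken from fmforums.com), and `allowed` is a hypothetical helper. Python's parser may not behave exactly like Nutch's, so treat it as a first check. Paste in the site's real robots.txt and the agent name you set in `http.agent.name`.

```python
from urllib.robotparser import RobotFileParser

# Invented sample: blocks one named crawler completely and
# keeps every other agent out of /private/ only.
SAMPLE_ROBOTS_TXT = """\
User-agent: MyNutchCrawler
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    seed = "http://www.fmforums.com/"
    for agent in ("MyNutchCrawler", "SomeOtherBot"):
        print(agent, allowed(SAMPLE_ROBOTS_TXT, agent, seed))
```

If the seed URL comes back as not allowed for your agent name, the crawler would have nothing to fetch even when the URL filters are fine.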

On Wed, Apr 21, 2010 at 7:57 AM, joshua paul <jos...@neocodesoftware.com> wrote:

> after getting this email, I tried commenting out this line in
> regex-urlfilter.txt =
> #-[...@=]
> but it didn't help... I still get the same message - no URLs to fetch
> regex-urlfilter.txt =
> # skip URLs containing certain characters as probable queries, etc.
> -[...@=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # accept anything else
> +.
> crawl-urlfilter.txt =
> # skip URLs containing certain characters as probable queries, etc.
> # we don't want to skip
> #-[...@=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> +^http://([a-z0-9]*\.)*fmforums.com/
> # skip everything else
> -.
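
The active rules in the crawl-urlfilter.txt quoted above can also be checked against the seed offline. This is a rough Python sketch, not Nutch code. It assumes the filter reads rules top to bottom, lets the first matching rule decide, matches anywhere in the URL rather than requiring a full match, and rejects a URL that no rule matches. `parse_rules` and `accepts` are hypothetical names, and Python's `re` stands in for Java's regex engine (these particular patterns should behave the same in both). The commented-out character-class rule is left out.

```python
import re

# Active rules copied from the quoted crawl-urlfilter.txt.
RULES = r"""
# skip URLs with slash-delimited segment that repeats 3+ times
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^http://([a-z0-9]*\.)*fmforums.com/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches nothing is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://www.fmforums.com/", "http://example.com/"):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

If the seed is accepted under these assumptions, the filter lines themselves are probably not what is dropping it, and it is worth looking at other causes.
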
> arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:
>  What is in your regex-urlfilter.txt?
>>> -----Original Message-----
>>> From: joshua paul [mailto:jos...@neocodesoftware.com]
>>> Sent: Wednesday, 21 April 2010 9:44 AM
>>> To: nutch-user@lucene.apache.org
>>> Subject: nutch says No URLs to fetch - check your seed list and URL
>>> filters when trying to index fmforums.com
>>> nutch says No URLs to fetch - check your seed list and URL filters when
>>> trying to index fmforums.com.
>>> I am using this command:
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>> - urls directory contains urls.txt which contains
>>> http://www.fmforums.com/
>>> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/
>>> Note - my nutch setup indexes other sites fine.
>>> For example I am using this command:
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>> - urls directory contains urls.txt which contains
>>> http://dispatch.neocodesoftware.com
>>> - crawl-urlfilter.txt contains
>>> +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/
>>> And nutch generates a good crawl.
>>> How can I troubleshoot why nutch says "No URLs to fetch"?
> --
> catching falling stars...
> https://www.linkedin.com/in/joshuascottpaul
> MSN coga...@hotmail.com AOL neocodesoftware
> Yahoo joshuascottpaul Skype neocodesoftware
> Toll Free 1.888.748.0668 Fax 1-866-336-7246
> #238 - 425 Carrall St YVR BC V6B 6E3 CANADA
> www.neocodesoftware.com store.neocodesoftware.com
> www.monicapark.ca www.digitalpostercenter.com
