This isn't a user-agent problem. No matter what user agent you use, Nutch will still not crawl this page, because Nutch is correctly following the robots.txt directives that block access. Changing this would make the crawler impolite; a well-behaved crawler should follow robots.txt directives.

Dennis

reinhard schwab wrote:
Identify Nutch as a popular user agent such as Firefox.

Larsson85 wrote:
Any workaround for this? Making Nutch identify itself as something else, or something similar?


reinhard schwab wrote:
http://www.google.se/robots.txt

Google disallows it.

User-agent: *
Allow: /searchhistory/
Disallow: /search


Larsson85 wrote:
Why isn't Nutch able to handle links from Google?

I tried to start a crawl from the following url
http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N

And all I get is "no more URLs to fetch"

The reason I want to do this is that I thought maybe I could use Google to
generate my start list of URLs by injecting pages of search results.

Why won't this page be parsed and its links extracted so the crawl can start?
