This isn't a user-agent problem. No matter what user agent you use, Nutch will still not crawl this page, because Nutch is correctly following the robots.txt directives that block access. Changing this would make the crawler impolite; a well-behaved crawler should follow robots.txt directives.

Dennis

reinhard schwab wrote:
Identify Nutch as a popular user agent such as Firefox.

Larsson85 wrote:
Any workaround for this? Making Nutch identify itself as something else, or something similar?


reinhard schwab wrote:
http://www.google.se/robots.txt

Google disallows it.

User-agent: *
Allow: /searchhistory/
Disallow: /search


Larsson85 wrote:
Why isn't Nutch able to handle links from Google?

I tried to start a crawl from the following url
http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N

And all I get is "no more URLs to fetch"

The reason I want to do this is that I thought maybe I could use Google to
generate my start list of URLs by injecting pages of search results.

Why won't this page be parsed and its links extracted so the crawl can start?
