You are right: robots.txt clearly disallows this page, so it will not be fetched.
I remember Google has some APIs to access its search:
http://code.google.com/intl/de-DE/apis/soapsearch/index.html
http://code.google.com/intl/de-DE/apis/ajaxsearch/

reinhard

Dennis Kubes wrote:
> This isn't a user agent problem. No matter what user agent you use,
> Nutch is still not going to crawl this page, because Nutch is correctly
> following the robots.txt directives which block access. To change this
> would make the crawler impolite. A well-behaved crawler should
> follow the robots.txt directives.
>
> Dennis
>
> reinhard schwab wrote:
>> Identify Nutch as a popular user agent such as Firefox.
>>
>> Larsson85 wrote:
>>> Any workaround for this? Making Nutch identify as something else or
>>> something similar?
>>>
>>> reinhard schwab wrote:
>>>
>>>> http://www.google.se/robots.txt
>>>>
>>>> Google disallows it:
>>>>
>>>> User-agent: *
>>>> Allow: /searchhistory/
>>>> Disallow: /search
>>>>
>>>> Larsson85 wrote:
>>>>
>>>>> Why isn't Nutch able to handle links from Google?
>>>>>
>>>>> I tried to start a crawl from the following URL:
>>>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>>>
>>>>> And all I get is "no more URLs to fetch".
>>>>>
>>>>> The reason I want to do this is that I thought maybe I could use
>>>>> Google to generate my start list of URLs by injecting pages of
>>>>> search results.
>>>>>
>>>>> Why won't this page be parsed and links extracted so the crawl can
>>>>> start?
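You can verify the behavior yourself without running Nutch. Here is a minimal sketch using Python's standard urllib.robotparser, feeding it the exact rules quoted above rather than fetching http://www.google.se/robots.txt (the live file may differ); it shows that the /search URL is blocked for any crawler honoring robots.txt, which is why Nutch refuses it:

```python
import urllib.robotparser

# Parse the robots.txt rules quoted in the thread directly,
# instead of fetching them from the live site.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
])

search_url = "http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N"

# /search is disallowed for every user agent, so a polite crawler
# (Nutch included) must skip it.
print(rp.can_fetch("*", search_url))                              # False

# The Allow rule only opens up /searchhistory/, nothing else.
print(rp.can_fetch("*", "http://www.google.se/searchhistory/"))   # True
```

Spoofing the user agent string would not change this result: the rules apply to `User-agent: *`, so any identity is blocked. The supported route to seed a crawl from search results is one of the search APIs linked above, not scraping the result pages.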
