It seems that Google is blocking the user agent. I get this reply with lwp-request:

Your client does not have permission to get URL /search?q=site:se&hl=sv&start=100&sa=N from this server. (Client IP address: XX.XX.XX.XX)

Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html

If you set the user agent properties to those of a client such as Firefox, Google will serve your request.

reinhard schwab wrote:
> http://www.google.se/robots.txt
>
> Google disallows it.
>
> User-agent: *
> Allow: /searchhistory/
> Disallow: /search
>
>
> Larsson85 wrote:
>
>> Why isn't Nutch able to handle links from Google?
>>
>> I tried to start a crawl from the following URL
>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>
>> And all I get is "no more URLs to fetch"
>>
>> The reason I want to do this is that I had a thought that maybe I
>> could use Google to generate my start list of URLs by injecting pages of
>> search results.
>>
>> Why won't this page be parsed and links extracted so the crawl can start?
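
To make the two points in this thread concrete, here is a minimal sketch in Python (not the Perl lwp-request tool used above, and not Nutch's own code). It evaluates the robots.txt rules quoted in this thread against the search URL, and it shows that the User-Agent header is simply a value the client sets on its request. The helper names `is_allowed` and `build_request` and the agent string `example-agent/0.1` are hypothetical, chosen only for illustration. Nothing is sent over the network.

```python
from urllib import request, robotparser

# Rules copied from the robots.txt quoted in this thread.
ROBOTS_TXT = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

SEARCH_URL = "http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N"


def is_allowed(url, agent="*", robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = robotparser.RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def build_request(url, user_agent):
    """Build (but do not send) a request with an explicit User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Does the quoted robots.txt permit fetching the search URL?
    print(is_allowed(SEARCH_URL))
    # The header is whatever the client chooses to put there.
    req = build_request(SEARCH_URL, "example-agent/0.1")
    print(req.get_header("User-agent"))
```

Changing the header only changes how the client identifies itself. The `User-agent: *` rule applies to every agent name, so a crawler that honors robots.txt would still skip `/search` regardless of the agent string it sends.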
