You can check Google's response by dumping the segment:

  bin/nutch readseg -dump crawl/segments/... somedirectory
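For example (the segment name below is hypothetical; use whatever directory your crawl actually created under crawl/segments):

  # dump one fetched segment to a directory for inspection
  bin/nutch readseg -dump crawl/segments/20080101123456 segdump

  # readseg writes a plain-text file named "dump" into the output
  # directory; Google's HTTP response should be visible in there
  less segdump/dump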
reinhard schwab wrote:
> it seems that google is blocking the user agent
>
> i get this reply with lwp-request:
>
>   Your client does not have permission to get URL
>   /search?q=site:se&hl=sv&start=100&sa=N from
>   this server. (Client IP address: XX.XX.XX.XX)
>
>   Please see Google's Terms of Service posted at
>   http://www.google.com/terms_of_service.html
>
> if you set the user agent properties to a client such as firefox,
> google will serve your request.
>
> reinhard schwab wrote:
>> http://www.google.se/robots.txt
>>
>> google disallows it:
>>
>>   User-agent: *
>>   Allow: /searchhistory/
>>   Disallow: /search
>>
>> Larsson85 wrote:
>>> Why isn't Nutch able to handle links from Google?
>>>
>>> I tried to start a crawl from the following URL:
>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>
>>> And all I get is "no more URLs to fetch".
>>>
>>> The reason I want to do this is that I thought maybe I could use
>>> Google to generate my start list of URLs by injecting pages of
>>> search results.
>>>
>>> Why won't this page be parsed and its links extracted so the crawl
>>> can start?
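For reference, the agent string Nutch sends is configured in conf/nutch-site.xml. A minimal sketch, using the standard http.agent.* property names from nutch-default.xml (the values below are placeholders):

  <!-- conf/nutch-site.xml -->
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>0.1</value>
  </property>

Note that even with a browser-like agent string, Nutch still honors robots.txt, and google.se disallows /search, so the page would remain unfetched; spoofing a browser to get around that would also run against the Terms of Service quoted above.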
