You can check the response from Google by dumping the segment:

bin/nutch readseg -dump crawl/segments/... somedirectory
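
The dump ends up as plain text in the output directory, so something like
this should show what Google actually returned (assuming the default dump
file name, "dump"):

less somedirectory/dump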


reinhard schwab wrote:
> It seems that Google is blocking the user agent.
>
> I get this reply with lwp-request:
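> (the invocation was along these lines; the URL needs shell quoting
> because of the &'s)
>
> lwp-request 'http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N'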
>
> Your client does not have permission to get URL
> /search?q=site:se&hl=sv&start=100&sa=N from this server.
> (Client IP address: XX.XX.XX.XX)
>
> Please see Google's Terms of Service posted at
> http://www.google.com/terms_of_service.html
>
> If you set the user agent properties to a client such as Firefox,
> Google will serve your request (example configuration below).
>
> reinhard schwab wrote:
>   
>> http://www.google.se/robots.txt
>>
>> Google disallows it:
>>
>> User-agent: *
>> Allow: /searchhistory/
>> Disallow: /search
>>
>>
>> Larsson85 wrote:
>>
>>> Why isn't Nutch able to handle links from Google?
>>>
>>> I tried to start a crawl from the following URL:
>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>
>>> And all I get is "no more URLs to fetch"
>>>
>>> The reason I want to do this is that I thought maybe I could use
>>> Google to generate my start list of URLs by injecting pages of
>>> search results.
>>>
>>> Why won't this page be parsed and its links extracted so the crawl can start?
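
To follow up on the user agent suggestion quoted above: Nutch builds its
agent string from the http.agent.* properties, so this is set in
conf/nutch-site.xml. A minimal sketch (the property name is the standard
one; the value is just an example browser-like string):

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0</value>
  </property>
</configuration>

Note that Google's robots.txt and Terms of Service quoted above still
apply to such requests.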
