You are right.
robots.txt clearly disallows this page,
so this page will not be fetched.
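The check Nutch performs can be reproduced with Python's standard robots.txt parser; a minimal sketch, using the Allow/Disallow rules quoted later in this thread (the agent name "Nutch" is just an illustration, any agent falls under `User-agent: *` here):

```python
from urllib.robotparser import RobotFileParser

# Feed the parser the same rules google.se serves in its robots.txt
# (as quoted below in this thread).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
])

# The search URL from the original question is blocked for every agent,
# while /searchhistory/ stays allowed via the more specific Allow rule.
print(rp.can_fetch("Nutch", "http://www.google.se/search?q=site:se&hl=sv"))  # False
print(rp.can_fetch("Nutch", "http://www.google.se/searchhistory/"))          # True
```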

I remember Google has some APIs to access its search:
http://code.google.com/intl/de-DE/apis/soapsearch/index.html
http://code.google.com/intl/de-DE/apis/ajaxsearch/
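A sketch of how results from such an API could seed a crawl's URL list. The endpoint path, parameters, and JSON shape below are assumptions based on the AJAX Search API (since deprecated), and `sample_response` is illustrative data, not a real reply from Google:

```python
import json
from urllib.parse import urlencode

# Hypothetical request URL for the AJAX Search API linked above;
# the path and parameters are assumptions, not verified against the docs.
params = {"v": "1.0", "q": "site:se"}
request_url = ("http://ajax.googleapis.com/ajax/services/search/web?"
               + urlencode(params))

# Illustrative response body shaped like that API's JSON;
# not an actual reply.
sample_response = json.dumps({
    "responseData": {
        "results": [
            {"url": "http://www.example.se/"},
            {"url": "http://www.example.se/about"},
        ]
    }
})

# Extract the result URLs into a seed list for Nutch's inject step.
results = json.loads(sample_response)["responseData"]["results"]
seed_urls = [r["url"] for r in results]
print("\n".join(seed_urls))
```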

reinhard

Dennis Kubes wrote:
> This isn't a user agent problem.  No matter what user agent you use,
> Nutch is still not going to crawl this page because Nutch is correctly
> following robots.txt directives which block access.  To change this
> would be to make the crawler impolite.  A well behaved crawler should
> follow the robots.txt directives.
>
> Dennis
>
> reinhard schwab wrote:
>> Identify Nutch as a popular user agent such as Firefox.
>>
>>> Larsson85 wrote:
>>>> Any workaround for this? Making Nutch identify itself as something
>>>> else, or something similar?
>>>
>>>
>>> reinhard schwab wrote:
>>>  
>>>> http://www.google.se/robots.txt
>>>>
>>>> Google disallows it.
>>>>
>>>> User-agent: *
>>>> Allow: /searchhistory/
>>>> Disallow: /search
>>>>
>>>>
>>>> Larsson85 wrote:
>>>>    
>>>>> Why isn't Nutch able to handle links from Google?
>>>>>
>>>>> I tried to start a crawl from the following url
>>>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>>>
>>>>> And all I get is "no more URLs to fetch"
>>>>>
>>>>> The reason I want to do this is that I thought maybe I could use
>>>>> Google to generate my start list of URLs by injecting pages of
>>>>> search results.
>>>>>
>>>>> Why won't this page be parsed and its links extracted so the crawl
>>>>> can start?
>>>>>         
>>>>     
>>>   
>>
>
