Identify Nutch as a popular user agent, such as Firefox.
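
For reference, here is a minimal sketch of how the robots.txt rules quoted below evaluate against that search URL. It uses Python's standard urllib.robotparser purely as an illustration; Nutch has its own robots.txt handling, and the agent strings here are just example values I picked, not anything from a Nutch config.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in this thread from http://www.google.se/robots.txt
RULES = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

SEARCH_URL = "http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N"
HISTORY_URL = "http://www.google.se/searchhistory/"

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse the quoted text, no network access

# Example agent strings (assumptions for illustration only)
nutch_allowed = parser.can_fetch("Nutch", SEARCH_URL)
firefox_allowed = parser.can_fetch("Mozilla/5.0 Firefox", SEARCH_URL)
history_allowed = parser.can_fetch("Nutch", HISTORY_URL)

print("Nutch   /search:", nutch_allowed)
print("Firefox /search:", firefox_allowed)
print("Nutch   /searchhistory/:", history_allowed)
```

Note that the "User-agent: *" group applies to every agent name, so whether a different agent string changes anything depends on how the crawler treats robots.txt.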

Larsson85 wrote:
> Is there any workaround for this? Making Nutch identify as something else,
> or something similar?
>
>
> reinhard schwab wrote:
>   
>> http://www.google.se/robots.txt
>>
>> Google disallows it.
>>
>> User-agent: *
>> Allow: /searchhistory/
>> Disallow: /search
>>
>>
>> Larsson85 wrote:
>>     
>>> Why isn't Nutch able to handle links from Google?
>>>
>>> I tried to start a crawl from the following URL:
>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>
>>> And all I get is "no more URLs to fetch".
>>>
>>> The reason I want to do this is that I thought I might be able to use
>>> Google to generate my start list of URLs by injecting pages of search
>>> results.
>>>
>>> Why won't this page be parsed and its links extracted so the crawl can start?
>>>   
>>>       
>>
>>     
>
>   
