I think I need more help on how to do this.

I tried using
<property>
  <name>http.robots.agents</name>
  <value>Mozilla/5.0*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
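For comparison, here is a minimal sketch of how the two agent properties are usually paired in conf/nutch-site.xml, per the property's own description (the agent name "MyTestCrawler" is just a placeholder; the crawler's own name comes first in http.robots.agents, with the default * last):

```xml
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyTestCrawler,*</value>
</property>
```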

If I don't have the star at the end I get the same as earlier, "No URLs to
fetch". And if I do, I get "0 records selected for fetching, exiting".
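Changing the agent string will not help here: Google's rules (quoted further down in this thread) disallow /search for every user agent, so any robots.txt-compliant crawler refuses the URL. A quick sketch with Python's standard-library robotparser confirms it ("AnyBot" is an arbitrary placeholder agent name):

```python
# Check Google's quoted robots.txt rules against the search URL
# using Python's standard-library robots.txt parser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
])

# /search is disallowed for every agent, so a polite crawler
# (which is what Nutch is) will not fetch it.
print(rp.can_fetch("AnyBot", "http://www.google.se/search?q=site:se"))  # False
print(rp.can_fetch("AnyBot", "http://www.google.se/searchhistory/"))    # True
```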



reinhard schwab wrote:
> 
> Identify Nutch as a popular user agent, such as Firefox.
> 
> Larsson85 wrote:
>> Any workaround for this? Making Nutch identify itself as something
>> else or something similar?
>>
>>
>> reinhard schwab wrote:
>>   
>>> http://www.google.se/robots.txt
>>>
>>> google disallows it.
>>>
>>> User-agent: *
>>> Allow: /searchhistory/
>>> Disallow: /search
>>>
>>>
>>> Larsson85 wrote:
>>>     
>>>> Why isn't Nutch able to handle links from Google?
>>>>
>>>> I tried to start a crawl from the following url
>>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>>
>>>> And all I get is "no more URLs to fetch"
>>>>
>>>> The reason I want to do this is that I thought maybe I could use
>>>> Google to generate my start list of URLs by injecting pages of
>>>> search results.
>>>>
>>>> Why won't this page be parsed and links extracted so the crawl can
>>>> start?
>>>>   
>>>>       
>>>
>>>     
>>
>>   
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24534522.html
Sent from the Nutch - User mailing list archive at Nabble.com.
