You can also use commons-httpclient or htmlunit to access Google's search.
These tools are not crawlers. With htmlunit it would be easy to get the
outlinks.
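For example, something like this with htmlunit should print the outlinks of
one result page. It is only a rough, untested sketch; the class name and the
query site:se are just examples:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class GoogleOutlinks {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // javascript is not needed just to read the result page
        webClient.setJavaScriptEnabled(false);
        HtmlPage page = webClient.getPage(
                "http://www.google.se/search?q=site:se&num=100");
        // print every <a href> on the result page,
        // including google's own navigation links
        for (HtmlAnchor anchor : page.getAnchors()) {
            System.out.println(anchor.getHrefAttribute());
        }
        webClient.closeAllWindows();
    }
}

You could redirect that output into a file in your urls directory and inject
it, but you would probably want to filter out the google-internal links first.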
I strongly advise you not to misuse Google search with too many requests;
Google will block you, I assume.

By using a search API, you are allowed to request it 1000 times per day,
if I remember correctly;
it is mentioned in the terms of use or elsewhere in the documentation.

Google returns a maximum of 1000 links for a search and
a maximum of 100 links on one result page.
If you set this search parameter,
&num=100
you will get 100 links per result page.
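If you want all 1000 links, you also have to page through the results with
the start parameter. A small sketch (again untested; the class name and the
query are just placeholders) that prints the ten result page URLs:

import java.net.URLEncoder;

public class SearchPageUrls {
    public static void main(String[] args) throws Exception {
        String query = "site:se";
        // at most 1000 results, 100 per page -> start = 0, 100, ..., 900
        for (int start = 0; start < 1000; start += 100) {
            System.out.println("http://www.google.se/search?q="
                    + URLEncoder.encode(query, "UTF-8")
                    + "&num=100&start=" + start);
        }
    }
}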


Brian Ulicny wrote:
> 1. Save the results page.
> 2. Grep the links out of it.
> 3. Put the resulting links in a file in your urls directory
> 4. Do: bin/nutch crawl urls ....
>
>
> On Fri, 17 Jul 2009 02:32 -0700, "Larsson85" <[email protected]>
> wrote:
>   
>> I think I need more help on how to do this.
>>
>> I tried using
>> <property>
>>   <name>http.robots.agents</name>
>>   <value>Mozilla/5.0*</value>
>>   <description>The agent strings we'll look for in robots.txt files,
>>   comma-separated, in decreasing order of precedence. You should
>>   put the value of http.agent.name as the first agent name, and keep the
>>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>>   </description>
>> </property>
>>
>> If I don't have the star at the end, I get the same as earlier, "No URLs to
>> fetch". And if I do, I get "0 records selected for fetching, exiting".
>>
>>
>>
>> reinhard schwab wrote:
>>     
>>> Identify nutch as a popular user agent such as Firefox.
>>>
>>> Larsson85 wrote:
>>>       
>>>> Any workaround for this? Making nutch identify itself as something else
>>>> or something similar?
>>>>
>>>>
>>>> reinhard schwab wrote:
>>>>   
>>>>         
>>>>> http://www.google.se/robots.txt
>>>>>
>>>>> Google disallows it.
>>>>>
>>>>> User-agent: *
>>>>> Allow: /searchhistory/
>>>>> Disallow: /search
>>>>>
>>>>>
>>>>> Larsson85 wrote:
>>>>>     
>>>>>           
>>>>>> Why isn't nutch able to handle links from Google?
>>>>>>
>>>>>> I tried to start a crawl from the following url
>>>>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>>>>
>>>>>> And all I get is "no more URLs to fetch"
>>>>>>
>>>>>> The reason I want to do this is that I thought maybe I could use Google
>>>>>> to generate my start list of URLs by injecting pages of search results.
>>>>>>
>>>>>> Why won't this page be parsed and the links extracted so the crawl can
>>>>>> start?
>> -- 
>> View this message in context:
>> http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24534522.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>     
