You can also use commons-httpclient or HtmlUnit to access the Google search pages; these tools are not crawlers. With HtmlUnit it would be easy to get the outlinks. I strongly advise you not to misuse Google search with too many requests; I assume Google will block you otherwise.
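A rough, untested sketch with HtmlUnit could look like the following. The query URL, the &num=100 parameter and the filtering of google-internal links are only assumptions you would have to adapt:

import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Sketch only: fetch one Google result page with HtmlUnit and print the
// outlinks, e.g. to build a seed list for the nutch urls directory.
public class GoogleOutlinks {
  public static void main(String[] args) throws Exception {
    WebClient webClient = new WebClient();
    HtmlPage page = webClient.getPage(
        "http://www.google.se/search?q=site:se&num=100");
    List<HtmlAnchor> anchors = page.getAnchors();
    for (HtmlAnchor anchor : anchors) {
      String href = anchor.getHrefAttribute();
      // keep only external result links, skip google-internal navigation
      if (href.startsWith("http") && !href.contains("google.")) {
        System.out.println(href);
      }
    }
    webClient.closeAllWindows(); // older HtmlUnit API for cleanup
  }
}

You could then copy the printed urls into a file under your urls directory and inject them as usual.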
By using a search API you are allowed to request it 1000 times per day, if I remember correctly; it is mentioned in the terms of use or elsewhere in the documentation. Google returns a maximum of 1000 links per search query and at most 100 links on one result page. If you set the search parameter &num=100, you will get 100 links per result page. (A small commons-httpclient sketch of this paging follows below the quoted thread.)

Brian Ulicny wrote:
> 1. Save the results page.
> 2. Grep the links out of it.
> 3. Put the results in a doc in your urls directory.
> 4. Do: bin/nutch crawl urls ....
>
>
> On Fri, 17 Jul 2009 02:32 -0700, "Larsson85" <[email protected]>
> wrote:
>
>> I think I need more help on how to do this.
>>
>> I tried using
>> <property>
>>   <name>http.robots.agents</name>
>>   <value>Mozilla/5.0*</value>
>>   <description>The agent strings we'll look for in robots.txt files,
>>   comma-separated, in decreasing order of precedence. You should
>>   put the value of http.agent.name as the first agent name, and keep the
>>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>>   </description>
>> </property>
>>
>> If I don't have the star at the end I get the same as earlier, "No URLs to
>> fetch". And if I do, I get "0 records selected for fetching, exiting".
>>
>> reinhard schwab wrote:
>>
>>> Identify nutch as a popular user agent such as Firefox.
>>>
>>> Larsson85 wrote:
>>>
>>>> Any workaround for this? Making nutch identify as something else or
>>>> something similar?
>>>>
>>>> reinhard schwab wrote:
>>>>
>>>>> http://www.google.se/robots.txt
>>>>>
>>>>> Google disallows it:
>>>>>
>>>>> User-agent: *
>>>>> Allow: /searchhistory/
>>>>> Disallow: /search
>>>>>
>>>>> Larsson85 wrote:
>>>>>
>>>>>> Why isn't nutch able to handle links from Google?
>>>>>>
>>>>>> I tried to start a crawl from the following url:
>>>>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>>>>
>>>>>> All I get is "no more URLs to fetch".
>>>>>>
>>>>>> The reason I want to do this is that I thought maybe I could use Google
>>>>>> to generate my start list of urls by injecting pages of search results.
>>>>>>
>>>>>> Why won't this page be parsed and its links extracted so the crawl can
>>>>>> start?
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24534522.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
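PS: a rough, untested commons-httpclient 3.x sketch of the paging mentioned at the top. The query, the stop condition and the sleep interval are only assumptions; whatever you do, keep the request rate low:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

// Sketch only: walk through the result pages with &num=100 and &start=,
// stopping at the 1000-result cap mentioned above.
public class GooglePager {
  public static void main(String[] args) throws Exception {
    HttpClient client = new HttpClient();
    for (int start = 0; start < 1000; start += 100) {
      GetMethod get = new GetMethod(
          "http://www.google.se/search?q=site:se&num=100&start=" + start);
      try {
        int status = client.executeMethod(get);
        if (status != 200) {
          break; // blocked or no further pages
        }
        String html = get.getResponseBodyAsString();
        System.out.println("start=" + start + ": " + html.length() + " bytes");
      } finally {
        get.releaseConnection();
      }
      Thread.sleep(5000); // be gentle with the request rate
    }
  }
}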
