I think I need more help on how to do this. I tried using:

  <property>
    <name>http.robots.agents</name>
    <value>Mozilla/5.0*</value>
    <description>The agent strings we'll look for in robots.txt files,
    comma-separated, in decreasing order of precedence. You should put the
    value of http.agent.name as the first agent name, and keep the
    default * at the end of the list. E.g.: BlurflDev,Blurfl,*
    </description>
  </property>
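For what it's worth, here is a minimal nutch-site.xml sketch in which http.agent.name and http.robots.agents agree, as the description above asks for. The agent name "MyCrawler" is a placeholder, not a value from this thread, and I believe the values are plain agent names rather than glob patterns like "Mozilla/5.0*":

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical crawler name; replace with your own. -->
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
  <!-- The value of http.agent.name goes first, the default * last. -->
  <property>
    <name>http.robots.agents</name>
    <value>MyCrawler,*</value>
  </property>
</configuration>
```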
If I don't have the star at the end I get the same as before, "No URLs to fetch", and if I do, I get "0 records selected for fetching, exiting".

reinhard schwab wrote:
> identify nutch as a popular user agent such as firefox.
>
> Larsson85 schrieb:
>> Any workaround for this? Making nutch identify as something else or
>> something similar?
>>
>> reinhard schwab wrote:
>>> http://www.google.se/robots.txt
>>>
>>> google disallows it.
>>>
>>> User-agent: *
>>> Allow: /searchhistory/
>>> Disallow: /search
>>>
>>> Larsson85 schrieb:
>>>> Why isn't nutch able to handle links from google?
>>>>
>>>> I tried to start a crawl from the following url
>>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>>
>>>> And all I get is "no more URLs to fetch".
>>>>
>>>> The reason I want to do this is that I thought I could maybe use
>>>> google to generate my start list of urls by injecting pages of
>>>> search results.
>>>>
>>>> Why won't this page be parsed and links extracted so the crawl can
>>>> start?

--
View this message in context: http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24534522.html
Sent from the Nutch - User mailing list archive at Nabble.com.
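Note that no choice of agent string should get you past google's robots.txt here: the rules quoted above disallow /search for every user agent ("User-agent: *"), so a robots-respecting crawler will refuse that URL no matter what it calls itself. A quick check with Python's stdlib robots.txt parser, using the rules copied from the thread (nothing Nutch-specific is assumed):

```python
from urllib.robotparser import RobotFileParser

# The rules quoted earlier in this thread from http://www.google.se/robots.txt
rules = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# /search is disallowed for every agent, so the identity string is irrelevant.
print(rp.can_fetch("Mozilla/5.0", "http://www.google.se/search?q=site:se"))
# -> False

# /searchhistory/ is explicitly allowed.
print(rp.can_fetch("Mozilla/5.0", "http://www.google.se/searchhistory/"))
# -> True
```

So the "0 records selected for fetching" result is consistent: the fetch is blocked by the robots rules themselves, not by how the agent string is spelled.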
