On Fri, Jul 17, 2009 at 15:23, Larsson85 <[email protected]> wrote:
>
> Any workaround for this? Making nutch identify as something else or
> something similar?
>
Also note that, by default, Nutch does not crawl any URL containing '?' or '&'. Check crawl-urlfilter.txt or regex-urlfilter.txt (depending on whether you use the crawl command or the inject/generate/fetch/parse etc. commands).

> reinhard schwab wrote:
>>
>> http://www.google.se/robots.txt
>>
>> google disallows it.
>>
>> User-agent: *
>> Allow: /searchhistory/
>> Disallow: /search
>>
>> Larsson85 wrote:
>>> Why isn't nutch able to handle links from google?
>>>
>>> I tried to start a crawl from the following url
>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>
>>> And all I get is "no more URLs to fetch"
>>>
>>> The reason I want to do this is that I thought maybe I could use
>>> google to generate my start list of urls by injecting pages of
>>> search results.
>>>
>>> Why won't this page be parsed and links extracted so the crawl can
>>> start?
>>>

--
View this message in context:
http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24533426.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
Doğacan Güney
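For reference, the rule that skips such URLs in the stock filter files looks roughly like the following (the exact character class and comments may differ between Nutch versions, so check your own conf/ directory):

```
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```

If you really want Nutch to fetch query-style URLs like the Google search link above, comment out (or narrow) that line in whichever filter file your command path actually uses, and make sure the file still ends with an accept rule such as `+.`.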

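To see why the robots.txt quoted above blocks the injected URL, you can replay its rules with Python's standard `urllib.robotparser`. This is a standalone sketch for checking rules offline, not anything Nutch itself runs; Nutch does its own robots.txt handling internally.

```python
from urllib import robotparser

# The relevant rules from http://www.google.se/robots.txt,
# as quoted earlier in this thread.
rules = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The injected search URL falls under "Disallow: /search".
print(rp.can_fetch("*", "http://www.google.se/search?q=site:se&hl=sv"))  # False

# The explicit Allow rule still permits /searchhistory/.
print(rp.can_fetch("*", "http://www.google.se/searchhistory/"))  # True
```

A polite crawler has to obey these rules, so even with the URL filters relaxed, Nutch would still refuse to fetch the /search URL.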