2009/7/17 Doğacan Güney <[email protected]>:
> On Fri, Jul 17, 2009 at 15:23, Larsson85 <[email protected]> wrote:
>>
>> Any workaround for this? Making nutch identify as something else or
>> something similar?
>>
>
> Also note that nutch does not crawl anything with '?' or '&' in the URL.
> Check out

Oops. I mean nutch does not crawl any such URL *by default*.

> crawl-urlfilter.txt or regex-urlfilter.txt (depending on whether you use
> the crawl command or the inject/generate/fetch/parse etc. commands).
>
>>
>> reinhard schwab wrote:
>>>
>>> http://www.google.se/robots.txt
>>>
>>> google disallows it.
>>>
>>> User-agent: *
>>> Allow: /searchhistory/
>>> Disallow: /search
>>>
>>>
>>> Larsson85 schrieb:
>>>> Why isn't nutch able to handle links from google?
>>>>
>>>> I tried to start a crawl from the following url
>>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>>
>>>> And all I get is "no more URLs to fetch".
>>>>
>>>> The reason I want to do this is that I thought maybe I could use google
>>>> to generate my start list of urls by injecting pages of search results.
>>>>
>>>> Why won't this page be parsed and links extracted so the crawl can start?
>>>>
>>>
>>
>
> --
> Doğacan Güney

--
Doğacan Güney
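For reference, the rule that skips such query URLs by default lives in
conf/crawl-urlfilter.txt (used by the crawl command) or
conf/regex-urlfilter.txt (used by inject/generate/fetch/parse). A sketch of
the relevant lines, assuming the stock configuration shipped with Nutch at
the time (exact contents may differ between versions):

  # skip URLs containing certain characters as probable queries, etc.
  # comment this rule out (or narrow it) if URLs with '?', '&' or '='
  # such as http://www.google.se/search?q=... should pass the filter
  -[?*!@=]

  # regex-urlfilter.txt typically ends by accepting everything else;
  # crawl-urlfilter.txt instead ends with "-." and accepts only the
  # MY.DOMAIN.NAME hosts listed above it
  +.

Note that even with the -[?*!@=] rule removed, the URL is still subject to
robots.txt, so http://www.google.se/search?... stays blocked by google's
Disallow: /search.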
