Check their TOS. If you are trying to make a business specifically out of their data, then they will probably be hostile to it. They probably allow the big search engines to crawl because it gives them a quid pro quo in terms of referrals. If you are just trying to data-mine their site, they probably don't want it to happen.

Having said that, you might want to check their robots.txt and see if for some reason Nutch is hitting them too hard (i.e. isn't honoring their robots.txt). The other approach is to distribute the crawl among multiple machines and IP address ranges...
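If you end up throttling the fetch yourself, the main thing is to respect any Crawl-delay the site advertises in robots.txt. A rough sketch of that check in plain Java 11 (not Nutch's own robots handling; the host and user-agent strings are just placeholders):

// Fetch robots.txt and look for a Crawl-delay directive before batch-downloading.
// Very rough: a real parser would match each directive to its User-agent group;
// this just takes the most conservative delay it finds anywhere in the file.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        String host = "https://www.example.com";   // placeholder
        HttpClient client = HttpClient.newHttpClient();

        HttpRequest request = HttpRequest.newBuilder(URI.create(host + "/robots.txt"))
                .header("User-Agent", "my-fetcher (contact@example.com)")  // identify yourself
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        long delayMillis = 5000;                    // conservative default
        for (String line : response.body().split("\n")) {
            String trimmed = line.trim().toLowerCase();
            if (trimmed.startsWith("crawl-delay:")) {
                try {
                    double seconds = Double.parseDouble(
                            trimmed.substring("crawl-delay:".length()).trim());
                    delayMillis = Math.max(delayMillis, (long) (seconds * 1000));
                } catch (NumberFormatException ignored) {
                    // malformed directive, keep the default
                }
            }
        }
        System.out.println("Waiting " + delayMillis + " ms between requests");
    }
}

If I remember right, the knob for this on the Nutch side is fetcher.server.delay in nutch-site.xml, along with http.agent.name so the site can tell who is crawling.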

Winton


I am new to Nutch.

My goal is to extract content (local listings) from a certain website. I have
obtained the URLs of all the listings (only ~20K), and I also wrote a parser
to pull out the contents (like address and phone). All I need is to download
the URLs.

But when I used a download tool to batch-download the URLs, I very quickly
started to get 404 responses in the downloaded pages.

Is there a way I can do this in Nutch? What's the risk of being blocked
again? I just want the URLs: no crawl, no indexing, just a plain fetch,
leaving the pages intact.
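
A minimal sketch of that kind of plain fetch, assuming Java 11+ and a urls.txt file with one URL per line (both of which are assumptions, not something from the thread):

// Read a list of URLs, fetch each one with a fixed delay between requests,
// and write the bodies to disk unchanged. File names, the delay, and the
// user-agent string are placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class PlainFetch {
    public static void main(String[] args) throws Exception {
        List<String> urls = Files.readAllLines(Path.of("urls.txt"));   // one URL per line
        Files.createDirectories(Path.of("pages"));
        HttpClient client = HttpClient.newHttpClient();

        int i = 0;
        for (String url : urls) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("User-Agent", "my-fetcher (contact@example.com)")
                    .build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            if (response.statusCode() == 200) {
                Files.write(Path.of("pages", "page-" + i + ".html"), response.body());
            } else {
                // A burst of 404s on URLs you know exist usually means the site is
                // throttling or blocking you: back off rather than retrying hard.
                System.err.println(response.statusCode() + " for " + url);
            }
            i++;
            Thread.sleep(5000);   // several seconds between requests; match robots.txt
        }
    }
}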
