Check their TOS. If you are trying to make a business specifically out of their data, then they will probably be hostile to it. They probably allow the big search engines to crawl because it gives them a quid pro quo in terms of referrals. If you are just trying to data-mine their site, they probably don't want it to happen.

Having said that, you might want to check their robots.txt and see if for some reason Nutch is hitting them too hard (i.e. isn't honoring their robots.txt). The other approach is to distribute the crawl among multiple machines and IP address ranges...
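If you end up throttling the fetch yourself, the main thing is to respect any Crawl-delay the site advertises in robots.txt. A rough sketch of that check in plain Java 11 (not Nutch's own robots handling; the host and user-agent strings are just placeholders):

// Fetch robots.txt and look for a Crawl-delay directive before batch-downloading.
// Very rough: a real parser would match each directive to its User-agent group;
// this just takes the most conservative delay it finds anywhere in the file.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        String host = "https://www.example.com";   // placeholder
        HttpClient client = HttpClient.newHttpClient();

        HttpRequest request = HttpRequest.newBuilder(URI.create(host + "/robots.txt"))
                .header("User-Agent", "my-fetcher (contact@example.com)")  // identify yourself
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        long delayMillis = 5000;                    // conservative default
        for (String line : response.body().split("\n")) {
            String trimmed = line.trim().toLowerCase();
            if (trimmed.startsWith("crawl-delay:")) {
                try {
                    double seconds = Double.parseDouble(
                            trimmed.substring("crawl-delay:".length()).trim());
                    delayMillis = Math.max(delayMillis, (long) (seconds * 1000));
                } catch (NumberFormatException ignored) {
                    // malformed directive, keep the default
                }
            }
        }
        System.out.println("Waiting " + delayMillis + " ms between requests");
    }
}

If I remember right, the knob for this on the Nutch side is fetcher.server.delay in nutch-site.xml, along with http.agent.name so the site can tell who is crawling.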

Winton


I am new to Nutch.

My goal is to extract content (local listings) from a certain website. I have
obtained the URLs of all the listings (only ~20K), and I also wrote a parser
to pull out the contents (like address and phone). All I need is to download
the URLs.

But when I used a download tool to batch-download the URLs, I very quickly
started to get 404 responses in the downloaded pages.

Is there a way I can do this in Nutch? What's the risk of being blocked
again? I just want the URLs: no crawl, no indexing, just a plain fetch,
leaving the pages intact.
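
A minimal sketch of that kind of plain fetch, assuming Java 11+ and a urls.txt file with one URL per line (both of which are assumptions, not something from the thread):

// Read a list of URLs, fetch each one with a fixed delay between requests,
// and write the bodies to disk unchanged. File names, the delay, and the
// user-agent string are placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class PlainFetch {
    public static void main(String[] args) throws Exception {
        List<String> urls = Files.readAllLines(Path.of("urls.txt"));   // one URL per line
        Files.createDirectories(Path.of("pages"));
        HttpClient client = HttpClient.newHttpClient();

        int i = 0;
        for (String url : urls) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("User-Agent", "my-fetcher (contact@example.com)")
                    .build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            if (response.statusCode() == 200) {
                Files.write(Path.of("pages", "page-" + i + ".html"), response.body());
            } else {
                // A burst of 404s on URLs you know exist usually means the site is
                // throttling or blocking you: back off rather than retrying hard.
                System.err.println(response.statusCode() + " for " + url);
            }
            i++;
            Thread.sleep(5000);   // several seconds between requests; match robots.txt
        }
    }
}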
