In vertical crawling there are always some large sites with tens of thousands of pages. Fetching a page from these large sites very often returns "retry later" because http.max.delays is exceeded. Setting appropriate values for http.max.delays and fetcher.server.delay can minimize this kind of URL dropping. However, in my application I still see 20-50% of URLs dropped from a few large sites, even with fairly long delay settings: http.max.delays=20 and fetcher.server.delay=5.0, effectively 100 seconds per host (config excerpt below).
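
For reference, here is roughly what the relevant part of my nutch-site.xml looks like (a sketch using the property names and values mentioned above; the inline comments are my own summary of their effect):

    <!-- overrides for nutch-default.xml -->
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>
      <!-- seconds a fetcher thread waits between requests to the same host -->
    </property>
    <property>
      <name>http.max.delays</name>
      <value>20</value>
      <!-- number of times a thread will wait (fetcher.server.delay each) for a busy
           host before giving up with "retry later"; 20 x 5.0 = 100 sec per host -->
    </property>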

Two questions:
(1) Is there a better approach to deep-crawling large sites? Should we treat large sites differently from smaller ones? I notice Doug and Andrzej have discussed potential solutions to this problem, but does anybody have a good short-term solution?

(2) Will the dropped URLs be picked up again in subsequent fetchlist/segment/fetch/updatedb cycles? If so, running more cycles should eventually fetch them. Does db.default.fetch.interval (default 30 days) influence when the dropped URLs will be fetched again?

Appreciate your advice.
AJ
