In vertical crawling there are always some large sites with tens of
thousands of pages. Fetching pages from these large sites very often
returns "retry later" because http.max.delays is exceeded. Setting
appropriate values for http.max.delays and fetcher.server.delay can
minimize this kind of url dropping. However, in my application I still
see 20-50% of urls dropped from a few large sites, even with a fairly
long delay setting: http.max.delays=20 and fetcher.server.delay=5.0,
effectively 100 seconds per host.
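
For reference, a minimal override with those values inside the root
element of conf/nutch-site.xml would look something like this (property
names as in nutch-default.xml; the comments are just my understanding
of what they mean):

  <property>
    <name>http.max.delays</name>
    <value>20</value>
    <!-- times a fetcher thread will wait for a busy host before
         giving up and dropping the url -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
    <!-- seconds between successive requests to the same host;
         20 waits x 5.0 s = 100 s per host before a drop -->
  </property>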
Two questions:
(1) Is there a better approach to deep-crawling large sites? Should we
treat large sites differently from smaller sites? I noticed that Doug
and Andrzej have discussed potential solutions to this problem, but
does anybody have a good short-term solution?
(2) Will the dropped urls be picked up again in subsequent cycles of
fetchlist/segment/fetch/updatedb? If so, running more cycles should
eventually fetch the dropped urls. Does db.default.fetch.interval
(default: 30 days) influence when the dropped urls will be fetched
again?
I'd appreciate your advice.
AJ