In vertical crawling there are always some large sites with tens of
thousands of pages. Fetching pages from these large sites very often
returns "retry later" because http.max.delays is exceeded. Setting
appropriate values for http.max.delays and fetcher.server.delay can
minimize this kind of url dropping. However, in my application I still
see 20-50% of urls dropped from a few large sites, even with a fairly
long delay setting: http.max.delays=20 and fetcher.server.delay=5.0,
effectively 100 seconds per host.
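
For reference, a minimal override with those values inside the root
element of conf/nutch-site.xml would look something like this (property
names as in nutch-default.xml; the comments are just my understanding
of what they mean):

  <property>
    <name>http.max.delays</name>
    <value>20</value>
    <!-- times a fetcher thread will wait for a busy host before
         giving up and dropping the url -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
    <!-- seconds between successive requests to the same host;
         20 waits x 5.0 s = 100 s per host before a drop -->
  </property>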
Two questions:
(1) Is there a better approach to deep-crawling large sites? Should we
treat large sites differently from smaller sites? I noticed that Doug
and Andrzej have discussed potential solutions to this problem, but
does anybody have a good short-term solution?
(2) Will the dropped urls be picked up again in subsequent cycles of
fetchlist/segment/fetch/updatedb? If so, running more cycles should
eventually fetch the dropped urls. Does db.default.fetch.interval
(default: 30 days) influence when the dropped urls will be fetched
again?
I'd appreciate your advice.
AJ