AJ Chen wrote:
Two questions:
(1) Is there a better approach to deep-crawl large sites?

If a site has N pages that each require T seconds on average to fetch, then fetching the entire site will take roughly N*T seconds. If that's longer than you're willing to wait, then you won't be able to fetch the entire site. If you are willing to wait, then set http.max.delays to Integer.MAX_VALUE and wait. In this case there's no shortcut.
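
For example, assuming you override settings in the usual conf/nutch-site.xml file, something like the following sketch would keep the fetcher waiting on a busy host rather than dropping pages (2147483647 is Integer.MAX_VALUE):

    <property>
      <name>http.max.delays</name>
      <!-- illustrative value: effectively never give up waiting for the host -->
      <value>2147483647</value>
    </property>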

(2) Will the dropped urls be picked up again in subsequent cycles of fetchlist/segment/fetch/updatedb?

They will be retried in the next cycle, up to db.fetch.retry.max times.
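
If you want more retries before a page is given up on, that property can also be raised in conf/nutch-site.xml; the value below is just an illustration, not a recommended setting:

    <property>
      <name>db.fetch.retry.max</name>
      <!-- example: allow up to 5 fetch attempts before a page is dropped -->
      <value>5</value>
    </property>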

Doug
