Nutch STOP conditions

brainstorm Fri, 22 Aug 2008 04:35:00 -0700

There's one more thing I'd like to ask that I haven't found clearly
explained in either the FAQ or the mailing list:


Nutch STOP conditions, meaning: "how to stop a running Nutch crawl"

In other words, how to define a crawl with any of these limits:

1) "time limit": Crawl for Q hours, then stop.
2) "segments limit": After generating N segments, stop.
3) "space limit": After M megabytes of space are used on the DFS, stop.
4) "input urls limit": After crawling Z URLs from the original (seed)
input set, stop.
5) "depth limit": After reaching crawl depth X "far away" from the
original input URL list, stop.
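
To make the five limits concrete, here is a rough sketch of the kind of
check I have in mind. It is plain Python, not anything Nutch provides:
the names Limits, CrawlState and stop_reason are made up for
illustration, and I'm assuming some wrapper script could fill in the
counters (segments, space used, depth...) between crawl rounds.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    # Hypothetical settings; None means "no limit of this kind".
    max_hours: Optional[float] = None      # 1) time limit (Q)
    max_segments: Optional[int] = None     # 2) segments limit (N)
    max_megabytes: Optional[float] = None  # 3) space limit (M)
    max_seed_urls: Optional[int] = None    # 4) input urls limit (Z)
    max_depth: Optional[int] = None        # 5) depth limit (X)


@dataclass
class CrawlState:
    # Hypothetical counters a wrapper script would update after each round.
    started_at: float              # time.time() when the crawl began
    segments: int = 0
    megabytes_used: float = 0.0
    seed_urls_crawled: int = 0
    depth: int = 0


def stop_reason(state, limits, now=None):
    """Return the name of the first limit reached, or None to keep going."""
    now = time.time() if now is None else now
    hours = (now - state.started_at) / 3600.0
    checks = [
        ("time limit", limits.max_hours, hours),
        ("segments limit", limits.max_segments, state.segments),
        ("space limit", limits.max_megabytes, state.megabytes_used),
        ("input urls limit", limits.max_seed_urls, state.seed_urls_crawled),
        ("depth limit", limits.max_depth, state.depth),
    ]
    for name, limit, value in checks:
        # A limit only applies when it is set and has been reached.
        if limit is not None and value >= limit:
            return name
    return None
```

The idea would be to call stop_reason() after every
generate/fetch/update round and leave the loop as soon as it returns
something other than None.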

More questions/suggestions about "limits" are welcome ;)

I'll put the answer(s) on the Nutch wiki (FAQ section) if you don't
mind; I think it could clarify this point for lots of people on the
mailing list (me included! :-S).
