You can limit the size of each segment with the crawler's -topN
option, which caps the number of URLs generated per segment.
You will then have to run multiple crawl cycles to fetch all 40k URLs.
Note that if your documents produce new outlinks, those outlinks are
added to the crawldb after each cycle. The order in which URLs are
fetched is determined by the scoring plugin(s).
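
As a rough sketch of what that means in practice, the Python below estimates the minimum number of cycles for a given -topN and lists the commands for one generate/fetch/updatedb cycle. The paths crawl/crawldb and crawl/segments, the value 5000 and the helper names are made up for illustration. The command forms assume the usual step-by-step bin/nutch tools, so check the usage output of your Nutch version before relying on them.

```python
import math


def cycles_needed(total_urls: int, top_n: int) -> int:
    """Lower bound on crawl cycles when each segment holds at most top_n URLs."""
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    return math.ceil(total_urls / top_n)


def cycle_commands(crawldb: str, segments_dir: str, top_n: int,
                   segment: str = "<new-segment>") -> list[list[str]]:
    """Commands for one crawl cycle (assumed bin/nutch command forms).

    `segment` is a placeholder: generate creates a new directory under
    segments_dir, and that directory is what fetch and updatedb operate on.
    """
    return [
        ["bin/nutch", "generate", crawldb, segments_dir, "-topN", str(top_n)],
        ["bin/nutch", "fetch", segment],
        ["bin/nutch", "updatedb", crawldb, segment],
    ]


if __name__ == "__main__":
    # Hypothetical values: 40k URLs, 5000 URLs per segment.
    print("cycles needed (at least):", cycles_needed(40_000, 5_000))
    for cmd in cycle_commands("crawl/crawldb", "crawl/segments", 5_000):
        print(" ".join(cmd))
```

The cycle count is only a lower bound: newly discovered outlinks enter the crawldb after each updatedb, so the real number of cycles is usually higher. Also keep in mind that -topN bounds the URL count, not the bytes on disk, so the segment size still depends on how large the fetched documents are.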
Btw, if you use a local filesystem you might be able to recover some of
the fetched data, see:
http://issues.apache.org/jira/browse/NUTCH-451
Mathijs
Harmesh, V2solutions wrote:
hi all,
I ran a crawl of approximately 40,000 URLs. It stopped partway through,
giving a "no disk available" error. Is there any way to restrict the size
of segments so that only a few MB go into a particular segment?
thanks in advance.