Hi,

While testing a crawl (with the crawl command) of a big 1-million-URL website, the HDD ran out of space.

So now I have:
a crawldb with 1,112,000 URLs (112,000 URLs were crawled in earlier tests)
segments with 40 GB of data
an index with partial data
/tmp/hadoop-root with 173 GB of temporary Hadoop data

After looking through the mailing lists, I understand that I can't resume the crawl, i.e. restart the failed Hadoop map-reduce jobs after adding some space and just let it continue.

If I can't, what should I do?

Delete /tmp/hadoop-root?
Delete URLs from the crawldb (but we don't know when the job failed, so we don't know which URLs to delete)?
And what should I do with the segments?

I assume the crawling strategy should be divided into small "chunks" (generate, fetch, update, linkdb, index; see the sketch below), but how do you determine the chunk size for whole-web crawling?
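
Something like this is what I mean by "chunks": a rough sketch based on the step-by-step commands from the Nutch tutorial. The paths (crawl/crawldb, crawl/segments, crawl/linkdb) and the -topN value are just examples, not what I actually ran:

  # one small crawl cycle, repeated until enough pages are fetched
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000
  SEGMENT=`ls -d crawl/segments/2* | tail -1`   # pick the newest segment
  bin/nutch fetch $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  # indexing could run after each cycle or after a batch of cycles;
  # the exact indexing command depends on the Nutch version

With a loop like this, a disk-full failure would only cost one small segment instead of the whole crawl, but I don't know how to pick a sensible -topN for whole-web crawling.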


Thanks for any help,
Bartosz

