Hi,
During a test crawl (with the crawl command) of a big site with about 1 million URLs, I ran out of HDD space.
So now I have:
- a crawldb with 1,112,000 URLs (112,000 URLs had been tested before)
- segments with 40 GB of data
- an index with partial data
- /tmp/hadoop-root with 173 GB of temporary Hadoop data
After looking through the mailing lists, I understand that I can't resume the crawl (the failed Hadoop map-reduce jobs) and simply let it continue after adding some disk space.
If that is indeed impossible, what should I do?
- delete /tmp/hadoop-root?
- delete URLs from the crawldb (but I don't know at which point the job failed, so I don't know which URLs to delete)?
- and what should I do with the segments? I was thinking of checking each one for completeness, as sketched below.
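For example (the paths are just from my setup; if I understand the segment layout correctly, a segment that went through fetch and parse should contain content, crawl_fetch, crawl_generate, crawl_parse, parse_data and parse_text, while one that was only generated has just crawl_generate):

    # list the contents of every segment to spot the incomplete ones
    for seg in crawl/segments/*; do
        echo "$seg:"
        ls "$seg"
    done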
I assume that the crawling strategy should be divided into small "chunks" (generate, fetch, updatedb, linkdb, index), but how should this be organized for whole-web crawling? Is something like the loop sketched below the right approach?
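This is roughly what I had in mind, repeated for each chunk (the paths and the -topN value are only examples from a tutorial-style setup, so please correct me if the sequence is wrong):

    # generate a fetch list of at most 50,000 URLs into a new segment
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
    # pick up the segment that was just created
    segment=`ls -d crawl/segments/* | tail -1`
    # fetch it, update the crawldb and the linkdb, then index it
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment
    bin/nutch invertlinks crawl/linkdb $segment
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $segment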
Thanks for any help,
Bartosz