Hi,

While testing a crawl (with the crawl command) of a big 1-million-URL website, the HDD ran out of space.

So now I have:
a crawldb with 1,112,000 URLs (112,000 URLs were crawled in earlier tests)
segments with 40 GB of data
an index with partial data
/tmp/hadoop-root with 173 GB of temporary Hadoop data

After looking through the mailing lists, I understand that I can't resume the crawl, i.e. restart the failed Hadoop map-reduce jobs after adding some space and just let it continue.

If I can't, what should I do?

Delete /tmp/hadoop-root?
Delete URLs from the crawldb (but we don't know when the job failed, so we don't know which URLs to delete)?
And what should I do with the segments?

I assume the crawling strategy should be divided into small "chunks" (generate, fetch, update, linkdb, index; see the sketch below), but how do you determine the chunk size for whole-web crawling?
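
Something like this is what I mean by "chunks": a rough sketch based on the step-by-step commands from the Nutch tutorial. The paths (crawl/crawldb, crawl/segments, crawl/linkdb) and the -topN value are just examples, not what I actually ran:

  # one small crawl cycle, repeated until enough pages are fetched
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000
  SEGMENT=`ls -d crawl/segments/2* | tail -1`   # pick the newest segment
  bin/nutch fetch $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  # indexing could run after each cycle or after a batch of cycles;
  # the exact indexing command depends on the Nutch version

With a loop like this, a disk-full failure would only cost one small segment instead of the whole crawl, but I don't know how to pick a sensible -topN for whole-web crawling.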


Thanks for any help,
Bartosz

